
Models for music analysis from a Markov logic networks perspective
H Papadopoulos, G Tzanetakis

To cite this version:

H Papadopoulos, G Tzanetakis. Models for music analysis from a Markov logic networks perspective. IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2017, 25 (1), pp. 19-34. 10.1109/TASLP.2016.2614351. hal-01742729

HAL Id: hal-01742729
https://hal.archives-ouvertes.fr/hal-01742729
Submitted on 13 Jun 2018

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
MANUSCRIPT T-ASL-05679-2016.R1 1

Models for Music Analysis from a Markov Logic Networks Perspective
H. Papadopoulos and G. Tzanetakis

Abstract—Analyzing and formalizing the intricate mechanisms of music is a very challenging goal for Artificial Intelligence. Dealing with real audio recordings requires the ability to handle both uncertainty and complex relational structure at multiple levels of representation. Until now, these two aspects have been generally treated separately, probability being the standard way to represent uncertainty in knowledge, while logical representation being the standard way to represent knowledge and complex relational information. Several approaches attempting a unification of logic and probability have recently been proposed. In particular, Markov Logic Networks (MLNs), which combine first-order logic and probabilistic graphical models, have attracted increasing attention in recent years in many domains.

This paper introduces MLNs as a highly flexible and expressive formalism for the analysis of music that encompasses most of the commonly used probabilistic and logic-based models. We first review and discuss existing approaches for music analysis. We then introduce MLNs in the context of music signal processing by providing a deep understanding of how they specifically relate to traditional models, specifically Hidden Markov Models and Conditional Random Fields. We then present a detailed application of MLNs for tonal harmony music analysis that illustrates the potential of this framework for music processing.

Index Terms—Statistical Relational Learning, Markov Logic Networks, Hidden Markov Models, Conditional Random Fields, Music Information Retrieval, Tonal Harmony, Chord, Key, Musical Structure

I. INTRODUCTION

THE fascinating task of understanding how human beings create and listen to music has attracted attention throughout history. Nowadays, many research fields have converged to the particular goal of analyzing and formalizing the complex mechanisms of music. The development of computer hardware technology has made possible the development of Artificial Intelligence (AI) techniques for musical research in several directions such as composition, performance, music theory, and digital sound processing. The recent explosion of online audio music collections and the growing demand for listening to music in a personalized way have motivated the development of advanced techniques for interacting with these huge digital music libraries at the song level. Using computers to model human analysis of music and to get insight into the intellectual process of music is a challenge that is faced by many research communities under various names such as Intelligent Audio Analysis [1], Machine Listening [2], or Music Artificial Intelligence.

Part of this research was supported by a Marie Curie International Outgoing Fellowship within the 7th European Community Framework Program.
H. Papadopoulos is with CNRS, Laboratoire des Signaux et Systèmes, 3 rue Joliot-Curie, 91192 Gif-sur-Yvette, France (corresponding author; e-mail: [email protected]).
George Tzanetakis is with the Computer Science Department, University of Victoria, Victoria, B.C., V8P 5C2, Canada (e-mail: [email protected]).
1 For instance, the use of specific instruments can be established based on knowledge of the composition period.

A. Towards a Unified Musical Analysis

In the growing field of Music Information Retrieval (MIR), a fundamental problem is to develop content-based methods to enable or improve multimedia retrieval [3]. The exploration of a large music corpus can be based on several cues such as the audio signal, the score or textual annotations, depending on what the user is looking for. Metadata and textual annotations of the audio content allow for searching based on specific requests such as the title of the piece or the name of the composer. When not looking for a specific request, but more generally for some music pieces that exhibit certain musical properties, search engines are based on annotations that describe the actual music content of the audio, such as the genre, the tempo, the musical key, and the chord progression. Manual annotation of the content of musical pieces is a very difficult, time-consuming and tedious process that requires a huge amount of effort. It is thus essential to develop techniques for automatically extracting musical information from audio.

Although there have been considerable advances in music storage, distribution, indexation and many other directions in the last decades, there are still some bottlenecks for the analysis and extraction of content information. Music audio signals are very complex, both because of the intrinsic nature of audio, and because of the information they convey. Often regarded as an innate human ability, the automatic estimation of music content information proves to be a highly complex task, for at least two reasons.

On the one hand, music signals are extremely rich and complex from a physical point of view, in particular because of the many modes of sound production, of the wide range of possible combinations between acoustic events, and also because signal observations are generally incomplete and noisy. On the other hand, music audio signals are also complex from a semantic point of view: they convey multi-faceted and strongly interrelated information such as harmony, melody, metric, and structure. For instance, chords change more often on strong beats than on other positions of the metrical structure [4].

Recent work has shown that the estimation of musical attributes would benefit from a unified musical analysis that considers the complex relational structure of music as well as the context1 [5], [6]. Although there are a number of approaches that take into account interrelations between several dimensions in music (e.g. [7]), most existing computational models for music analysis tend to focus on a single music attribute. This is contrary to the human understanding and
perception of music, which is known to process holistically the global musical context [8]. In practice, most existing MIR systems have a relatively simple probabilistic structure and are constrained by limited hypotheses that do not model the underlying complexity of music. Dealing with real audio recordings requires the ability to handle both uncertainty and complex relational structure at multiple levels of representation. Existing approaches for music retrieval tasks typically fail to capture these two aspects simultaneously.

B. Statistical Relational Learning and Markov Logic

Real data such as music signals exhibit both uncertainty and rich relational structure. Until recent years, these two aspects have been generally treated separately, probability being the standard way to represent uncertainty in knowledge, while logical formalisms are the standard way to represent knowledge and complex relational information. Music retrieval tasks would benefit from a unification of both representations. As reflected by previous works, both aspects are important in music, and should be fully considered. However, traditional probabilistic graphical models and machine learning approaches are not able to cope with rich relational structure, while logic-based approaches are not able to cope with the uncertainty of audio and need a transcription step to apply logical inference on a symbolic representation.

Appealing approaches towards a unification between logic and probability have emerged within the field of Statistical Relational Learning (SRL) [9]–[11]. They combine first-order logic, relational representations and logical inference with concepts of probability theory and machine learning [12]. Ideas from probability theory and statistics are used to address uncertainty, while tools from logic, databases, and programming languages are introduced to represent structure. Many representations in which statistical and relational knowledge are unified within a single representation formalism have been proposed. They include relational Markov networks [13], probabilistic relational models [14], probabilistic inductive logic programming [15] or Bayesian logic programs [16]. We refer the reader to [11], [12] for a survey of SRL.

Among these approaches, Markov Logic Networks (MLNs) [17]–[19], which combine First-Order Logic (FOL) and probabilistic graphical models (Markov networks), have received considerable attention in recent years. Their popularity is due to their expressiveness and simplicity for compactly representing a wide variety of knowledge and reasoning about data with complex dependencies. Multiple learning and inference algorithms for MLNs have been proposed, for which open-source implementations are available (e.g. the Alchemy2 and ProbCog3 software packages). MLNs have been successfully applied in many domains and used for various tasks in AI, such as collective classification [20] and natural language processing [21].

2 http://alchemy.cs.washington.edu
3 http://ias.cs.tum.edu/research/probcog

C. Goal of the Paper

An MLN is a statistical relational learning framework that combines probabilistic inference with first-order logical reasoning. In this paper, we examine the current existing models for music processing and discuss their limitations from an artificial intelligence perspective. We then present MLNs as a highly flexible and expressive formalism for the analysis of music audio signals that encompasses most currently used probabilistic and logic-based models. Our research to date on the use of MLNs for music analysis has shown that they offer a very interesting alternative to the most commonly used hidden Markov models as a more expressive and flexible, yet concise model for content information extraction. We have proposed a single unified MLN model for the joint estimation of chords and global key [22] and we have explored the use of MLNs to integrate structural information to enhance chord estimation [23]. Here, we aim to provide a deeper understanding of the potential of MLNs for music analysis. Very few papers try to explain the deep-seated reasons why MLNs work. To be of real interest to the MIR community, we believe that an understanding of how they specifically relate to commonly used models is needed. To this purpose, we first focus on the theoretical foundations of Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) and compare the relative capabilities of these models in terms of formalism. This allows us to show how they can be elegantly and flexibly embedded in a more general multilevel architecture with MLNs, which offers new perspectives for music analysis.

Within the music analysis area, we present an application for tonal harmony music analysis [24]. Here, tonal harmony analysis is understood as segmenting and labeling an audio signal according to its underlying harmony [25]. In traditional computational models, it is not easy to express dependencies between different semantic and temporal levels. We design in the MLN framework a multi-level harmony description of music, at the beat (chords), bar/phrase (local key, including modulations) and global semantic structure time scales, in which information specific to the various strata interacts.

II. BACKGROUND

Previous work on music content estimation can be classified into two main categories, probabilistic and logic-based models. In the following section, specific emphasis will be given to applications related to tonal harmony analysis.

A. Probabilistic vs. Logic for Music Processing

1) Probabilistic Approaches for Music Content Analysis: Probabilistic graphical models form a large class of structured prediction models and are popular for MIR tasks that involve predicting structured objects. In particular, hidden Markov models [26] have been quite successful in modeling various tasks where objects can be represented as sequential phenomena, such as chord [27] and local key [28] estimation, beat tracking [29], note segmentation [30] and melody transcription [31]. The objects of interest are modeled as hidden variables that are inferred from some observations. For instance, in a typical chord estimation HMM, the unknown chord progression is inferred from the observation of chroma vectors. An important limitation of HMMs is that it is hard to express dependencies in the data. Strong independence assumptions between the observation variables are made (e.g. each chroma observation is independent from the others). A relevant musical description of audio would ideally consider multiple and typically interdependent observations.
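The chord-estimation HMM sketched above can be made concrete in a few lines. The following toy example (our illustration, not a system from the literature) decodes two chord states, C and G major, from beat-synchronous chroma vectors using the Viterbi algorithm, with binary chord templates standing in for trained emission distributions; all numbers are invented for illustration.

```python
import numpy as np

# Toy chord-estimation HMM: 2 chord states decoded from 12-bin chroma.
# Emission "likelihoods" are template matches, a common simplification;
# real systems use larger vocabularies and trained distributions.

CHORDS = ["C", "G"]
templates = np.zeros((2, 12))
templates[0, [0, 4, 7]] = 1.0   # C major triad: pitch classes C, E, G
templates[1, [7, 11, 2]] = 1.0  # G major triad: pitch classes G, B, D

trans = np.array([[0.6, 0.4],   # chords tend to persist between beats
                  [0.4, 0.6]])
prior = np.array([0.5, 0.5])

def viterbi(chroma):
    """Most likely chord sequence for a (T, 12) chroma matrix."""
    T = len(chroma)
    em = chroma @ templates.T + 1e-9      # template match per frame/state
    em /= em.sum(axis=1, keepdims=True)   # normalize per frame
    delta = np.log(prior) + np.log(em[0])
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(trans)  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(em[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [CHORDS[s] for s in reversed(path)]

# Three noisy beat-synchronous chroma frames: two C-ish, one G-ish
frames = np.array([
    [0.9, 0, 0.1, 0, 0.8, 0, 0, 0.7, 0, 0, 0, 0.1],
    [0.8, 0, 0.0, 0, 0.9, 0, 0, 0.8, 0, 0, 0, 0.0],
    [0.1, 0, 0.6, 0, 0.0, 0, 0, 0.9, 0, 0, 0, 0.8],
])
print(viterbi(frames))  # → ['C', 'C', 'G']
```

Note that the transition matrix smooths the decoded sequence: a single ambiguous frame will not trigger a chord change unless its emission evidence outweighs the self-transition bias, which is exactly the temporal-continuity assumption discussed above.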
Other formalisms that allow considering more complex dependencies between data have been explored. Variants of HMMs, such as semi-Markov models, can better model the duration distributions of the underlying events [32]. N-grams can model long-range chord sequences without making the simplifying Markovian assumption, as in HMM-based approaches, that each chord symbol depends only on the preceding one [33]. The tree structure presented in Paiement et al. [7] allows building a graphical probabilistic model where contextual information related to the meter is used to model the chord progression in order to generate chords. Dynamic Bayesian networks (DBN) allow the joint modeling of several musical attributes [5].

However, the use of graphical models that allow more complex dependencies than HMMs for music content estimation remains limited in the MIR field. HMMs belong to the class of Bayesian network models [34] that are used to represent the joint probability distribution p(y, x) between the hidden states y and the observations x, where both x and y are random variables. HMMs are generative models in the sense that they describe how the output (the hidden states y) probabilistically generates the input (the observations x), the outputs topologically preceding the inputs. According to Bayes' rule, the calculation of the conditional distribution p(y∣x) from p(y, x) requires computing the marginal distribution p(x). This requires enumerating all possible observation sequences, which is difficult when using multiple interdependent input features that result in a complex distribution. This generally leads to intractable models, unless the observation elements are considered as independent from each other. But ignoring these dependencies may impair the performance of the model.

In fact, in all the previously mentioned applications, the observation sequence x is already known and visible in both training and testing datasets. We are only interested in predicting the values of the hidden variables, given the observations. A discriminative framework that directly models the conditional distribution p(y∣x) is thus sufficient. In discriminative models, the assumptions of conditional independence between the observations and the current state that are posed for HMMs are relaxed, and there is no need to model the prior distribution over the input, p(x). This is in particular the case of Conditional Random Fields (CRFs) [35]. Many works have demonstrated that CRFs overcome several of the limitations of HMMs and offer a lot of potential for modeling linear sequence structures. In particular, they offer attractive properties in terms of designing flexible observation functions, with multiple interacting features, and modeling complex dependencies or long-range dependencies of the observations.

CRFs have been successfully applied in various fields other than music audio signal processing, including natural language processing, bioinformatics, computer vision and speech processing. There has recently been an increasing interest in using CRFs for modeling music-related tasks, and we review here these works. A tutorial on CRFs in the context of MIR research can be found in Essid (2013) [36].

In the context of audio music content estimation, a first attempt to use CRFs is presented in Burgoyne et al. (2007) [37] for the purpose of chord progression estimation. Only the property of discriminative learning with few parameters versus generative learning with HMMs is exploited. The other possibilities of the framework, such as using richer features or modeling complex dependencies, are not considered, and the implemented model does not yield results that outperform a classic HMM. Very recently, CRFs have been applied to beat tracking [38], and to singing voice separation [39].

Other audio tasks that can be seen as a sequential data labeling problem have been modeled in a CRF framework. Audio-to-score alignment has been the most extensive application of CRFs in MIR [40], [41]. It has been shown that existing models for this task can be reformulated with CRFs of different structures (semi-Markov CRF, Hidden CRF). The use of CRFs allows designing flexible observation functions that incorporate several features characterizing different aspects of the musical content. The calculation of each state conditional probability is based on audio frames from an arbitrary past or future, improving the matching of a frame with a score position. CRFs have been employed in automatic music transcription [42] in a post-processing step to reduce single-frame errors in a multiple-F0 transcription. They have also been used in the context of audio tagging [43], and musical emotion evolution prediction [44]. Finally, the ability of CRFs to use multiple dependent features has also been exploited in the symbolic domain, as for symbolic music retrieval [45] and for the automatic generation of lyrics [46].

2) Logic-Based Approaches for Music Content Analysis: A major advantage of the logic framework is that its expressiveness allows modeling music rules in a compact and human-readable way, thus providing an intuitive description of music. For instance, background knowledge, such as music theory, can be introduced to construct rules that reflect the human understanding of music [47]. Another advantage is that logical inference of rules allows taking into account all events, including those which are rare [48]. Inductive Logic Programming (ILP) [49] refers to logical inference techniques that are a subset of FOL. These approaches combine logic programming with machine learning. They have been widely used to model and learn music rules, especially in the context of harmony characterization and in the context of expressive music performance. Approaches based on logic have focused on symbolic representations, rather than on audio.

In the context of harmony characterization, pattern-based first-order inductive systems capable of learning new concepts from examples and background knowledge [50], or counterpoint rules for two-voice musical pieces in symbolic format [51], have been proposed. An inductive approach for learning generic rules from a set of popular music harmonization examples to capture common chord patterns is described in [52]. Some ILP-based approaches for the automatic characterization of harmony in symbolic representations [53] and classification of musical genres [54] have been extended to audio [55]. However, they require a transcription step, the harmony characterization being induced from the output of an audio chord transcription algorithm and not directly from audio.
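The flavor of the declarative harmony rules such systems manipulate can be illustrated with a deliberately simple example (ours, not a rule from any of the cited systems): a key is compatible with a chord when every chord tone is a diatonic degree of the key, and candidate keys are then inferred by enumeration over all assignments satisfying the rule.

```python
# Textbook-style harmony rule, pitch classes 0-11 (C = 0).
# This is an illustrative simplification, not a learned ILP rule.

MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}  # diatonic degrees of a major key

def major_triad(root):
    """Pitch classes of a major triad built on `root`."""
    return {root % 12, (root + 4) % 12, (root + 7) % 12}

def compatible(key_root, chord_root):
    """Rule: every tone of the chord is a scale degree of the key."""
    return all((p - key_root) % 12 in MAJOR_SCALE
               for p in major_triad(chord_root))

def candidate_keys(chord_roots):
    """Inference by enumeration: keys satisfying the rule for
    every observed chord."""
    return [k for k in range(12)
            if all(compatible(k, c) for c in chord_roots)]

# C, F and G major chords: the classic I-IV-V progression
print(candidate_keys([0, 5, 7]))  # → [0] (only C major fits all three)
```

Such hand-coded rules are brittle (a single chromatic chord eliminates the correct key), which is precisely the motivation, developed in the remainder of the paper, for attaching weights to logical formulas rather than treating them as hard constraints.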
In the context of expressive music performance, algorithms for discovering general rules that can describe fundamental principles of expressive music performance, such as rules about tempo and expressive timing, dynamics and articulation, have also been proposed [47], [56]–[58]. The inductive logic programming approaches are not directly applied to audio, but on symbolic representations. This generally requires a transcription step, such as melody transcription [47].

B. Tonal and Harmony Music Content Estimation

Since we present an application of MLNs for tonal harmony music analysis, we briefly review in this section existing work on chord, key and structure estimation. The automatic estimation of each of these musical attributes by itself is an important topic of study in the area of content estimation of music audio signals. We review below only works that are directly related to the proposed model. We refer the reader to [59]–[61] for recent reviews on each of these topics.

1) Chord and Local Key Estimation: Harmony together with rhythm are two of the faces of Western tonal music that have been investigated for hundreds of years [62]. Harmony is structured at different time-scales (beat, bar, phrase-level, sections, etc.). Pitches are governed by structural principles and music is organized around one or more stable reference pitches. A chord is defined as a combination of pitches, and the system of relationships between pitches corresponds to a key. A key, as a theoretical concept, implies a tonal center that is the most stable pitch, called the tonic, and a mode (usually major or minor). A piece of music generally starts and ends in a particular key referred to as the main or global key of the piece. However, it is common that the composer moves between keys. A change between different keys is called a modulation. Western tonal music can be conceived of as a progression of a sequence of key regions in which pitches are organized around one stable tonal center. Such a region is defined here as a local key, as opposed to the global key. Tonality analysis describes the relationships between the various keys in a piece of music. Tonality and key are complex perceptual attributes, whose perception depends on the listener's level of music training. Moreover, numerous phenomena in a music piece (ambiguous key, apparent tonality, no tonality, etc.) contribute to make the problem of local key estimation challenging, and little work has been conducted on this topic (see [28], [60] for more details).

Chords and (local) keys reflect the pitch content of an audio signal at different time-scales. They are intimately related, specific chords indicating a stable tonal center and a given key implying the use of particular chords. Previous works have explored the idea of using chords to find the main key of a musical excerpt [63]–[66]. But the question of how the chord and the key progressions can be jointly modeled and estimated remains scarcely addressed. The few existing works on the topic present serious limitations, as the analysis window size for key estimation is empirically set to a fixed value [67], [68] (resulting in undetected key changes for pieces with a fast tempo, and chord rather than key estimation for pieces with a slow tempo), or they do not fully exploit the mutual dependency between chords and keys [28] (the local key is estimated from a fixed chord progression).

2) A Structurally Consistent Description of Music: Music structure appears at several time scales, from musical phrases to longer semantically meaningful sections that generally have multiple occurrences (with possible variations) within the same musical piece. Previous works have revealed that the semantic structure can be used as a cue to obtain a "structurally consistent" mid-level representation of music. In the work of Dannenberg [69], music structure is used to constrain a beat tracking program based on the idea that similar segments of music should have corresponding beats and tempo variation. A work more closely related to this article is [70], in which the repetitive structure of songs is used to enhance chord extraction. A chromagram [71], [72] is extracted from the signal, and segments corresponding to a given type of section are replaced by the average of the chromagram over all the instances of the same segment type over the whole song, so that similar structural segments are labeled with the exact same chord progression. A limitation of this work is that it relies on the hypothesis that the chord sequence is the same in all sections of the same type. However, repeated segments are often transformed up to a certain extent and present variations between several occurrences. Moreover, in the case that one segment of the chromagram is blurred (e.g. because of noise or percussive sounds), this will automatically affect all identical segments, and thus degrade the chord estimation.

III. MLNs AND THEIR RELATIONSHIP TO PROBABILISTIC GRAPHICAL MODELS

In this section, we introduce Markov logic networks for music signal processing and clarify the relationship of MLNs to both HMMs and CRFs. As examined in Sec. II, HMMs are the most common models used for music processing. In a labeling context, a HMM can be viewed as a particular case of CRF, which itself is a special case of Markov network. CRFs serve here as a bridge between HMMs and MLNs.

Notations: We will use the following notations. We consider probability distributions over sets of random variables V = X ∪ Y, where X is a set of input variables that we assume are observed (X is a sequence of observations), and Y is a set of output variables that we wish to predict (the "hidden states"). Every variable takes outcomes from a set V = X ∪ Y that can be either continuous or discrete. We focus on the discrete case in this paper. An arbitrary assignment to X is denoted by a vector x = (x1, ..., xN). Given a variable Xn ∈ X, the notation xn denotes the value assigned to Xn by x. When there is no ambiguity, we will use the notations p(y, x) and p(y∣x) instead of p(Y = y, X = x) and p(Y = y∣X = x).

The extraction of music content information can often be seen as a classification problem, in the sense that we wish to assign a class or a label y ∈ Y (e.g. a chord label) that is not directly observed to an observation x ∈ X (e.g. a chroma vector). Note that the x are generally fixed observations, rather than treated as random variables. We can approach this problem by specifying a probability distribution to select the most likely class y ∈ Y we wish to predict for a given observation x ∈ X. In general, the set of variables X ∪ Y has a complex structure. A popular approach is to use a (probabilistic) graphical model that allows representing the manner in which the variables depend on each other. A graphical model is a family of probability distributions that factorize according to an underlying graph. The idea is to represent
a distribution over a large number of random variables by a product of potential functions4 that each depend on only a smaller subset of variables.

In a probabilistic graphical model, there is a node for each random variable. The absence of an edge between two variables a and b means that they are conditionally independent given all other random variables in the model5. The concept of conditional independence allows decomposing complex probability distributions into a product of independent factors (see Fig. 1 for an example).

Graphical models include many model families. There are directed and undirected graphical models, depending on the way the original probability distribution is factorized. Many concepts of the theory of graphical models have been developed independently in different areas and thus have different names. Directed graphical models are also commonly known as Bayesian networks, and undirected models are also referred to as Markov random fields or Markov networks.

A. Hidden Markov Models

Hidden Markov models [26] belong to the class of directed graphs, and are standard models for estimating a sequential phenomenon in music. They make strong independence assumptions between the observation variables to reduce complexity. A HMM computes a joint probability p(y, x) between an underlying sequence of N hidden states y = (y1, y2, ..., yN) and a sequence of N observations x = (x1, x2, ..., xN). A HMM makes two independence assumptions. First, each observation variable xn is assumed to depend only on the current state yn. Second, it makes the Markov assumption that each state depends only on its immediate predecessor6. A HMM is specified using three probability distributions:

● The distribution over initial states p(y1);
● The state transition distribution p(yn ∣ yn−1) to transit from a state yn−1 to a state yn;
● The observation distribution p(xn ∣ yn) of an observation xn to be generated by a state yn.

This factorizes the joint probability into a first term corresponding to each pair of consecutive labels and a second term corresponding to each observation with its parent label (see Fig. 1 for an example):

p(y, x) = ∏_{n=1}^{N} p(yn ∣ yn−1) p(xn ∣ yn)   (1)

where we assume an unconditional prior distribution over the starting state, and for time n = 1 we write the initial state distribution p(y1) as p(y1 ∣ y0).

Fig. 1. Graphical model of a HMM describing p(y, x) for a sequence of three input variables x1, x2, x3 and three output variables y1, y2, y3. Because of the conditional independence between variables, the model simplifies into: p(x1, x2, x3, y1, y2, y3) = p(y3 ∣ y2) ⋅ p(x3 ∣ y3) ⋅ p(y2 ∣ y1) ⋅ p(x2 ∣ y2) ⋅ p(y1) ⋅ p(x1 ∣ y1).

B. Conditional Random Fields

In many real-world schemes that involve relational data, in particular in music, the entities to be classified are related to each other in complex ways and their labels are not independent. Moreover, any successful classification would need to rely on multiple highly interdependent features that describe the objects of interest. CRFs are generally better suited than HMMs to including rich, overlapping features and thus to representing much additional knowledge in the model7. A CRF is a probabilistic model for computing the conditional probability p(y∣x) of the output y given the sequence of observations x. A CRF may be viewed as a Markov network globally conditioned on the observations.

A Markov network is a model for the joint distribution of a set of variables V = (V1, V2, ..., Vn) ∈ V [75]. It is composed of an undirected graph G and a set of potential functions φk. The graph has a node for each variable and there is a potential function for each clique in the graph8. A potential function is a non-negative real-valued function of the state of the corresponding clique. A potential between connected nodes can be viewed as some correlation measure, but it does not have a direct probabilistic interpretation and its value is not restricted to be between 0 and 1. The joint distribution of the variables represented by a Markov network can be factorized over the cliques of the network by:

p(V = v) = (1/Z) ∏_k φk(v{k})   (2)

where v is an assignment to the random variables V and v{k} is the state of the kth clique (i.e., the state of the variables that appear in that clique). Z, known as the partition function, is given by Z = ∑_{v∈V} ∏_k φk(v{k}).

CRFs can be arbitrarily structured (e.g. skip-chain CRFs [76], semi-Markov CRFs [77], segmental CRFs [78]). Here, we focus on the canonical linear-chain model introduced in Lafferty et al. (2001) [35], which assumes a first-order Markov dependency between the hidden variables y (see Fig. 2).
p(y1 ∣x1 ). Both the observations and the hidden state are random variables
and thus represented as unshaded nodes.
The joint probability of a state sequence y and an obser-
vation sequence x factorizes as the product of conditional
distributions. In this directed graph model, each observation Fig. 2. Graphical model view of a linear chain-structured CRF. The single,
has a “parent label” and the joint probability of the sequence large shaded node corresponds to the entire fixed observation sequence x, and
factorizes into pairs of terms, one term corresponding to pairs the clear nodes correspond to the label variables of the sequence y.
For a linear-chain CRF, the cliques consist of an edge
4
Also referred to as factors function or local functions in the literature. between yn−1 and yn as well as the edges from those two
5
Formally, given a third random variable c, two random variables a and b
7
are conditionally independent if and only if p(a, b∣c) = p(a∣c)p(b∣c). Note For a discussion about the advantages and disadvantages of CRF vs. HMM,
that in contrast two random variables a and b are statistically independent if see for instance Murphy (2012) [34], chapter 19. For further reading on CRFs,
and only if p(a, b) = p(a)p(b). we recommend the tutorials [73] and [74].
6 8
This actually corresponds to the standard case of first-order HMMs. In an undirected graph, a clique is a set of nodes Ω forming a fully
Higher-order HMMs called k-order HMMs exist where the next state may connected subgraph: for every two nodes in Ω, there exists an edge connecting
depend on past k states. the two.
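As a concrete illustration of the three HMM distributions above, the following sketch (our own toy example with made-up numbers, not a model from the paper) evaluates the joint probability of Eq. (1) for a two-state HMM and checks that it defines a valid distribution over all length-3 sequences.

```python
from itertools import product
import numpy as np

# Toy two-state, two-symbol HMM (illustrative numbers only).
pi = np.array([0.6, 0.4])        # initial distribution p(y1)
A = np.array([[0.7, 0.3],        # transition matrix p(y_n | y_{n-1})
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],        # observation matrix p(x_n | y_n)
              [0.3, 0.7]])

def joint(y, x):
    """Eq. (1): p(y, x) = p(y1) p(x1|y1) * prod_{n>1} p(yn|yn-1) p(xn|yn)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for n in range(1, len(y)):
        p *= A[y[n - 1], y[n]] * B[y[n], x[n]]
    return p

# Sanity check: the joint must sum to 1 over all state/observation sequences.
total = sum(joint(y, x)
            for y in product(range(2), repeat=3)
            for x in product(range(2), repeat=3))
print(round(total, 10))  # -> 1.0
```

The same `joint` function is what a generative HMM toolbox evaluates implicitly when scoring a candidate chord sequence.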
MANUSCRIPT T-ASL-05679-2016.R1

labels to the set of observations x (see Fig. 2). The probability of a particular label sequence y given an observation sequence x can be factorized into a normalized product of strictly positive, real-valued potential functions, globally normalized over the entire sequence structure:

p(y∣x) = (1/Z(x)) ∏_{n=1}^{N} Fn(x, y)   (3)

The normalization factor Z(x) sums over all possible state sequences so that the distribution sums to 1. Fn(x, y) is a set of feature functions designed to capture useful domain information, of the form Fn(x, y) = exp(∑_{k=1}^{K} λk fk(yn−1, yn, x)), where the fk are real-valued functions of the state and the λk are a set of weights. Eq. (3) writes:

p(y∣x) = (1/Z(x)) ∏_{n=1}^{N} exp(∑_{k=1}^{K} λk fk(yn−1, yn, x))   (4)

where Z(x) = ∑_y ∏_{n=1}^{N} exp(∑_{k=1}^{K} λk fk(yn−1, yn, x)).

In contrast to HMMs, the feature functions of CRFs can depend not only on the current observation but also on observations from an arbitrary past or future for the calculation of each state probability. Feature functions can belong to any family of real-valued functions, but in general they are binary functions, and we will focus on this case here.

We also write the model in the case where the observations are restricted to a single frame xn, for convenience of future comparison to HMMs (see Fig. 3):

p(y∣x) = (1/Z(x)) ∏_{n=1}^{N} exp(∑_{k=1}^{K} λk fk(yn−1, yn, xn)) = (1/Z(x)) exp(∑_{n=1}^{N} ∑_{k=1}^{K} λk fk(yn−1, yn, xn))   (5)

Fig. 3. Graphical model of a HMM-like linear-chain CRF describing p(y∣x). Here, it is an undirected graphical model: compared to the HMM in Fig. 1, the arrowheads of the edges have disappeared. The shaded nodes indicate that the corresponding variables are observed and not generated by the model.

C. HMM vs. CRF

The joint distribution p(y, x) of a HMM can be viewed as a CRF with a particular choice of feature functions. We now describe how it is possible to translate a HMM into the feature functions and weights of a CRF.

For each state-transition pair (i, j), i, j ∈ S (where S represents the set of hidden states) and each state-observation pair (i, o), i ∈ S, o ∈ O (where O represents the set of observations), let us define a binary feature function of the form:

f^trans_{i,j}(yn−1, yn, xn) = 1(yn−1 = i) ⋅ 1(yn = j)
f^obs_{i,o}(yn−1, yn, xn) = 1(yn = i) ⋅ 1(xn = o)

where 1(x = i) denotes an indicator function of x that takes the value 1 when x = i and 0 otherwise. In other words, f^trans_{i,j}(yn−1, yn, xn) returns 1 when yn−1 = i and yn = j, and 0 otherwise.

Let us also define the set of weights w^trans_{i,j} = log p(yn = j ∣ yn−1 = i). We have:

∑_{i,j∈S} w^trans_{i,j} ⋅ f^trans_{i,j}(yn−1, yn, xn) = log p(yn = j ∣ yn−1 = i)

Similar equations are obtained with w^obs_{i,o} = log p(xn = o ∣ yn = i)⁹. We can then rewrite Eq. (1) as:

p(y, x) = ∏_{n=1}^{N} p(yn ∣ yn−1) p(xn ∣ yn)
        = ∏_{n=1}^{N} exp( log p(yn ∣ yn−1) + log p(xn ∣ yn) )
        = ∏_{n=1}^{N} exp( ∑_{i,j∈S} w^trans_{i,j} ⋅ f^trans_{i,j}(yn−1, yn, xn) + ∑_{i∈S} ∑_{o∈O} w^obs_{i,o} ⋅ f^obs_{i,o}(yn−1, yn, xn) )   (7)

We refer to a feature function generically as fk, where fk ranges over both all of the f^trans_{i,j} and all of the f^obs_{i,o}, and similarly refer to a weight generically as wk. The previous equation writes:

p(y, x) = ∏_{n=1}^{N} exp( ∑_{k=1}^{K} wk ⋅ fk(yn−1, yn, xn) )

From the definition of conditional probability, we have:

p(y∣x) = p(y, x) / p(x) = p(y, x) / ∑_y p(y, x)

Denoting Z(x) = ∑_y p(y, x), we finally obtain:

p(y∣x) = (1/Z(x)) ∏_{n=1}^{N} exp( ∑_{k=1}^{K} wk ⋅ fk(yn−1, yn, xn) ) = (1/Z(x)) exp( ∑_{n=1}^{N} ∑_{k=1}^{K} wk ⋅ fk(yn−1, yn, xn) )   (10)

Eq. (10) defines the same family of distributions as Eq. (5). For a labeling task, a HMM is thus a particular case of linear-chain CRF for a suitable choice of clique potentials, where each potential feature is either a state feature function or a transition feature function. In practice, the main difference between using HMMs and HMM-like CRFs lies in the way the model parameters are learnt. In the case of HMMs they are learned by maximizing the joint probability distribution p(x, y), while in the case of CRFs they are learned by maximizing the conditional probability distribution p(y∣x), which avoids modeling the observation distribution p(x).

For convenience (see Sec. IV-B3), we also write the conditional probability of the CRF in its direct translation from a HMM from Eq. (7) as a product of factors:

p(y∣x) = (1/Z(x)) ∏_{n=1}^{N} Φtrans(yn−1, yn, xn) Φobs(yn, xn)   (11)

where Φtrans(yn−1, yn, xn) and Φobs(yn, xn) are exponential-family potential functions, respectively over state and observation configurations, that are derived from the transition and observation probabilities of the HMM. Here we have removed the unused variable yn−1 in the state-observation pairs.

D. Markov Logic Networks

1) MLN Intuitive Idea: MLNs have been introduced by Richardson & Domingos [17] and are a combination of

⁹ For state-observation pairs, the variable yn−1 could be removed but we keep it to stay consistent with the definition of linear-chain CRF (Eq. (3)).
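The HMM-to-CRF translation just derived can be checked numerically. In this sketch (toy numbers, our own illustration), the HMM probabilities are turned into CRF weights w = log p; the exponentiated sum of weighted indicator features then reproduces the HMM joint p(y, x) for every state sequence, so normalizing by Z(x) yields exactly the conditional of Eq. (10).

```python
from itertools import product
import numpy as np

S = 2                                    # number of hidden states
pi = np.array([0.6, 0.4])                # p(y1), treated as transition from a dummy start state y0
A = np.array([[0.7, 0.3], [0.2, 0.8]])   # p(y_n | y_{n-1})
B = np.array([[0.9, 0.1], [0.3, 0.7]])   # p(x_n | y_n)

# CRF weights translated from the HMM (Sec. III-C):
w_init = np.log(pi)
w_trans = np.log(A)                      # w^trans_{i,j} = log p(y_n=j | y_{n-1}=i)
w_obs = np.log(B)                        # w^obs_{i,o}   = log p(x_n=o | y_n=i)

def crf_score(y, x):
    """exp of the weighted sum of indicator features (numerator of Eq. 10)."""
    s = w_init[y[0]] + w_obs[y[0], x[0]]
    for n in range(1, len(y)):
        s += w_trans[y[n - 1], y[n]] + w_obs[y[n], x[n]]
    return np.exp(s)

def hmm_joint(y, x):
    p = pi[y[0]] * B[y[0], x[0]]
    for n in range(1, len(y)):
        p *= A[y[n - 1], y[n]] * B[y[n], x[n]]
    return p

x = (0, 1, 1)
# The unnormalized CRF score equals the HMM joint for every y ...
for y in product(range(S), repeat=3):
    assert np.isclose(crf_score(y, x), hmm_joint(y, x))
# ... so after dividing by Z(x) = sum_y p(y, x), Eq. (10) gives p(y|x).
Z = sum(hmm_joint(y, x) for y in product(range(S), repeat=3))
post = {y: crf_score(y, x) / Z for y in product(range(S), repeat=3)}
print(round(sum(post.values()), 10))  # -> 1.0
```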
Markov networks and first-order logic (FOL). A MLN is a set of weighted FOL formulas¹⁰ that can be seen as a template for the construction of probabilistic graphical models. We present in this section a short overview of the main concepts of Markov logic, with specific examples from the modeling of musical concepts. We refer the reader to Domingos & Lowd (2009) [19] for a thorough review.

MLNs are meant to be intuitive representations of real-world scenarios. In general, FOL formulas are first used to express knowledge. Then a Markov network is constructed from the instantiation of these formulas. The knowledge base is transformed into a probabilistic model simply by assigning weights to the formulas, manually or by learning them from data. Inference is then performed on the Markov network.

2) Definitions and Vocabulary: A Markov network, as presented in Sec. III-B, is a model for the joint distribution of a set of variables V = (V1, V2, . . . , Vn) ∈ V, often represented as a log-linear model with each clique potential replaced by an exponentiated weighted sum of features of the state:

p(V = v) = (1/Z) exp( ∑_j wj fj(v) )   (12)

where Z is a normalization factor and the fj(v) are features of the state v. A feature may be any real-valued function of the state, but here (and in the literature on Markov logic)¹¹ we focus on binary features, fj(v) ∈ {0, 1}. The most direct translation from the potential-function form Eq. (2) to the log-linear form Eq. (12) is obtained with one feature corresponding to each possible state v{k} of each clique, with its weight being log(φk(v{k})).

In first-order logic, the domain of discourse is defined by a set of four types of symbols. Constants (e.g. CMchord ("C major chord"), GMchord) represent objects in the domain; the set of constants is here assumed finite¹². Variables (e.g. x, y) take objects in the domain as values. Predicates represent properties of objects (e.g. IsMajor(x), IsHappyMood(x)) and relations between them (e.g. AreNeighbors(x,y)). Functions represent mappings from tuples of objects to objects.

A predicate can be grounded by replacing its variables with constants (e.g. IsMajor(CMchord), AreNeighbors(CMchord,GMchord)). A predicate takes as output either True (synonymous with 1) or False (synonymous with 0). A ground predicate is called an atomic formula or an atom. A positive literal is an atomic formula and a negative literal is the negation of an atomic formula. A world is an assignment of a truth value (0 or 1) to each possible ground predicate.

A first-order knowledge base (KB) is a set of formulas in first-order logic, constructed from predicates using logical connectives (⇒: "if. . . then" (implication); ⇔: "if and only if" (equivalence); ∧: "and" (conjunction); ∨: "or" (disjunction); ⌝: "not" (negation)) and quantifiers (the universal ∀: "for all"; the existential ∃: "there exists").

In general, in Markov logic implementations, formulas are converted to clausal form, also known as conjunctive normal form (CNF), for automated inference. Every KB in FOL can be converted to clausal form [82]. A clausal form is a conjunction of clauses, a clause being a disjunction of literals.

3) MLN Formal Definition: A first-order KB can be seen as a set of hard constraints on the set of possible worlds: if a world violates even one formula, it has zero probability. In a real-world setting, logic formulas are generally, but not always, true. The basic idea in Markov logic is to soften these constraints to handle uncertainty: when a world violates one formula in the KB, it is less probable than one that does not violate any formulas, but not impossible. Markov logic allows contradictions between formulas by weighting the evidence on both sides. The weight associated with each formula reflects how strong a constraint is, i.e. how unlikely a world is in which that formula is violated. The more formulas a possible world satisfies, the more likely it is. Tab. I shows a KB and its conversion to clausal form, with corresponding weights in the MLN.

Definition 1: Formally, a Markov logic network L [17] is defined as a set of pairs (Fi, wi), where Fi is a formula in first-order logic and wi is a real number associated with the formula. Applied to a finite set of constants C (to which the predicates appearing in the formulas can be applied), it defines a ground Markov network ML,C as follows:
1) ML,C contains one binary node for each possible grounding of each predicate (i.e. each atom) appearing in L. The value of the node is 1 if the ground predicate is true, and 0 otherwise.
2) ML,C contains one feature fj for each possible grounding of each formula Fi in L. The feature value is 1 if the ground formula is true, and 0 otherwise. The weight wj of the feature is the weight wi associated with the formula Fi in L.

A MLN can be viewed as a template for constructing Markov networks: given different sets of constants, it will produce different networks. Each of these networks is called a ground Markov network. A ground Markov network ML,C specifies a joint probability distribution over the set V of possible worlds, i.e. the set of possible assignments of truth values to each of the ground atoms in V¹³. From Def. (1) and Eq. (12), the joint distribution of a possible world V given by ML,C is:

p(V = v) = (1/Z) exp( ∑_i wi ni(v) )   (13)

where the sum is over indices of MLN formulas, ni(v) is the number of true groundings of formula Fi in v (i.e. ni(v) is the number of times the ith formula is satisfied by possible world V), and Z = ∑_{v′∈V} exp( ∑_i wi ni(v′) ).

¹⁰ First-order logic is also known as predicate logic because it uses predicates and quantifiers, as opposed to propositional logic, which deals with simple declarative propositions and is less expressive. The adjective "first-order" distinguishes first-order logic, in which quantification is applied only to variables, from higher-order logic, in which quantification can be applied to predicate and function symbols. For more details, see e.g. [79], [80].
¹¹ Also as in the case of CRFs presented in Sec. III-B.
¹² Markov logic has originally been defined only for finite domains [17] but has since been extended to infinite domains [81]. In this paper we are only concerned with finite domains.
¹³ The ground Markov network consists of one binary node for each possible grounding of each predicate. A world V ∈ V is a particular assignment of truth values (0 or 1) to each of these ground predicates. If ∣V∣ is the number of nodes in the network, there are 2^∣V∣ possible worlds.
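Def. 1 can be made concrete with a few lines of code. The sketch below (a toy enumeration written for this illustration, not an actual Markov logic engine such as Alchemy) grounds the clause ⌝IsMajor(x) ∨ IsHappyMood(x) from Tab. I over the two constants CM and GM, and counts its true groundings n1(v) in a given world v.

```python
constants = ["CM", "GM"]

# Ground atoms of the domain: one binary node per grounding (Def. 1),
# listed in the same order as in the example of Sec. III-D4.
atoms = ([f"IsMajor({c})" for c in constants]
         + [f"IsHappyMood({c})" for c in constants]
         + [f"AreNeighbors({a},{b})" for a in constants for b in constants])

def f1_groundings(world):
    """Truth value of each grounding of the clause !IsMajor(x) v IsHappyMood(x).
    `world` maps each ground atom to 0 or 1."""
    for c in constants:
        yield (not world[f"IsMajor({c})"]) or world[f"IsHappyMood({c})"]

def n1(world):
    """n1(v): number of true groundings of formula F1 in world v (Eq. 13)."""
    return sum(f1_groundings(world))

# World v = (1, 1, 1, 0, 1, 1, 1, 1) in the atom order above:
v = dict(zip(atoms, (1, 1, 1, 0, 1, 1, 1, 1)))
print(n1(v))  # -> 1: only the grounding IsMajor(CM) => IsHappyMood(CM) holds
```

Each ground clause becomes one feature of the ground Markov network, all sharing the weight of the template formula.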
TABLE I
EXAMPLE OF A FIRST-ORDER KB AND CORRESPONDING WEIGHTS IN THE MLN.

Knowledge: A major chord implies a happy mood.
  First-order logic formula: ∀x IsMajor(x) ⇒ IsHappyMood(x)
  Clausal form: ⌝IsMajor(x) ∨ IsHappyMood(x)
  Weight: w1 = 0.5

Knowledge: If 2 chords are neighbors on the circle of fifths, either both are major chords or neither is.
  First-order logic formula: ∀x ∀y AreNeighbors(x, y) ⇒ (IsMajor(x) ⇔ IsMajor(y))
  Clausal form: ⌝AreNeighbors(x, y) ∨ IsMajor(x) ∨ ⌝IsMajor(y)   (weight w2 = 1.1)
               ⌝AreNeighbors(x, y) ∨ ⌝IsMajor(x) ∨ IsMajor(y)   (weight w2 = 1.1)

Assumptions in practical applications: To ensure that the number of possible worlds for ML,C is finite, and that the MLN will give a well-defined probability distribution over those worlds, three assumptions about the logical representation are typically made: different constants refer to different objects (unique names), the only objects in the domain are those representable using the constant and function symbols (domain closure), and the value of each function for each tuple of objects is always a known constant (known functions). For more details, see [17].

Remark: MLNs are usually defined as log-linear models. However, Eq. (13) can be rewritten as a product of potential functions:

p(V = v) = (1/Z) exp( ∑_i wi ni(v) ) = (1/Z) ∏_i φi(v{i})^{ni(v)}   (14)

with φi(v{i}) = e^{wi}. This shows that any discrete probabilistic model expressible as a product of potentials can be expressed with a MLN. This includes Markov and Bayesian networks.

4) Example: Fig. 4 shows the graph of the ground Markov network defined by the two formulas in Tab. I and the constants CMchord (CM) and GMchord (GM). The grounding process is illustrated in Fig. 5. There are 3 predicates and 2 constants. They result in 8 nodes that are binary random variables, denoted by V, each of which represents a ground atom. The graphical structure of ML,C follows from Def. (1):
● Each possible grounding of each predicate in Fi becomes a node in the Markov network. Each node has a binary value: 1 ("True") or 0 ("False").
● Each possible grounding of each formula becomes a feature in the Markov network.
● All nodes whose corresponding predicates appear in the same formula form a clique in the Markov network. Each clique is associated with a feature.

The Markov network grows as the number of constants and formula groundings increases, but the number of formulas (or templates) stays the same.

Fig. 4. Ground Markov network obtained by applying the formulas in Tab. I to the constants CMchord (CM) and GMchord (GM).

Fig. 5. Illustration of the grounding process of the ground Markov network in Fig. 4. Adapted from [83].

For the Markov network in Fig. 4, a world is an assignment of a truth value to each possible ground predicate in V = (IsMajor(CM), IsMajor(GM), IsHappyMood(CM), IsHappyMood(GM), AreNeighbors(CM,CM), AreNeighbors(CM,GM), AreNeighbors(GM,CM), AreNeighbors(GM,GM)). Some elements in V may correspond to the same template formula Fi with different truth assignments, and ni(v) only counts the assignments which make Fi true. For instance, there are two groundings of the formula ∀x IsMajor(x) ⇒ IsHappyMood(x). For v = (1, 1, 1, 0, 1, 1, 1, 1), where 1 is true and 0 is false, n1(v) = 1 because only IsMajor(CM) ⇒ IsHappyMood(CM) is true, while IsMajor(GM) ⇒ IsHappyMood(GM) is not. For detailed examples of the computation of the joint distribution of a possible world V from Eq. (13) in Markov logic, we refer the reader to Cheng et al. (2014) [84].

E. MLNs vs. HMM and CRF

In a labeling task context, we know a priori which predicates are evidence and which ones will be queried. The ground atoms in the domain can be partitioned into a set of evidence atoms (observations) x and a set of query atoms y. The conditional probability distribution of y given x is [18]:

p(y∣x) = (1/Z(x)) exp( ∑_{i∈FY} wi ni(x, y) )   (15)

where FY is the set of all MLN clauses with at least one grounding involving a query atom and ni(x, y) is the number of true groundings of the ith clause involving query atoms. This can also be written:

p(y∣x) = (1/Z(x)) exp( ∑_{i∈GY} wi gi(x, y) )   (16)

where GY is the set of ground clauses in ML,C involving query atoms, and gi(x, y) = 1 if the ith ground clause is true in the data and 0 otherwise.

Comparing Eq. (16) and Eq. (10), we can see that for a labeling task, a HMM can be expressed in Markov logic by producing a clause for each state-transition pair (i, j), i, j ∈ S and each state-observation pair (i, o), i ∈ S, o ∈ O, and giving them weights wi,j = log p(yn = j ∣ yn−1 = i) and wi,o = log p(xn = o ∣ yn = i) respectively. Eqs. (16) and (5) show that a linear-chain CRF can be expressed in Markov logic by producing a clause corresponding to each feature function of the CRF, with the same weight as the CRF feature. These graphical models can be specified very compactly in Markov logic using a few generic formulas (see Section IV-C1).
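Continuing the example of Sec. III-D4, the sketch below (our own brute-force enumeration over all 2⁸ worlds, for illustration only; practical systems use the inference algorithms cited in this paper) computes the counts ni(v), the partition function Z, and the world probability p(V = v) of Eq. (13) for the two Tab. I formulas with weights w1 = 0.5 and w2 = 1.1.

```python
import math
from itertools import product

constants = ["CM", "GM"]
atom_names = ([f"IsMajor({c})" for c in constants]
              + [f"IsHappyMood({c})" for c in constants]
              + [f"AreNeighbors({a},{b})" for a in constants for b in constants])

def n_counts(world):
    """(n1(v), n2(v)): true groundings of the Tab. I clauses in a world.
    The two clauses derived from the second formula share the weight w2,
    so their true groundings are counted together."""
    w = dict(zip(atom_names, world))
    n1 = sum((not w[f"IsMajor({c})"]) or w[f"IsHappyMood({c})"]
             for c in constants)
    n2 = sum(((not w[f"AreNeighbors({a},{b})"]) or w[f"IsMajor({a})"]
              or (not w[f"IsMajor({b})"])) +
             ((not w[f"AreNeighbors({a},{b})"]) or (not w[f"IsMajor({a})"])
              or w[f"IsMajor({b})"])
             for a in constants for b in constants)
    return int(n1), int(n2)

weights = (0.5, 1.1)                       # w1, w2 from Tab. I

def score(world):                          # exp( sum_i wi ni(v) )
    return math.exp(sum(wt * n for wt, n in zip(weights, n_counts(world))))

worlds = list(product((0, 1), repeat=len(atom_names)))
Z = sum(score(u) for u in worlds)          # partition function over 2^8 worlds
v = (1, 1, 1, 0, 1, 1, 1, 1)               # the example world from Sec. III-D4
print(n_counts(v))                         # -> (1, 8)
print(score(v) / Z)                        # p(V = v) from Eq. (13)
```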
F. MLNs, First-order Logic & Probabilistic Graphical Models

Markov logic generalizes both first-order logic and most commonly-used statistical models. FOL is the special case of MLNs obtained when all weights are equal and tend to infinity. In addition to adding flexibility to knowledge bases through weights, Markov logic allows contradictions between formulas. Other interesting features include building a MLN by merging several KBs, even if they are partly incompatible [19].

From a probabilistic point of view, Markov logic allows very complex models to be represented very compactly (e.g. only three formulas for a hidden Markov model). Any discrete probabilistic model expressible as a product of potentials can be expressed with a Markov logic network. MLNs thus generalize most commonly-used probabilistic graphical models, including Markov networks and Bayesian networks¹⁴. They also facilitate the incorporation of rich domain knowledge that can be combined with purely empirical learning, and allow reasoning with incomplete data [86], [87].

Several implementations of Markov logic exist that comprise a series of efficient algorithms for inference, as well as for weight and structure learning. In practical applications, MLN distributions typically involve a large number of variables with implicit or explicit dependencies between each other. In this context, algorithms that combine probabilistic methods with ideas from logical inference have been developed. This is out of the scope of this article, but we refer the reader to Domingos & Lowd (2009) [19], Chapters 3 and 4, for a detailed description of these algorithms, and also to [88]–[90] for more recent reviews.

IV. APPLICATION: MLN MULTI-SCALE TONAL HARMONY ANALYSIS

In this section, we instantiate Markov logic networks for music signal processing within the context of tonal harmony analysis, and show how they compare with probabilistic graphical models commonly used in MIR. Starting from a classic HMM for chord progression estimation, we then propose a new chord estimation model based on a CRF that integrates richer features. We show how these models can be translated into Markov logic, which offers further flexibility in modeling the complex relational structure of music. We finally present a MLN that is able to model complex tonal harmony relational structure at several time scales.

A. Generalities

1) Music Theory Foundations and Hypotheses: Musical elements are highly organized. At the highest level, when listening to a piece of music, we can in general feel a structure and divide the piece into several segments that are semantically meaningful, as for instance verse or chorus sections in a Western popular music song. These segments are in general related to the metrical structure, which is itself hierarchical. The most salient metrical level, called the beat level, corresponds to the foot-tapping rate. Beats are aggregated in larger time units called measures or bars.

In the time dimension, chords and local keys can be viewed respectively as local and more global elements of tonal harmony. In this paper, the chord and (local) key progressions are estimated using a restricted chord lexicon composed of I = 24 major (M) and minor (m) triads (CM, . . . , BM, Cm, . . . , Bm), and considering 24 possible keys (CM key, . . . , BM key, Cm key, . . . , Bm key) based on the major and harmonic minor scales and the 12 pitches that compose an octave range of Western music. We will also reasonably assume that within a given measure there cannot be any key modulation.

As mentioned in Sec. II-B, chords, keys and the semantic structure are highly interrelated. We propose a model for tonal harmony analysis of audio that takes into account (part of) this complex relational structure. Our model allows a joint estimation of local keys and chords. Moreover, following the idea of designing a "structurally consistent" mid-level representation of music, we show that the MLN framework allows incorporating prior structural information to enhance chord and key estimation in an elegant and flexible way. Although a long-term goal is to develop a fully automatic model that integrates an automatic segmentation, we follow the previous approach for "structurally consistent" analysis [70] and assume that the metrical and semantic structures are known. The segmentation of the song into beats, downbeats and structure is given as prior information.

2) Signal Processing Front-end: The front-end of all the models described in this section is based on the extraction of chroma feature vectors that describe the signal. Chroma vectors are 12-dimensional vectors that represent the intensity of the twelve semitones of the Western tonal music scale, regardless of octave. We perform a beat-synchronous analysis and compute one chroma vector per beat¹⁵.

3) About the Model Parameters: In what follows, the parameters of the models are derived from expert knowledge of music theory. All considered models allow training, but we leave this aspect for future work. In Markov logic, the weights can be learned either generatively or discriminatively. We refer the reader interested in MLN learning to [19], [92], [93].

Note that in the more general case the weights of a MLN have no obvious probabilistic interpretation, since they are interpreted relative to each other when defining the joint probability function. The weight of a clause specifies the probability of a world in which this clause is satisfied relative to a world in which it is not. According to the heuristic discussed in [17], the weight of a formula F is the log odds between a world where F is true and a world where F is false, other things being equal. However, if F shares variables with other formulas (as is typically the case), this correspondence does not hold, as the weight of F is influenced not only by its probability but also by the other formulas that share the same variables. We refer the reader to [83], [94], [95] for more details.

¹⁴ MLNs can also be applied to time-changing domains: dynamic Bayesian networks can be equivalently modeled with MLNs [85]. Approaches such as [5] for music could thus be modeled by a MLN, then possibly enriched, e.g. with longer-term dependencies, as in the application presented in Sec. IV-C3.
¹⁵ This is done by integrating a beat-tracker as a front-end of the system [91]. As a matter of fact, we consider half-beat rather than beat locations, as this was found to give better results because chord changes occur on half beats. For the sake of simplicity, we will nevertheless use the term "beat-level".
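A minimal version of such a front-end can be sketched as follows (a simplified stand-in for the actual beat-tracker-based front-end of [91]; the nearest-pitch-class bin mapping and plain averaging are our own simplifications): each FFT bin of a frame is folded onto its nearest pitch class, and frame-level chroma vectors are then averaged between consecutive beat times.

```python
import numpy as np

def chroma_from_spectrum(mag, sr, n_fft):
    """Fold one frame's magnitude spectrum into a 12-bin chroma vector
    (toy version: each FFT bin is assigned to its nearest pitch class)."""
    freqs = np.arange(1, mag.shape[0]) * sr / n_fft    # skip the DC bin
    midi = 69 + 12 * np.log2(freqs / 440.0)            # frequency -> MIDI pitch
    pc = np.mod(np.round(midi), 12).astype(int)        # pitch class 0..11 (0 = C)
    chroma = np.zeros(12)
    np.add.at(chroma, pc, mag[1:])                     # accumulate bin energy per class
    return chroma

def beat_sync(chromas, frame_times, beat_times):
    """Average frame-level chroma vectors between consecutive beat times and
    normalize, yielding one chroma vector per beat interval."""
    out = []
    for t0, t1 in zip(beat_times[:-1], beat_times[1:]):
        sel = (frame_times >= t0) & (frame_times < t1)
        v = chromas[sel].mean(axis=0) if sel.any() else np.zeros(12)
        out.append(v / v.sum() if v.sum() > 0 else v)
    return np.array(out)

# Quick check with a pure 440 Hz tone: the strongest chroma bin
# should be pitch class 9 (A).
sr, n_fft = 22050, 2048
t = np.arange(n_fft) / sr
frame = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t) * np.hanning(n_fft)))
c = chroma_from_spectrum(frame, sr, n_fft)
print(int(np.argmax(c)))  # -> 9
```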
As seen in Sec. III-E, when translating a HMM or a linear-chain CRF into a MLN, there is a one-to-one correspondence between probabilities and weights in the MLN. When making the model more complex by adding new formulas, there may no longer be a one-to-one correspondence between the weights and the probabilities of formulas. This is why, in general, the weights of a MLN are learned from data. According to [17], a good way to set the weights is to write down the probability with which each formula should hold, treat these as empirical frequencies, and learn the weights from them.

B. HMMs vs. CRFs for Tonal Harmony

1) Baseline HMM for Chord Estimation (HMMChord): We consider here a model for chord estimation that will serve as a baseline for comparison with CRFs and MLNs. We utilize the baseline model for chord estimation proposed in [96], which we briefly describe here.

Let ci, i ∈ [1, 24] denote the 24 chords of the chord lexicon. We observe a succession of 12-dimensional chroma vectors xn = on, n ∈ [0, N − 1], n being the time index and N the total number of beat-synchronous frames of the analyzed song. The chord progression is modeled as an ergodic 24-state HMM with a hidden variable and a single observable variable at each time step. Each hidden state yn, n ∈ [0, N − 1] is a chord ci, i ∈ [1, 24] of the lexicon and is observed through a chroma observation, with emission probability p^obs_HMM(xn ∣ yn). A state-transition matrix based on musical knowledge that reflects chord transition rules is used to model the transition probabilities p^trans_HMM(yn ∣ yn−1)¹⁶.

2) Baseline CRF for Chord Estimation (CRFChord): From Sec. III-C, we equivalently model the previous HMM for chord estimation by a linear-chain CRF in which the observations consist of a unique chroma feature, using a set of binary transition features f^trans_{i,j}(yn−1, yn, xn) with weights w^trans_{i,j} = log(p^trans_HMM(yn = cj ∣ yn−1 = ci)), and a set of observation features f^chroma_{i,o}(yn−1, yn, xn) with weights w^chroma_{i,o} = log(p^obs_HMM(xn = on ∣ yn = ci)).

3) Enriched CRF for Chord Estimation (CRFPriorKey): Some chords are heard as more stable within an established tonal context [98]. Various key templates, which represent the importance of each of the 24 triads within a given key, have been proposed in the literature, such as the set of 24 24-dimensional WMCR key templates T^key_l, l ∈ [1, 24] proposed in [28]. Such prior information about the relationship between keys and chords can be incorporated in the CRF through additional observation features. At each time instant n, an observation is written xn = (on, qn), where on is a chroma feature and qn is a key feature. For the key features, we assume that, at each time step n, the current local key qn = k^l, l ∈ [1, 24] is known. We factorize the observation function of Eq. (11) into two terms:

Φobs(yn, xn) = Φobs(yn, on) ⋅ Φobs(yn, qn)   (17)

Φobs(yn, on) is computed as in the previous section. For Φobs(yn, qn), we add an observation feature that reflects the correlation between the chord being played and the current local key: f^key_{i,l}(yn−1, yn, qn) = 1 if qn = k^l and yn = ci, and 0 otherwise. The key template values T^key_l(i) can be viewed as a correlation measure that indicates the likelihood that the key qn is k^l, l ∈ [1, 24], given that the underlying state yn is the chord ci, i ∈ [1, 24], and are used as weights w^key_{i,l}, i, l ∈ [1, 24] for the key features in the CRF.

4) Inference in HMM and CRF: For the HMM and the linear-chain CRF, the most likely sequence over time can be estimated in a maximum likelihood sense by decoding the underlying sequence of hidden states y from the sequence of observations x using the Viterbi decoding algorithm [26]¹⁷:

ŷ = argmax_y p(y∣x)   (18)

TABLE II
MLN FOR JOINT CHORD, LOCAL KEY AND STRUCTURE DESCRIPTION. THE "X" MARKS ON THE LEFT INDICATE THE PREDICATES AND RULES THAT ARE USED BY EACH MODEL (MLNCHORD, MLNPRIORKEY, MLNLOCALKEY, MLNSTRUCT, MLNMULTISCALE, MLNMULTISCALE-PRIORKEY).

Predicate declarations
// Observed predicates:
x x x x x x   Observation(chroma!, time)
x x           LocKey(key!, time)
x x x x x x   Succbeat(time, time)
x x           Samebar(time, time)
x x x         Succstruct(time, time)
// Unobserved predicates (query):
x x x x x x   State(chord!, time)
x x           LocKey(key!, time)

CHORD RULES
x x x x x x   Prior observation chord probabilities:
              w0chord   State(CM, 0)
              ⋯
              w0chord   State(Bm, 0)
x x x x x x   Probability that the chroma observation has been emitted by a chord:
              w^chroma_{CM,Chroma0}    Observation(Chroma0, n) ∧ State(CM, n)
              w^chroma_{C♯M,Chroma0}   Observation(Chroma0, n) ∧ State(C♯M, n)
              ⋯
              w^chroma_{Bm,ChromaN−1}  Observation(ChromaN−1, n) ∧ State(Bm, n)
x x x x x x   Probability of transition between two successive chords:
              w^trans_{CM,CM}    State(CM, n1) ∧ Succbeat(n2, n1) ∧ State(CM, n2)
              w^trans_{CM,C♯M}   State(CM, n1) ∧ Succbeat(n2, n1) ∧ State(C♯M, n2)
              ⋯
              w^trans_{Bm,Bm}    State(Bm, n1) ∧ Succbeat(n2, n1) ∧ State(Bm, n2)

LOCAL KEY RULES
x x x x       Probability that the key observation has been emitted by a chord:
              w^key_{CMk,CMk}    LocKey(CMk, n) ∧ State(CM, n)
              w^key_{CMk,C♯Mk}   LocKey(C♯Mk, n) ∧ State(CM, n)
              ⋯
              w^key_{Bmk,Bmk}    LocKey(Bmk, n) ∧ State(Bm, n)
x x           Prior observation key probabilities:
              w0key   LocKey(CMk, 0)
              ⋯
              w0key   LocKey(Bmk, 0)
x x           Probability of transition between two successive keys:
              w^transKey_{CMk,CMk}    LocKey(CMk, n1) ∧ Succbeat(n2, n1) ∧ LocKey(CMk, n2)
              w^transKey_{CMk,C♯Mk}   LocKey(CMk, n1) ∧ Succbeat(n2, n1) ∧ LocKey(C♯Mk, n2)
              ⋯
x x           Minimum local key length:
              wkeyDur   LocKey(CMk, n1) ∧ Samebar(n2, n1) ∧ LocKey(CMk, n2)
              ⋯
              wkeyDur   LocKey(Bmk, n1) ∧ Samebar(n2, n1) ∧ LocKey(Bmk, n2)

SEMANTIC STRUCTURE RULES
x x x         Probability that similar segments have the same chord progression:
              wstruct   State(CM, n1) ∧ Succstruct(n2, n1) ∧ State(CM, n2)
              ⋯
              wstruct   State(Bm, n1) ∧ Succstruct(n2, n1) ∧ State(Bm, n2)

C. MLN for Tonal Harmony Analysis

In this section, we start with a basic model for chord recognition that we progressively enrich with additional music

¹⁶ This transition matrix was originally proposed in the context of key estimation [97], but has been used for chords in our previous work [28], [96]. Chords and keys are musical attributes related to the harmonic structure and can be modeled in a similar way.
¹⁷ For the HMM, we use the HMM Matlab toolbox [99]; for the CRF, we use the UGM Matlab toolbox [100].
MANUSCRIPT T-ASL-05679-2016.R1 11

TABLE III
EVIDENCE FOR JOINT CHORD, LOCAL KEY AND STRUCTURE DESCRIPTION. THE "X" MARKS ON THE LEFT INDICATE THE EVIDENCE PREDICATES THAT ARE GIVEN TO EACH OF THE MODELS MLNChord, MLNPriorKey, MLNLocalKey, MLNStruct, MLNMultiScale AND MLNMultiScale-PriorKey.

OBSERVATIONS
x x x x x x  // A chroma vector is observed at each time frame:
             Observation(Chroma0,0)
             ⋯
             Observation(ChromaN-1,N-1)
x x x x x x  // The temporal order of the frames is known:
             Succbeat(1,0)
             ⋯
             Succbeat(N-1,N-2)

ADDITIONAL PRIOR (LOCAL) KEY INFORMATION
x x          // Prior information about the key at each time instant is given:
             LocKey(CMk,0)   (if the key is CM at time instant 0)
             ⋯
             LocKey(GMk,N-1) (if the key is GM at time instant N-1)
x x          // Minimum local key length:
             // Beats [0:3] belong to the same bar and are likely to be in the same key:
             Samebar(1,0)
             Samebar(2,1)
             Samebar(3,2)
             // Beats [4:7] belong to the same bar and are likely to be in the same key:
             Samebar(5,4)
             Samebar(6,5)
             Samebar(7,6)
             ⋯

ADDITIONAL SEMANTIC STRUCTURE PRIOR INFORMATION
x x x        // Prior information about similar segments in the structure:
             Succstruct(1,10)
             Succstruct(2,11)
             ⋯

dimensions and relational structure links. The structure of the domain is represented by a set of weighted logical formulas. In addition to this set of rules, a set of evidence literals represents the observations and prior information. Given this set of rules with attached weights and the set of evidence literals, Maximum A Posteriori (MAP) inference is used to infer the most likely state of the world.

We first describe how the two structures HMMChord and CRFPriorKey can be expressed in a straightforward way using a MLN. We then build a more complex model that incorporates structural information at various time scales. For this, we propose the use of time predicates that indicate links between time instants and thus have time as an argument. We consider three time scales related to the semantic and metrical structures of a music signal:
● The micro-scale corresponds to the beat level and is related to the chord progression. It is associated with the Succbeat(time,time) time predicate, which indicates two successive beat positions;
● The meso-scale corresponds to the bar level and is related to the local key progression. It is associated with the Samebar(time,time) time predicate, which indicates frames belonging to a same bar;
● The macro-scale corresponds to the global structure level. It is associated with the Succstruct(time,time) time predicate, which indexes structurally similar segments.

1) Beat-Synchronous Level: Chord Estimation:
a) Chord Recognition (MLNChord): The chord progression modeled by HMMChord, and consequently by CRFChord of Sections IV-B1 and IV-B2, can be equivalently modeled in the MLN framework by considering three generic formulas, given in Eqs. (19), (20), and (21), which reflect the constraints given by the three distributions defining the generative stochastic process of the HMM. The three generic formulas are described in Tab. II in the section "Chord rules".

Description of the predicates: To model the chord progression at the beat-synchronous frame level, we use an unobservable predicate State(ci,n), meaning that chord ci (which is hidden) is played at frame n, and two observable ones: the predicate Observation(on,n), meaning that we observe chroma on at frame n, and the temporal predicate Succbeat(n2,n1), meaning that n2 and n1 are successive frames. They are also used as evidence; see Tab. III.

Choice of the logical formulas: As detailed in [83], conditional probability distributions p(b∣a) ("if a holds then b holds with some probability") are well represented using conjunctions of the form log(p) a ∧ b that are mutually exclusive (in any possible world, at most one of the formulas is true). In practice, in MLN implementations, the set of formulas must also be exhaustive (exactly one of the formulas should be true for every given world and every binding of the variables)18. For each of the three distributions of the HMM, we use mutually exclusive and exhaustive sets of formulas. This is achieved in Tab. II using the symbol !. When the predicate State(chord!,time) is declared, this means there is one and only one possible chord per time instant. In the same way, because the observation predicate Observation is declared as functional, if Observation(Chroma1,n) is true at time instant n, then Observation(Chroma0,n), Observation(Chroma2,n), etc. are automatically false.

18 Indeed, cases not mentioned in the set of formulas (e.g. not writing down the formula Observation(Chroma1,n) ∧ State(CM,n) with weight w^chroma_CM) obtain by default an implicit weight of 0. As a result, ground formulas not mentioned have a higher probability, since for pi ∈ [0, 1], 0 ≥ log pi.

The prior observation probabilities are described using:

w0^chord State(ci,0)    (19)

for each chord ci, i ∈ [1, 24], and with w0^chord = log p(y0 = ci) denoting a uniform prior distribution of chord symbols.

The conditional observation probabilities are described using a set of conjunctions of the form:

w^chroma_{i,o} Observation(on,n) ∧ State(ci,n)    (20)

for each combination of chroma observation on and chord ci, and with the weights w^chroma_{i,o} defined in Sec. IV-B2.

The transition probabilities are described using:

w^trans_{i,j} State(ci,n1) ∧ Succbeat(n2,n1) ∧ State(cj,n2)    (21)

for all pairs of chords (ci, cj), i, j ∈ [1, 24], and with the weights w^trans_{i,j} defined in Sec. IV-B2.

Evidence consists of a set of ground atoms that give the chroma observations corresponding to each frame, and the temporal succession of frames over time using the beat-level temporal predicate Succbeat. Evidence is described in Tab. III.

b) Incorporating Prior Information About Key (MLNPriorKey): Prior key information can be incorporated in the MLN model, in the same way as in the model

CRFPriorKey described in Sec. IV-B3, by simply adding a new template formula that reflects the impact of the key features. Assuming that, at each time instant, the current local key kl, l ∈ [1, 24], is known, LocKey is added as a functional predicate in Tab. II (LocKey(key!,time)) and given as evidence in the MLN by adding in Tab. III the evidence predicates:

LocKey(kl,0), LocKey(kl,1), ⋯, LocKey(kl,N-1)    (22)

An additional rule about the key and chord relationship is incorporated in the model. For each pair (kl, ci) of key kl, l ∈ [1, 24], and chord ci, i ∈ [1, 24], we add the rule:

w^key_{i,l} LocKey(kl,n) ∧ State(ci,n)    (23)

with the values of the weights w^key_{i,l}, i, l ∈ [1, 24], defined in Sec. IV-B3. This rule "translates" the CRF key observation features.

2) Bar Level: Joint Estimation of Chords and Local Key (MLNLocalKey): Using MLNs, the key can be estimated jointly with the chord progression by simply removing the evidence predicates about key listed in Eq. (22), and by considering the predicate LocKey as a query along with the predicate State. LocKey(key!,time) becomes an unobservable predicate and local keys are estimated from the chords, under the assumption that a chord implies a tonal context.

In addition to the template formula reflecting rules about chord and key relationships, we add rules to model key modulations in the same way that we add chord transition rules (see Eq. (21)). For this, we use the following set of generic formulas (see Tab. II, Sec. "Local key rules"):

w^transKey_{i,j} LocKey(ki,n1) ∧ Succbeat(n2,n1) ∧ LocKey(kj,n2)    (24)

for all pairs of keys (ki, kj), i, j ∈ [1, 24]. Key modulations are modeled similarly to chord transitions and we use w^transKey_{i,j} = w^trans_{i,j} (see footnote 16).

We also add a rule to capture our hypothesis from Sec. IV-A1 that key changes inside a measure are very unlikely. We add evidence indicating frames belonging to a same bar using the temporal predicate Samebar(n2,n1) (see Tab. III). We include in the model the template formula:

wkeyDur LocKey(kl,n1) ∧ Samebar(n2,n1) ∧ LocKey(kl,n2)    (25)

for each key kl, l ∈ [1, 24], and with weight wkeyDur, reflecting how strong the constraint is, set manually. In practice, wkeyDur is a positive value (in our experiments, wkeyDur = −log(0.95)) that disfavors key changes inside a measure.

3) Global Semantic Structure Level (MLNStruct): Following the idea of designing a "structurally consistent" mid-level representation of music [69], we show that prior structural information can be used to enhance chord and key estimation in an elegant and flexible way within the framework of MLNs. As opposed to [70], we do not constrain the model to have the exact same chord progression in all sections of the same type; we only favor the same chord progression for all instances of the same segment type, so that variations between similar segments can be taken into account. Here, we focus on popular music, where pieces can be segmented into specific repetitive segments with labels such as chorus, verse, or refrain. Segments are considered similar if they represent the same musical content, regardless of their instrumentation.

Prior structural information at the global semantic level is incorporated using the time predicate Succstruct. The position of segments of the same type in the song is given as evidence (see Fig. 6 for an example). Let K denote the number of distinct segments. Each segment sk, k ∈ [1, K], may be characterized by its beginning position (in frames) bk ∈ [1, N] and its length in beats lk. For each pair of segments of the same type (sk, sk′), the position of matching beat-synchronous frames (likely to be the same chord type) is given as evidence19:

Succstruct(sk(bk),sk′(bk′)) ⋯ Succstruct(sk(bk+lk-1),sk′(bk′+lk′-1))    (26)

Fig. 6. Position of similar frames within a pair of same segments.

The following set of formulas is added to the Markov logic network to express the constraint that two same segments should have a similar chord progression:

wstruct State(ci,n1) ∧ Succstruct(n2,n1) ∧ State(ci,n2)    (27)

for all chords ci, i ∈ [1, 24], and with weight wstruct, reflecting how strong the constraint is, set manually. In practice, wstruct will be a small positive value (in Sec. V, wstruct = −log(0.95)) to favor similar chord progressions in same segment types.

4) A Multi-scale Tonal Harmony Analysis (MLNMultiScale): The two models MLNLocalKey and MLNStruct can be unified by simply combining all the formulas into the same MLN. In this model, the chord and local key progressions are jointly estimated relying on the metrical and the semantic structure. In Sec. IV-A3, we mentioned that adding formulas that share variables with others has an influence on the weights in the MLN, and that it may be necessary to modify them (in general by training). However, in our case, we obtained good results by combining the rules about local key and structure without changing the weights.

5) Inference in MLN: The inference step consists in computing the answer to a query. Finding the most likely state of the world y consistent with some evidence x is generally known in Bayesian networks as Maximum Probability Explanation (MPE) inference, while in Markov networks it is known as Maximum A Posteriori (MAP) inference. In Markov logic, the problem of finding the most probable configuration of a set of query variables given some evidence reduces to finding the truth assignment that maximizes the sum of the weights of the satisfied clauses:

argmaxy p(y∣x) = argmaxy (1/Z(x)) exp(∑i wi ni(x, y))    (28)

This problem is generally NP-hard. Both exact and approximate weighted satisfiability solvers exist [19], [88], [101]. We use here exact inference with the toulbar2 branch & bound MPE solver [102] implemented in the ProbCog toolbox20.

19 Note that the values sk(bk), …, sk′(bk′+lk′−1) in Eq. (26) correspond to beat time instants. Note also that here lk′ = lk.
20 Although manageable on a standard laptop, the MLN inference step has a high computational cost compared to the Viterbi algorithms for HMM and CRF (≈ 2 min for MLNChord against 6 s for HMMChord for processing 60 s of audio on a MacBook Pro 2.4 GHz Intel Core 2 Duo with 2 GB RAM). We plan to explore the use of approximate algorithms and also to take advantage of the current developments on scalable inference [88]–[90].

TABLE IV
CHORD LABEL ACCURACY (EE) RESULTS.

              Pop test-set     Mozart test-set
HMMChord      74.02 ± 14.61    52.80 ± 6.04
CRFChord      74.14 ± 14.59    52.94 ± 5.69
CRFPriorKey   75.42 ± 14.10    53.59 ± 5.62
MLNChord      74.02 ± 14.61    52.80 ± 6.04
MLNPriorKey   75.31 ± 13.48    53.41 ± 5.76
MLNLocalKey   72.59 ± 14.68    52.62 ± 6.44

V. EVALUATION AND DISCUSSION

In this section, we analyze and compare the performances of the various models on two test-sets of different music styles, annotated by trained musicians and originally proposed in [28].

A. Test-sets and Evaluation Measures

The Mozart test-set consists of 5 movements of Mozart piano sonatas corresponding to 30 minutes of audio. The Pop test-set contains 16 songs from various artists and styles that include pop, rock, electro and salsa. Details can be found in [28] and the annotations are available on demand. As in [103], we map the complex chords in the annotation (such as major and minor 6th, 7th, 9th) to their root triads.

Tonal analysis at the micro- and meso-scale rules (impact of the chord and local key rules) is evaluated on both test-sets, while the incorporation of macro-scale rules (impact of the semantic structure rules) is only evaluated on the Pop test-set, since the considered scenario of incorporating semantic structure rules is not relevant for classical music21.

For chord and key evaluation, we consider label accuracy, which measures how consistent the estimated chord/key is with the ground truth. EE (Exact Estimation) results correspond to the mean and standard deviation of correctly identified chords/keys per song. Parts of the pieces where no key can be labeled (e.g. when a chromatic scale is played) have been ignored in the evaluation, and "non-existing chords" (noise, silent parts or non-harmonic sounds) are unconditionally counted as errors. For local key label accuracy, we also consider the ME score, which gives the estimation rate according to the MIREX 2007 key estimation task22.

Paired-samples t-tests at the 5% significance level are used to measure whether the difference in the results from one method to another is statistically significant or not.

21 This would require a much more complex model since, in classical music, structurally similar parts often present strong variations, such as key modulations, that would need to be taken into account.
22 The score is obtained using the following weights: 1 for correct key estimation, 0.5 for a perfect fifth relationship between the estimated and ground-truth keys, 0.3 for detection of the relative major/minor key, 0.2 for detection of the parallel major/minor key. For more details, see http://www.mirex.org.

B. Beat Level - Equivalence with HMM and CRF

The main interest of the proposed model lies in its simplicity and expressivity for compactly encoding physical content and semantic information in a unified formalism. As an illustration of the theory, results show that the HMM and linear-chain CRF structures can be concisely and elegantly embedded in a MLN. Although the inference algorithms used for each model are different, a song-by-song analysis shows that the chord progressions estimated by the two models are quasi identical, and the difference in the results between HMMChord, CRFChord and MLNChord in Tab. IV is not statistically significant.

To illustrate the flexibility of MLNs, we also tested a scenario where some partial evidence about chords was added, by adding evidence predicates of the form State(cGTi,0), State(cGTi,9), State(cGTi,19), ⋯, State(cGTi,N−1), as prior information of 10% of the ground-truth chords cGTi, i ∈ [1, 24]. We tested this scenario on the Fall Out Boy song This Ain't a Scene, It's an Arms Race, for which the MLNChord estimation results are poor. They were increased from 60.5% to 76.2%, showing how additional evidence can easily be added and have a significant impact.

C. Bar Level - Key as Prior Information or Query

TABLE V
LOCAL KEY EE (EXACT) AND ME (MIREX) ESTIMATION RATES.

                         Pop test-set      Mozart test-set
MLNLocalKey      EE      54.12 ± 34.69     83.06 ± 19.38
                 ME      71.80 ± 35.08     90.61 ± 16.36
[28] best        EE      61.31 ± 36.50     80.21 ± 13.56
                 ME      73.18 ± 27.56     84.81 ± 11.86
MLNMultiScale    EE      59.90 ± 31.50     –
                 ME      76.28 ± 28.19     –

1) Prior Key Information: The MLN formalism incorporates prior information about key in a simple way, with minimal model changes. In general, it improves the chord estimation results (compare lines MLNChord and MLNPriorKey in Tab. IV). Fig. 7 shows an excerpt of the Pink Floyd song Breathe, in the key of E minor. In the first instance of the verse, at [1:15-1:20] min (dashed grey circle on measure D-1), the underlying Em harmony is disturbed by passing notes in the voice and is estimated as EM with MLNChord. Prior key information favors Em chords and removes this error in MLNPriorKey.

Note that the overall improvement is not statistically significant for the Pop test-set, because the WMCR key templates are not adapted to model the chord/key relationships for some of the songs. A detailed discussion on the choice of relevant key templates according to the music genre (out of the scope of this article) can be found in [28].

2) Local Key Estimation: By considering the key as a query (i.e. by simply removing the evidence predicates about key), the model can jointly estimate chords and keys. For our datasets, this does not help improving the chord estimation results on average, which are even degraded for the Pop test-set (see the last line in Tab. IV). This is due to some special musical cases. For instance, the Pink Floyd song Breathe mainly consists of successive AM and Em chords. The correct key should be E Dorian, but modal keys are not modeled here. The algorithm estimates the key of A major almost all the time. As a result, in the jointly estimated chord progression, most of the Em chords are labeled as EM chords (which are more likely than Em chords in the key of A major).

Nevertheless, the key progression can be fairly inferred with our algorithm. In Tab. V, we report the local key estimation results obtained with our algorithm and with a state-of-the-art algorithm tested on the same dataset [28]. In [28], the local key is modeled with a HMM that takes as input either chord or chroma observations. From a modeling perspective, it is difficult to make a fair comparison with [28] because the observations, the key templates and the modeling hypotheses are different from our MLN algorithm. In particular, in [28], the

metrical structure is explicitly taken into account in the model to infer the key progression. But to get an idea of the performances of the proposed model against the state of the art, we report the best results of all the configurations tested in [28].

Local key estimation results are especially high for the Mozart test-set and significantly better than in [28]. Results for the Pop test-set are not as satisfying, again because the WMCR templates do not always reflect accurately the tonal content of the pieces in this test-set23. However, thanks to its flexibility, the MLN allows room for improvement. Indeed, when incorporating information about the structure (see MLNMultiScale in Tab. V), key estimation results are comparable to the state-of-the-art results in [28], and even better for the MIREX score.

Note that local key estimation results without the rule on key transitions (see Eq. (24) in Sec. IV-C2) turned out to be poor. Also, with the chosen parameter settings, the rule that disfavors key changes inside a measure using the predicate Samebar (see the rule expressed by Eq. (25) in Sec. IV-C2) did not have a significant impact on the results, probably because in rule (24) the weight of the clauses corresponding to transitions between two identical keys is already high enough compared to those corresponding to transitions from a key to a different one24.

23 In fact, the results in [28] that we report here for the Pop test-set are obtained with other key templates (the Krumhansl key templates [98]).
24 However, in our experiments, we saw that, as could be expected, decreasing the value of the wkeyDur weight has the effect of favoring key changes inside a measure.

D. Global Semantic Structure Level

TABLE VI
CHORD EE RESULTS WITH SEMANTIC STRUCTURE INFORMATION. Stat. Sig.: STATISTICAL SIGNIFICANCE WITH RESPECT TO MLNStruct.

             EE               Stat. Sig.
MLNChord     74.02 ± 14.61    yes
MLNStruct    75.31 ± 15.62    –
[70]         74.44 ± 15.13    no

1) Structurally Consistent Chord Progression Estimation: In Tab. VI, we compare the results of the model MLNStruct with the baseline MLNChord modified to account for the structure in a similar way to [70], by replacing the chromagram portions of same segment types by their average. The basic signal features (chroma) are the same for both methods.

The proposed approach compactly encodes physical signal content and higher-level semantic information in a unified formalism. Global semantic information can be concisely and elegantly combined with information at the beat-level time-scale, so that chord estimation results are significantly improved and more consistent with the global structure, as illustrated in Fig. 7. For instance (see the plain black rectangles, measures D-1 and D-13), the ground-truth chord of the first bar of the verse is Em. MLNChord correctly estimates the second instance of this chord, but makes an error for the first instance (EM instead of Em). This is corrected by MLNStruct, which favors the same chord progression in same segment types.

The results obtained with the proposed model compare fairly with the previous approach [70]. The difference is not statistically significant, but the proposed model allows for taking into account variations between segments by favoring, instead of exactly constraining, the chord progression to be the same for segments of the same type. For instance, in measures D-6:D-7 and D-18:D-19 of the two verses of Fig. 7 (see the two dashed black rectangles), the position of the D-19:D-20 chord change is more accurate with MLNStruct than with MLNChord, presumably because of the similarity with the D-7:D-8 chord change. Also, it can be seen that the two MLNStruct chord progressions are not exactly the same (compare measure D-6 and its counterpart D-18 in MLNStruct), which illustrates the flexibility of the proposed model (see the four plain grey rectangles for another example). We expect that music styles such as jazz, where repetitions of segments result in more complex variations due to improvisation, would further benefit from the flexibility of the proposed model.

TABLE VII
CHORD EE RESULTS OBTAINED FOR THE Pop test-set WITH A MULTI-SCALE TONAL HARMONY DESCRIPTION.

MLNChord        MLNPriorKey               MLNLocalKey
74.02 ± 14.61   75.31 ± 13.48             72.59 ± 14.68
MLNStruct       MLNMultiScale-PriorKey    MLNMultiScale
75.31 ± 15.62   76.31 ± 13.58             74.34 ± 14.32

2) Multi-scale Tonal Harmony Analysis: The combination of all the previously described rules results in a tonal harmony analysis at multiple temporal and semantic levels. This allows improving the analysis at both the micro and meso time scales. The chord progression estimated with MLNChord is significantly improved with MLNMultiScale-PriorKey. Moreover, the results of MLNMultiScale-PriorKey are also better than both those obtained with MLNStruct and MLNPriorKey, which illustrates the benefit of using multiple cues for the analysis. Also, as seen in Sec. V-C2, incorporating structure information allows significantly improving the local key estimation results. Moreover, some of the errors in the chord progression estimated by MLNLocalKey are removed when incorporating rules on structure information in MLNMultiScale (see Tab. VII).

The two grey dashed rectangles in Fig. 7 (measures D-3 and D-15) illustrate the effect of the combined rules. With MLNChord, the position of the right boundary is not accurate for chord D-3 but is correct for chord D-15. Local key information with MLNPriorKey is not sufficient to correct the D-3 boundary. When a similar chord progression in same segment types is enforced with MLNStruct, the model relies on chord D-3 and the position change is incorrect for both instances. But when combining the two rules in MLNMultiScale, the position of the chord change is correct for both instances.

VI. CONCLUSION AND FUTURE WORK

In this article, we have introduced Markov logic as a formalism that enables intuitive, effective, and expressive reasoning about the complex relational structure and uncertainty of music data. We have shown how MLNs relate to hidden Markov models and conditional random fields, two models that are typically used in the MIR community, especially for sequence labeling tasks. MLNs encompass both HMM and CRF, while being much more flexible and offering interesting new prospects for music processing.

To illustrate the potential of Markov logic networks for music processing, we have progressively designed a model for tonal harmony analysis, starting from a simple HMM. The

Fig. 7. Pink Floyd song Breathe. The first 4 lines (beats, downbeats, chords, structure) correspond to the ground-truth annotations. The others indicate the results obtained with the various proposed models. Analysis in the text: i) dashed grey circle (comparison between MLNChord and MLNPriorKey), ii) plain black rectangles (comparison between MLNChord and MLNStruct), iii) dashed black rectangles and plain grey rectangles (comparison of the flexibility of MLNStruct versus [70]), iv) dashed grey rectangles (combination of all the rules in MLNMultiScale-PriorKey, structure and local key given as prior information).

structure) in a single unified formalism, resulting in a more elegant and flexible model, compared to existing, more ad-hoc approaches. This work is a new step towards a unified multi-scale description of audio, and toward the modeling of complex tasks such as music functional analysis.

The proposed model has great potential for improvement in the future. Context information (metrical structure, instrumentation, chord patterns, etc.) could be compactly and flexibly embedded in the model, moving toward a unified analysis of music content. Here, the relational structure has been derived from background musical knowledge. Learning from labelled examples might overcome some of the shortcomings of the proposed model. The possibility of combining training with expert knowledge [86] may help leverage music complexity.

An appealing property of MLNs is their ability to construct new formulas by learning from the data and to create new predicates by composing base predicates (predicate invention [104]). This should be of particular interest to the MIR community in the coming years, considering the current expansion of annotated databases. As more and more complementary heterogeneous sources of music-related information become available (e.g. video, music sheets, metadata, social tags, etc.), the development of multimodal approaches for music analysis is becoming essential. This aspect should strongly benefit from the use of statistical relational models.

MLNs are currently becoming more and more attractive to many research fields, leading to an increasing number of compelling developments, including interesting connections with deep learning [105] and deep transfer [106]. We believe that MLNs open interesting new perspectives for the field of music content processing.

VII. ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable feedback and suggestions for improvement.

REFERENCES

[1] B. Schuller, "Applications in intelligent music analysis," in Intelligent Audio Analysis. Springer Berlin, 2013, pp. 225–298.
[2] R. Malkin, "Machine listening for context-aware computing," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, 2006.
[3] P. Grosche, M. Müller, and J. Serrà, "Audio content-based music retrieval," in Multimodal Music Processing, 2012, pp. 157–174.
[4] M. Goto, "An audio-based real-time beat tracking system for music with or without drum sounds," J. New Music Res., vol. 30, no. 2, 2001.
[5] M. Mauch and S. Dixon, "Simultaneous estimation of chords and musical context from audio," IEEE Trans. Aud., Sp. and Lang. Proc., vol. 18, no. 6, pp. 1280–1289, 2010.
[6] H. Papadopoulos, "Joint Estimation of Musical Content Information From an Audio Signal," Ph.D. dissertation, Univ. Paris 6, 2010.
[7] J. Paiement, D. Eck, S. Bengio, and D. Barber, "A graphical model for chord progressions embedded in a psychoacoustic space," in ICML, 2005.
[8] J. Prince, M. Schmuckler, and W. Thompson, "The effect of task and pitch structure on pitch-time interactions in music," Memory & Cognition, vol. 37, pp. 368–381, 2009.
[9] N. Nilsson, "Probabilistic logic," Artif. Intell., vol. 28, pp. 71–87, 1986.
[10] J. Halpern, "An analysis of first-order logics of probability," in IJCAI, vol. 46, 1989.
[11] L. Getoor and B. Taskar, Introduction to Statistical Relational Learning. The MIT Press, 608 p., 2007.
[12] L. de Raedt and K. Kersting, "Probabilistic inductive logic programming," in Probabilistic Inductive Logic Programming. Springer Berlin, 2008, vol. 4911, pp. 1–27.
[13] B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in UAI, 2002.
[14] L. Getoor, "Learning statistical models from relational data," Ph.D. dissertation, Stanford, 2001.
[15] L. Ngo and P. Haddawy, "Probabilistic logic programming and bayesian networks," in ASIAN, 1995.
[16] K. Kersting and L. de Raedt, Towards Combining Inductive Logic Programming with Bayesian Networks. Springer Berlin, 2001, vol. 2157, pp. 118–131.
[17] M. Richardson and P. Domingos, "Markov logic networks," Mach. Learn., vol. 62, pp. 107–136, 2006.
[18] P. Domingos, S. Kok, H. Poon, M. Richardson, and P. Singla, "Unifying logical and statistical AI," in AAAI, 2006.
[19] P. Domingos and D. Lowd, Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool, San Rafael, 2009.
[20] R. Crane and L. McDowell, "Investigating markov logic networks for collective classification," in ICAART, 2012.
[21] S. Riedel and I. Meza-Ruiz, "Collective semantic role labelling with markov logic," in CoNLL, 2008.
[22] H. Papadopoulos and G. Tzanetakis, "Modeling chord and key structure with markov logic," in ISMIR, 2012.
[23] ——, "Exploiting structural relationships in audio signals of music using markov logic," in ICASSP, 2013.
[24] D. Radicioni and R. Esposito, "A conditional model for tonal analysis." Springer Berlin, 2006, vol. 4203, pp. 652–661.
[25] D. Temperley, The Cognition of Basic Musical Structures. MIT, Cambridge, 2001.
[26] L. Rabiner, "A tutorial on HMM and selected applications in speech," Proc. IEEE, vol. 77, no. 2, 1989.
[27] A. Sheh and D. Ellis, "Chord segmentation and recognition using EM-trained HMM," in ISMIR, 2003.
[28] H. Papadopoulos and G. Peeters, "Local Key Estimation from an Audio Signal Relying on Harmonic and Metrical Structures," IEEE Trans. Aud., Sp. and Lang. Proc., 2011.
[29] A. Klapuri, A. Eronen, and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Trans. Aud., Sp. and Lang. Proc., vol. 14, no. 1, pp. 342–355, 2006.
[30] C. Raphael, "Automatic segmentation of acoustic musical signals using HMMs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 360–370, 1999.
[31] M. Ryynänen and A. Klapuri, "Transcription of the singing melody in polyphonic music," in ISMIR, 2006.
[32] A. Cont, "A coupled duration-focused architecture for real-time music-to-score alignment," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 6, pp. 974–987, 2010.
[33] R. Scholz, E. Vincent, and F. Bimbot, "Robust modeling of musical chord sequences using probabilistic n-grams," in ICASSP, 2008.
[34] K. Murphy, Machine Learning: A Probabilistic Perspective, ser. Adaptive Computation and Machine Learning. MIT Press, 2012.
16 MANUSCRIPT T-ASL-05679-2016.R1
[35] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in ICML, 2001.
[36] S. Essid, "A tutorial on conditional random fields with applications to music analysis," 2013, presented at ISMIR.
[37] J. Burgoyne, L. Pugin, C. Kereliuk, and I. Fujinaga, "A cross-validated study of modeling strategies for automatic chord recognition in audio," in ISMIR, 2007.
[38] T. Fillon, C. Joder, S. Durand, and S. Essid, "A conditional random field system for beat tracking," in ICASSP, 2015.
[39] M. McVicar, R. Santos-Rodriguez, and T. De Bie, "Learning to separate vocals from polyphonic mixtures via ensemble methods and structured output prediction," in ICASSP, 2016.
[40] C. Joder, S. Essid, and G. Richard, "A conditional random field viewpoint of symbolic audio-to-score matching," in ACM, 2010.
[41] ——, "A conditional random field framework for robust and scalable audio-to-score matching," IEEE Trans. Aud., Sp. and Lang. Proc., vol. 19, no. 8, pp. 2385–2397, 2011.
[42] E. Benetos and S. Dixon, "Joint multi-pitch detection using harmonic envelope estimation for polyphonic music transcription," IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1111–1123, 2011.
[43] Z. Duan, L. Lu, and C. Zhang, "Collective annotation of music from multiple semantic categories," in ISMIR, 2008.
[44] E. Schmidt and Y. Kim, "Modeling musical emotion dynamics with conditional random fields," in ISMIR, 2011.
[45] K. Sumi, M. Arai, T. Fujishima, and S. Hashimoto, "A music retrieval system using chroma and pitch features based on conditional random fields," in ICASSP, 2012.
[46] A. Ramakrishnan, S. Kuppan, and S. Devi, "Automatic generation of Tamil lyrics for melodies," in CALC, 2009.
[47] R. Ramirez, A. Hazan, E. Maestre, X. Serra, V. Petrushin, and L. Khan, A Data Mining Approach to Expressive Music Performance Modeling. Springer London, 2007, pp. 362–380.
[48] A. Anglade and S. Dixon, "Towards logic-based representations of musical harmony for classification, retrieval and knowledge discovery," in MML, 2008.
[49] S. Muggleton, "Inductive logic programming," New Generat. Comput., vol. 8, pp. 295–318, 1991.
[50] E. Morales and R. Morales, "Learning musical rules," in IJCAI, 1995.
[51] E. Morales, "PAL: A pattern-based first-order inductive system," Mach. Learn., vol. 26, no. 2, pp. 227–252, 1997.
[52] R. Ramirez and C. Palamidessi, Inducing Musical Rules with ILP. Springer Berlin, 2003, vol. 2916, pp. 502–504.
[53] A. Anglade and S. Dixon, "Characterisation of harmony with inductive logic programming," in ISMIR, 2008.
[54] A. Anglade, R. Ramirez, and S. Dixon, "Genre classification using harmony rules induced from automatic chord transcriptions," in ISMIR, 2009.
[55] A. Anglade, E. Benetos, M. Mauch, and S. Dixon, "Improving music genre classification using automatically induced harmony rules," J. New Music Res., vol. 39, no. 4, pp. 349–361, 2010.
[56] G. Widmer, "Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries," Artif. Intell., vol. 146, no. 2, pp. 129–148, 2003.
[57] M. Dovey, N. Lavrac, and S. Wrobel, Analysis of Rachmaninoff's Piano Performances Using ILP. Springer Berlin, 1995, vol. 912, pp. 279–282.
[58] E. Van Baelen, L. De Raedt, and S. Muggleton, Analysis and Prediction of Piano Performances Using Inductive Logic Programming. Springer Berlin, 1997, vol. 1314, pp. 55–71.
[59] M. McVicar, R. Santos-Rodriguez, Y. Ni, and T. De Bie, "Automatic chord estimation from audio: A review of the state of the art," IEEE Trans. Aud., Sp. and Lang. Proc., vol. 22, no. 2, pp. 556–575, 2014.
[60] C.-H. Chuan and E. Chew, "The KUSC classical music dataset for audio key finding," IJMA, vol. 6, no. 4, pp. 1–18, 2014.
[61] J. Paulus, M. Müller, and A. Klapuri, "State of the art report: Audio-based music structure analysis," in ISMIR, 2010.
[62] L. Euler, An Attempt at a New Theory of Music, Exposed in All Clearness According to the Most Well-Founded Principles of Harmony. Saint Petersburg Academy, 1739.
[63] K. Noland and M. Sandler, "Signal processing parameters for tonality estimation," in AES, 2007.
[64] A. Shenoy, R. Mohapatra, and Y. Wang, "Key determination of acoustic musical signals," in ICME, 2004.
[65] C. Raphael and J. Stoddard, "Harmonic analysis with probabilistic graphical models," in ISMIR, 2003.
[66] C. Weiss, "Global key extraction from classical music audio recordings based on the final chord," in SMC, 2013.
[67] B. Catteau, J. Martens, and M. Leman, "A probabilistic framework for audio-based tonal key and chord recognition," in GFKL, 2007.
[68] T. Rocher, M. Robine, P. Hanna, and L. Oudre, "Concurrent estimation of chords and keys from audio," in ISMIR, 2010.
[69] R. Dannenberg, "Toward automated holistic beat tracking, music analysis, and understanding," in ISMIR, 2005.
[70] M. Mauch, K. Noland, and S. Dixon, "Using musical structure to enhance automatic chord transcription," in ISMIR, 2009.
[71] T. Fujishima, "Real-time chord recognition of musical sound: a system using common lisp music," in ICMC, 1999.
[72] M. Bartsch and G. Wakefield, "To catch a chorus using chroma-based representations for audio thumbnailing," in WASPAA, 2001.
[73] R. Klinger and K. Tomanek, "Classical probabilistic models and conditional random fields," Department of Computer Science, Dortmund University of Technology, Tech. Rep. TR07-2-013, 2007.
[74] C. Sutton and A. McCallum, "An introduction to conditional random fields," Foundations and Trends in Machine Learning, vol. 4, no. 4, pp. 267–373, 2012.
[75] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.
[76] C. Sutton and A. McCallum, "An introduction to conditional random fields for relational learning," in Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, Eds. MIT Press, 2007.
[77] S. Sarawagi and W. Cohen, "Semi-Markov conditional random fields for information extraction," in NIPS, 2004.
[78] G. Zweig and P. Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," in ASRU, 2009.
[79] S. Haack, Philosophy of Logics. Cambridge Univ. Press, NY, 1978.
[80] D. Leivant, "Higher order logic," in Handbook of Logic in Artificial Intelligence and Logic Programming, Volume 2, Deduction Methodologies. Clarendon Press, 1994, pp. 229–322.
[81] P. Singla and P. Domingos, "Markov logic in infinite domains," in UAI, 2007.
[82] M. Genesereth and N. Nilsson, Logical Foundations of Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, 1987.
[83] D. Jain, "Knowledge engineering with Markov logic networks: A review," in DKB, 2011.
[84] G. Cheng, Y. Wan, B. Buckles, and Y. Huang, "An introduction to Markov logic networks and application in video activity analysis," in ICCCNT, 2014.
[85] T. Papai, H. Kautz, and D. Stefankovic, "Slice normalized dynamic Markov logic networks," in NIPS, 2012.
[86] T. Pápai, S. Ghosh, and H. Kautz, "Combining subjective probabilities and data in training Markov logic networks," in ECMLPKDD, 2012.
[87] L. Snidaro, I. Visentini, and K. Bryan, "Fusing uncertain knowledge and evidence for maritime situational awareness via Markov logic networks," Information Fusion, vol. 21, pp. 159–172, 2015.
[88] I. Beltagy and R. Mooney, "Efficient Markov logic inference for natural language semantics," in AAAI, 2014.
[89] D. Venugopal, "Scalable inference techniques for Markov logic," Ph.D. dissertation, University of Texas at Dallas, 2015.
[90] S. Sarkhel, D. Venugopal, T. Pham, P. Singla, and V. Gogate, "Scalable training of Markov logic networks using approximate counting," in AAAI, 2016.
[91] G. Peeters and H. Papadopoulos, "Simultaneous beat and downbeat-tracking using a probabilistic framework: theory and large-scale evaluation," IEEE Trans. Audio, Speech, Lang. Proc., vol. 19, no. 6, 2011.
[92] P. Singla and P. Domingos, "Discriminative training of Markov logic networks," in AAAI, 2005.
[93] D. Lowd and P. Domingos, "Efficient weight learning for Markov logic networks," in Knowledge Discovery in Databases: PKDD 2007. Springer Berlin, 2007, vol. 4702, pp. 200–211.
[94] J. Fisseler, "Toward Markov logic with conditional probabilities," in FLAIRS, 2008.
[95] M. Thimm, "Coherence and compatibility of Markov logic networks," in ECAI, 2014.
[96] H. Papadopoulos and G. Peeters, "Joint estimation of chords and downbeats," IEEE Trans. Aud., Sp. and Lang. Proc., vol. 19, no. 1, pp. 138–152, 2011.
[97] K. Noland and M. Sandler, "Key estimation using a HMM," in ISMIR, 2006.
[98] C. Krumhansl, Cognitive Foundations of Musical Pitch. New York, NY, USA: Oxford University Press, 1990.
[99] K. Murphy, "HMM Toolbox for Matlab," http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html, 2005.
[100] M. Schmidt, "UGM: A Matlab toolbox for probabilistic undirected graphical models," http://www.cs.ubc.ca/~schmidtm/Software/UGM.html, 2007.
[101] F. Niu, C. Ré, A. Doan, and J. Shavlik, "Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS," PVLDB, vol. 4, no. 6, pp. 373–384, 2011.
[102] D. Allouche, S. de Givry, and T. Schiex, "Toulbar2, an open source exact cost function network solver," INRA, Tech. Rep., 2010.
[103] C. Harte, M. Sandler, S. Abdallah, and E. Gómez, "Symbolic representation of musical chords: a proposed syntax for text annotations," in ISMIR, 2005.
[104] S. Kok and P. Domingos, "Statistical predicate invention," in ICML, 2007.
[105] H. Poon and P. Domingos, "Sum-product networks: A new deep architecture," in UAI, 2011.
[106] J. Davis and P. Domingos, "Deep transfer via second-order Markov logic," in ICML, 2009.