Signal Processing Methods for Music Transcription
Anssi Klapuri
Manuel Davy
Editors
Springer
Anssi Klapuri
Tampere University of Technology
Institute of Signal Processing
Korkeakoulunkatu 1
33720 Tampere, Finland
[email protected]

Manuel Davy
LAGIS/CNRS
Ecole Centrale de Lille
Cité Scientifique, BP 48
59651 Villeneuve d'Ascq Cedex, France
[email protected]
ISBN-13: 978-0-387-30667-4
springer.com
Contents
Preface ix
List of Contributors xi
Part I Foundations
12 Singing Transcription
Matti Ryynänen 361
12.1 Introduction 361
12.2 Singing Signals 364
12.3 Feature Extraction 368
12.4 Converting Features into Note Sequences 375
12.5 Summary and Discussion 390
References 391
Index 429
Preface
Foundations
Introduction to Music Transcription
Anssi Klapuri
Fig. 1.1. An acoustic musical signal (top) and its time-frequency domain represen-
tation (bottom). The excerpt is from Song G034 in the RWC database [230].
Fig. 1.2. Musical notation corresponding to the signal in Fig. 1.1. The upper staff
lines show the notation for pitched musical instruments and the lower staff lines
show the notation for percussion instruments.
Fig. 1.3. A 'piano-roll' illustration of a MIDI file which corresponds to the pitched
instruments in the signal in Fig. 1.1. Different notes are arranged on the vertical
axis and time flows from left to right.
Klapuri [354], and for beat tracking by Scheirer [564], for example. Another
prominent approach has been to model the human auditory scene analysis
(ASA) ability. The term ASA refers to the way in which humans organize
spectral components to their respective sound sources and recognize simul-
taneously occurring sounds [49]. The principles of ASA were brought to the
pitch analysis of polyphonic music signals by Mellinger [460] and Kashino
[333], and later by Godsmark and Brown [215] and Sterian [609]. Most re-
cently, several unsupervised learning methods have been proposed where a
minimal number of prior assumptions are made about the analysed signal.
Methods based on independent component analysis [304] were introduced to
music transcription by Casey [70], [73], and various other methods were later
proposed by Lepain [403], Smaragdis [598], [600], Abdallah [2], [5], Virtanen
(see Chapter 9), FitzGerald [186], [188], and Paulus [505]. Of course, there
are also methods that do not represent any of the above-mentioned trends,
and a more comprehensive review of the literature is presented in the coming
chapters.
The state-of-the-art music transcription systems are still clearly inferior to
skilled human musicians in accuracy and flexibility. That is, a reliable general-
purpose transcription system does not exist at the present time. However,
some degree of success has been achieved for polyphonic music of limited
complexity. In the transcription of pitched instruments, typical restrictions
are that the number of concurrent sounds is limited [627], [122], interference of
drums and percussive sounds is not allowed [324], or only a specific instrument
is considered [434]. Some promising results for the transcription of real-world
music on CD recordings have been demonstrated by Goto [223] and Ryynänen
and Klapuri [559]. In percussion transcription, quite good accuracy has been
achieved in the transcription of percussive tracks which comprise a limited
number of instruments (typically bass drum, snare, and hi-hat) and no pitched
instruments [209], [505]. Promising results have also been reported for the
transcription of the bass and snare drums on real-world recordings, but this
is a more open problem (see e.g. Zils et al. [693], FitzGerald et al. [189],
Yoshii et al. [683]). Beat tracking of complex real-world audio signals can
be performed quite reliably with the state-of-the-art methods, but difficulties
remain especially in the analysis of classical music and rhythmically complex
material. Comparative evaluations of beat-tracking systems can be found in
[266], [349], [248]. Research on musical instrument classification has mostly
concentrated on working with isolated sounds, although more recently this
has been attempted in polyphonic audio signals, too [331], [33], [170], [647].
to discuss the perceptual attributes of sounds of which they consist. There are
four subjective qualities that are particularly useful in characterizing sound
events: pitch, loudness, duration, and timbre [550].
Pitch is a perceptual attribute which allows the ordering of sounds on
a frequency-related scale extending from low to high. More exactly, pitch is
defined as the frequency of a sine wave that is matched to the target sound
by human listeners [275]. Fundamental frequency (F0) is the corresponding
physical term and is defined for periodic or nearly periodic sounds only. For
these classes of sounds, F0 is defined as the inverse of the period and is closely
related to pitch. In ambiguous situations, the period corresponding to the
perceived pitch is chosen.
The perceived loudness of an acoustic signal has a non-trivial connection
to its physical properties, and computational models of loudness perception
constitute a fundamental part of psychoacoustics [523]. In music processing,
however, it is often more convenient to express the level of sounds with their
mean-square power and to apply a logarithmic (decibel) scale to deal with the
wide dynamic range involved. The perceived duration of a sound has a more or
less one-to-one mapping to its physical duration in cases where this can be
unambiguously determined.
Timbre is sometimes referred to as sound 'colour' and is closely related to
the recognition of sound sources [271]. For example, the sounds of the violin
and the flute may be identical in their pitch, loudness, and duration, but are
still easily distinguished by their timbre. The concept is not explained by any
simple acoustic property but depends mainly on the coarse spectral energy
distribution of a sound, and the time evolution of this. Whereas pitch, loud-
ness, and duration can be quite naturally encoded into a single scalar value,
timbre is essentially a multidimensional concept and is typically represented
with a feature vector in musical signal analysis tasks.
Musical information is generally encoded into the relationships between
individual sound events and between larger entities composed of these. Pitch
relationships are utilized to make up melodies and chords. Timbre and loud-
ness relationships are used to create musical form especially in percussive
music, where pitched musical instruments are not necessarily employed at
all. Inter-onset interval (IOI) relationships, in turn, largely define the rhyth-
mic characteristics of a melody or a percussive sound sequence (the term IOI
refers to the time interval between the beginnings of two sound events). Al-
though durations of the sounds play a role too, the IOIs are more crucial in
determining the perceived rhythm [93]. Indeed, many rhythmically important
instruments, such as drums and percussions, produce exponentially decaying
wave shapes that do not even have a uniquely defined duration. In the case of
Fig. 1.4. Illustration of the piano keyboard (only three octaves are shown here).
sustained musical sounds, however, the durations are used to control articu-
lation. The two extremes here are 'staccato', where notes are cut very short,
and 'legato', where no perceptible gaps are left between successive notes.
A melody is a series of pitched sounds with musically meaningful pitch
and lOI relationships. In written music, this corresponds to a sequence of
single notes. A chord is a combination of two or more simultaneous notes.
A chord can be harmonious or dissonant, subjective attributes related to the
specific relationships between the component pitches and their overtone par-
tials. Harmony refers to the part of music theory which studies the formation
and relationships of chords.
Western music arranges notes on a quantized logarithmic scale, with 12
notes in each octave range. The nominal fundamental frequency of note n can
be calculated as $440\,\mathrm{Hz} \times 2^{n/12}$, where 440 Hz is an agreed-upon anchor point
for the tuning and $n$ varies from $-48$ to $39$ on a standard piano keyboard,
for example. According to a musical convention, the notes in each octave are
lettered as C, C#, D, D#, E, F, ... (see Fig. 1.4) and the octave is indicated
with a number following this, for example A4 and A3 referring to the notes
with fundamental frequencies 440 Hz and 220 Hz, respectively.
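As a concrete illustration of this convention, the following minimal Python sketch (helper names are hypothetical and not part of this book) maps a note index n, counted in semitones from A4, to its nominal fundamental frequency and letter name.

```python
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def note_frequency(n: int, tuning: float = 440.0) -> float:
    """Nominal fundamental frequency of the note n semitones above (or below) A4."""
    return tuning * 2.0 ** (n / 12.0)

def note_name(n: int) -> str:
    """Letter-plus-octave name of the note n semitones away from A4 (e.g. A4, C#5)."""
    midi = 69 + n                    # MIDI numbering places A4 at 69
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

if __name__ == "__main__":
    for n in (-48, -12, 0, 39):      # the range of a standard piano keyboard
        print(note_name(n), f"{note_frequency(n):.2f} Hz")
```

Running the sketch prints A0 at 27.50 Hz, A3 at 220.00 Hz, A4 at 440.00 Hz, and C8 at 4186.01 Hz, matching the tuning formula above.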
There are of course instruments which produce arbitrary pitch values and
not just discrete notes like the piano. When playing the violin or singing, for
example, both intentional and unintentional deviations take place from the
nominal note pitches. In order to write down the music in a symbolic form,
it is necessary to perform quantization or perceptual categorization [60]: a
track of pitch values is segmented into notes with discrete pitch labels, note
timings are quantized to quarter notes, whole notes, and so forth, and timbral
information is 'quantized' by naming the sound sources involved. In some
cases this is not necessary but a parametric or semi-symbolic representation
suffices (in a MIDI file, for example, the time values are not quantized).
An important property of basically all musical cultures is that correspond-
ing notes in different octaves are perceived as having a special kind of sim-
ilarity, independent of their separation in frequency. The notes C3, C4, and
C5, for example, play largely the same harmonic role although they are not
interchangeable in a melody. Therefore the set of all notes can be described
as representing only 12 pitch classes. An individual musical piece usually re-
cruits only a subset of the 12 pitch classes, depending on the musical key of
the piece. For example, a piece in the C major key tends to employ the white
keys of the piano, whereas a piece in B major typically employs all the black
keys but only two white keys in each octave. Usually there are seven pitch
classes that 'belong' to a given key. These are called scale tones and they pos-
sess a varying degree of importance or stability in the key context. The most
important is the tonic note (for example C in the C major key) and often a
musical piece starts or ends on the tonic. Perception of pitch along musical
scales and in relation to the musical key of the piece is characteristic of tonal
music, to which most of Western music belongs [377].
The term musical metre has to do with the rhythmic aspects of music: it
refers to the regular pattern of strong and weak beats in a piece. Perceiving the
metre consists of detecting moments of musical emphasis in an acoustic signal
and filtering them so that the underlying periodicities are discovered [404],
[93]. The perceived periodicities, pulses, at different time scales (or levels)
together constitute the metre, as illustrated in Fig. 1.5. Perceptually the most
salient metrical level is the tactus, which is often referred to as the foot-tapping
rate or the beat. The tactus can be viewed as the temporal 'backbone' of a piece
of music, making beat tracking an important subtask of music transcription.
Further metrical analysis aims at identifying the other pulse levels, the periods
of which are generally integer multiples or submultiples of the tactus pulse.
For example, detecting the musical measure pulse consists of determining the
number of tactus beats that elapse within one musical measure (usually 2
to 8) and aligning the boundaries of the musical measures (bar lines) to the
music signal.
Another element of musical rhythms is grouping, which refers to the way in
which individual sounds are perceived as being grouped into melodic phrases;
these are further grouped into larger musical entities in a hierarchical manner
[404]. Important to the rhythmic characteristics of a piece of music is how
these groups are aligned in time with respect to the metrical system.
The structure of a musical work refers to the way in which it can be sub-
divided into parts and sections at the largest time-scale. In popular music,
for example, it is usually possible to identify parts that we label as the cho-
rus, the verse, an introductory section, and so forth. Structural parts can be
detected by finding relatively long repeated pitch structures or by observing
considerable changes in the instrumentation at section boundaries.
The forthcoming chapters of this book address the extraction and analysis
of the above elements in musical audio signals. Fundamental frequency esti-
mation is considered in Parts III and IV of this book, with a separate treatise
on melody transcription in Chapters 11 and 12. Metre analysis is discussed
in Chapter 4 and percussion transcription in Chapter 5. Chapter 6 discusses
the measurement of timbre and musical instrument classification. Structure
analysis is addressed in Chapter 11, and the quantization of time and pitch
in Chapters 4 and 12, respectively. Before going to a more detailed outline
of each chapter, however, let us have a look at some general aspects of the
transcription problem.
1.2.1 Neurophysiological Perspective
the notes, and finally to apply the rhythm [206, p. 6]. Obviously, ear training
presumes a normally hearing subject who is able to detect distinct sounds and
their pitch and timing in the played excerpts, aspects which are very difficult
to model computationally.
Recently, Hainsworth conducted a study where he asked trained musicians
to describe how they transcribe realistic musical material [263]. The subjects
(19 in total) had transcribed music from various genres and with varying goals,
but Hainsworth reports that a consistent pattern emerged in the responses.
Most musicians first write down the structure of the piece, possibly with some
key phrases marked in an approximate way. Next, the chords of the piece or
the bass line are notated, and this is followed by the melody. As the last step,
the inner lines are studied. Many reported that they heard these by repeated
listening, by using an instrument as an aid, or by making musically educated
guesses based on the context.
Hainsworth points out certain characteristics of the above-described pro-
cess. First, it is sequential rather than concurrent; quoting the author, 'no-
one transcribes anything but the most simple music in a single pass'. In this
respect, the process differs from most computational transcription systems.
Secondly, the process relies on the human ability to attend to certain parts
of a polyphonic signal while selectively ignoring others.^ Thirdly, some early
analysis steps appear to be so trivial for humans that they are not even men-
tioned. Among these are style detection (causing prior expectations regarding
the content), instrument identification, and beat tracking.
^We may add also that the limitations of human memory and attention affect
the way in which large amounts of data are written down [602].
Fig. 1.6. Three different mid-level representations for a short trumpet sound (F0
260 Hz), followed by a snare drum hit. The left panel shows the time-frequency spec-
trogram with a logarithmic frequency scale. The middle panel shows the sinusoidal
model for the same signal, line width indicating the amplitude of each sinusoid. The
right panel shows the output of a simple peripheral auditory model for the same
signal.
transcription where both linear [188] and logarithmic [505], [209] frequency
resolutions have been used.
Another common choice for a mid-level representation in music transcrip-
tion has been the one based on sinusoid tracks [332], [440], [609], [652]. In
this parametric representation, an acoustic signal is modelled as a sum of
sinusoids with time-varying frequencies and amplitudes [449], [575], as illus-
trated in Fig. 1.6. Pitched musical instruments can be modelled effectively
with relatively few sinusoids and, ideally, the representation supports sound
source separation by classifying the sinusoids to their sources. However, this is
complicated by the fact that frequency components of co-occurring sounds in
music often overlap in time and frequency. Also, reliable extraction of the com-
ponents in real-world complex music signals can be hard. Sinusoidal models
are described in Chapter 3 and applied in Chapters 7 and 10.
In the human auditory system, the signal travelling from the inner ear to
the brain can be viewed as a mid-level representation. A nice thing about this
is that the peripheral parts of hearing are quite well known and computational
models exist which are capable of approximating the signal in the auditory
nerve to a high accuracy. The right panel of Fig. 1.6 illustrates this repre-
sentation. Auditory models have been used for music transcription by several
authors [439], [627], [434], [354] and these are further discussed in Chapter 8.
It is natural to ask if a certain mid-level representation is better than
others in a given task. Ellis and Rosenthal have discussed this question in
the light of several example representations commonly used in acoustic signal
analysis [173]. The authors list several desirable qualities for a mid-level rep-
resentation. Among these are component reduction, meaning that the number
of objects in the representation is smaller and the meaningfulness of each is
higher compared to the individual samples of the input signal. At the same
Fig. 1.7. The two main sources of information in music transcription: an acoustic
input signal and pre-stored musicological and sound source models.
this are discussed in more detail in Part IV of this book. The term top-
down processing is often used to characterize systems where models at a high
abstraction level impose constraints on the lower levels [592], [172]. In bottom-
up processing, in turn, information flows from the acoustic signal: features
are extracted, combined into sound sources, and these are further processed
at higher levels. The 'unsupervised-learning' approach mentioned on p. 7 is
characterized by bottom-up processing and a minimal use of pre-stored models
and assumptions. This approach has a certain appeal too, since music signals
are redundant at many levels and, in theory, it might be possible to resolve this
'puzzle' in a completely data-driven manner by analysing a huge collection of
musical pieces in connection and by constructing models automatically from
the data. For further discussion of this approach, see Chapter 9.
Utilizing diverse sources of knowledge in the analysis raises the issue of
integrating the information meaningfully. In automatic speech recognition,
statistical methods have been very successful in this respect: they allow rep-
resenting uncertain knowledge, learning from examples, and combining diverse
types of information.
1.3 Outline
This section discusses the different subtopics of music transcription and sum-
marizes the contents of each chapter of this book. All the chapters are in-
tended to be self-contained entities, and in principle nothing prevents one
from jumping directly to the beginning of a chapter that is of special interest
to the reader. Whenever some element from the other parts of the book is
needed, an explicit reference is made to the chapter in question.
Part I Foundations
The first part of this book is dedicated to topics that are more or less related
to all areas of music transcription discussed in this book.
Chapter 2 introduces statistical and signal processing techniques that are
applied to music transcription in the subsequent chapters. First, the Fourier
transform and concepts related to time-frequency representations are de-
scribed. This is followed by a discussion of statistical methods, including
random variables, probability density functions, probabilistic models, and el-
ements of estimation theory. Bayesian estimation methods are separately dis-
cussed and numerical computation techniques are described, including Monte
Carlo methods. The last section introduces the reader to pattern recognition
methods and various concepts related to these. Widely used techniques such
as support vector machines and hidden Markov models are included.
Chapter 3 discusses sparse adaptive representations for musical signals.
The issue of data representations was already briefly touched upon in Section 1.2.3
above. This chapter describes parametric representations (for example the si-
nusoidal model) and 'waveform' representations in which a signal is modelled
as a linear sum of elementary waveforms chosen from a well-defined dictio-
nary. In particular, signal-adaptive algorithms are discussed which aim at
sparse representations, meaning that a small subset of waveforms is chosen
from a large dictionary so that the sound is represented effectively. This is
advantageous from the viewpoint of signal analysis and imposes an implicit
structure to the analysed signal.
The second part of this book describes methods for metre analysis, percussion
transcription, and pitched musical instrument classification.
Chapter 4 discusses beat tracking and musical metre analysis, which con-
stitute an important subtask of music transcription. As mentioned on p. 10,
metre perception consists of detecting moments of musical stress in an audio
signal, and processing these so that the underlying periodicities are discov-
ered. These two steps can also be discerned in the computational methods.
Measuring the degree of musical emphasis as a function of time is closely re-
lated to onset detection, that is, to the detection of the beginnings of discrete
sound events in an acoustic signal, a problem which is separately discussed.
For the estimation of the underlying metrical pulses, a number of different
approaches are described, putting particular emphasis on statistical methods.
Chapter 5 discusses unpitched percussion transcription,^ where the aim
is to write down the timbre class, or the sound source, of each constituent
sound along with its timing (see Fig. 1.2 above). The methods discussed in
this chapter represent two main approaches. In one, a percussive track is as-
sumed to be performed using a conventional set of drums, such as bass drums,
snares, hi-hats, cymbals, tom-toms, and so forth, and the transcription pro-
ceeds by detecting distinct sound events and by classifying them into these
pre-defined categories. In another approach, no assumptions are made about
the employed instrumental sounds, but these are learned from the input signal
in an unsupervised manner, along with their occurrence times and gains. This
is accomplished by processing a longer portion of the signal in connection and
by trying to find such source signals that the percussive track can be effec-
tively represented as a linear mixture of them. Percussion transcription both
in the presence and absence of pitched instruments is discussed.
Chapter 6 is concerned with the classification of pitched musical instru-
ment sounds. This is useful for music information retrieval purposes, and in
music transcription, it is often desirable to assign individual note events into
'streams' that can be attributed to a certain instrument. The chapter looks at
the acoustics of musical instruments, timbre perception in humans, and basic
concepts related to classification in general. A number of acoustic descriptors,
or features, are described that have been found useful in musical instrument
classification. Then, different classification methods are described and com-
pared, complementing those described in Chapter 2. Classifying individual
musical sounds in polyphonic music usually requires that they are separated
from the mixture signal to some degree. Although this is usually seen as a
separate task from the actual instrument classification, some methods for in-
strument classification in complex music signals are reviewed, too.
The term multiple F0 estimation refers to the estimation of the F0s of several
concurrent sounds in an acoustic signal. The third part of this book describes
^Many drum instruments can be tuned and their sound evokes a perception of
pitch. Here 'unpitched' means that the instruments are not used to play melodies.
The fourth part of the book discusses entire music content analysis systems
and the use of musicological and sound source models in these.
Chapter 10 is concerned with auditory scene analysis (ASA) in music
signals. As already mentioned above, ASA refers to the perception of distinct
sources in polyphonic signals. In music, ASA aims at extracting entities like
notes and chords from an audio signal. The chapter reviews psychophysical
findings regarding the acoustic 'clues' that humans use to assign spectral
components to their respective sources, and the role of internal models and
top-down processing in this. Various computational approaches to ASA are
described, with a special emphasis on statistical methods and inference in
Bayesian networks.
Manuel Davy
If time alone or frequency alone are not enough to represent music, then
we need to think in terms of joint time and frequency representations (TFRs).
Western musical scores are actually TFRs with a specific encoding. Such rep-
resentations are introduced in Section 2.1.2, and rely on the concept of frame,
that is, a well time-localized part of the signal. From these frames, we can de-
fine a simple TFR, the spectrogram, as well as a time and time-lag represen-
tation called cepstral representation; see Section 2.1.3. The following section
discusses some basic properties of Fourier Transforms.
The continuous and discrete FTs map the signal from the time domain to the
frequency domain; $X(f)$ and $X(k)$ are generally complex valued. The inverse
Fourier transforms (IFTs) are also quite useful for music processing; they are
defined in (2.3) and (2.4) below.
$$x(n) = \frac{1}{T}\sum_{k=0}^{T-1} X(k)\, e^{\,j2\pi kn/T}, \qquad (2.3)$$
$$x(t) = \int_{-\infty}^{\infty} X(f)\, e^{\,j2\pi ft}\, \mathrm{d}f. \qquad (2.4)$$
An efficient approach to computing DFTs (2.1) and IDFTs (2.3) is the fast
Fourier transform (FFT) algorithm; see [531].
Some properties of the FT/IFT are of importance in music transcription
applications. In particular, the FT is a linear operation. Moreover, it maps
the convolution operation into a simple product.^ In other words, considering
^Readers interested in more advanced topics may refer to a dedicated book;
see [48] for example.
^The convolution operation is used for signal filtering: applying a filter with
time impulse response h(n) to a signal x(n) is done by convolving x with h or,
equivalently, by multiplying their FTs.
This result is also true for the continuous-time FT, as well as for the inverse
transforms. Another important property, the Shannon theorem, states that
the range of frequencies where the discrete FT is meaningful has an upper
limit given by $k_s/2$ (this is the Nyquist frequency, where $k_s$ is the sampling
frequency). A straightforward consequence of the Shannon theorem is that
digital signals sampled at the CD rate $k_s = 44100$ Hz can be analysed up to
the maximum frequency 22050 Hz.
There exists only one Fourier transform of a given signal, as defined above.
However, there is an infinite number of time-frequency representations (TFRs).
The most popular one is the spectrogram, defined as the Fourier transform of
successive signal frames.^
Frames are widely used in audio processing algorithms. They are portions
of the signal with given time localizations. More precisely, the frame localized
at time $t_0$, computed with window $w$ and denoted $s_{t_0}(\tau)$, is obtained by
multiplying the signal by the window centred at $t_0$: $s_{t_0}(\tau) = x(\tau)\, w(t_0 - \tau)$.
^We restrict this discussion to continuous time signals, for the sake of simplicity.
Discrete time-frequency representations of discrete signals are, in general, more com-
plex to write in closed form, and they are obtained by discretizing the continuous
representations; see [191].
^A list of windows and reasons for choosing their shapes can be found in [244],
[61].
Fig. 2.1. The frame $s_{t_0}$ is obtained by multiplying the signal $x$ by a sliding window
$w$ centred at time $t_0$.
Spectrograms are energy representations and they are defined as the squared
modulus of the STFT: $\mathrm{SP}_x^w(t,f) = \big|\mathrm{STFT}_x^w(t,f)\big|^2$.
Changing the window w defines a different STFT and thus a different spec-
trogram. Any spectrogram can also be interpreted as the output of a bank
of filters: considering a given frequency $f$, $\mathrm{SP}_x^w(t,f)$ is the instantaneous en-
ergy at time $t$ of the output of the filter with frequency response given by
$\mathrm{CFT}_w(f - \nu)$, $\nu \in [-\infty,\infty]$, applied to the signal $x$. As a consequence, a short
duration window leads to a spectrogram with good time resolution and bad
frequency resolution, whereas a longer window leads to the opposite situa-
tion.^ Figure 2.2 represents two spectrograms of a piano excerpt, illustrating
the influence of the window length on the representation.
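As a rough illustration of this trade-off, the following sketch (an illustrative example using NumPy/SciPy on a synthetic signal, not code from this chapter) computes spectrograms with 20 ms and 80 ms Hamming windows, mirroring the settings of Fig. 2.2.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 44100                                   # sampling rate (Hz), CD quality
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * 440 * t)              # synthetic stand-in for the piano excerpt

for win_ms in (20, 80):                      # short vs. long analysis window
    nperseg = int(fs * win_ms / 1000)
    f, tt, Sxx = spectrogram(x, fs=fs, window='hamming',
                             nperseg=nperseg, noverlap=nperseg // 2)
    # Sxx holds the energy of each time-frequency bin: the 20 ms window gives
    # better time resolution, the 80 ms window better frequency resolution.
    print(win_ms, "ms window -> frequency bins:", len(f), ", frames:", len(tt))
```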
Another interpretation of the STFT arises when considering (2.8) as the
dot product between $x(\tau)$ and the windowed complex sinusoid $w(t-\tau)\,e^{-j2\pi f\tau}$:
under some conditions, the family of elementary time-frequency atoms
$\{w(t-\tau)\,e^{-j2\pi f\tau}\}_{t,f}$ forms a basis called a Gabor basis. In such cases, $\mathrm{STFT}_x^w$ is a
decomposition of x on this basis, yielding a Gabor representation of x; see
Chapter 3 and [191], [184].
Spectrograms being energy representations, they are quadratic in the sig-
nal $x$. They also have the time-frequency covariance property: let us shift $x$ in
time and frequency, defining $x_1(t) = x(t - t_0)\exp(j2\pi f_0 t)$. The time-frequency
covariance property ensures that $\mathrm{SP}_{x_1}^w(t,f) = \mathrm{SP}_x^w(t - t_0, f - f_0)$. Many other
^A TFR is said to have good time (frequency) resolution if the signal energy is
displayed around its true location with small spread along the time (frequency) axis.
The product of time resolution and frequency resolution is lower bounded via the
Heisenberg-Gabor inequality [191].
Fig. 2.2. Spectrograms of a piano excerpt computed with Hamming windows with
length 20ms (left) and 80ms (right). The time-frequency resolution is highly depen-
dent on the window length.
and we see that filtering a signal in the frequency domain through the product
$X(f)H(f)$ becomes, after taking the log, the addition $\log[X(f)] + \log[H(f)]$,
where $H(f)$ is the filter frequency response. When dealing with discrete time
signals, the cepstrum is also discrete and is generally referred to in terms of
cepstral coefficients. Cepstral coefficients can be computed from the discrete
Fig. 2.3. Time evolution of the first cepstral coefficients for the piano signal of
Fig. 2.2. The coefficient number is written along the vertical axis. For the sake of
figure clarity, the magnitude of the first cepstral coefficient has been divided by five,
and an offset has been added to each coefficient to avoid overlap.
For each mel filter, frequency components within its passband are weighted
by the magnitude response of the filter, and then squared and summed. The
resulting filter-related coefficient is denoted $X_{n_0}(k_{\mathrm{mel}})$. For the full set of fil-
ters, the $X_{n_0}(k_{\mathrm{mel}})$'s are stacked into a vector of size $K_{\mathrm{mel}}$, whose logarithm
Fig. 2.4. Mel filter bank. Each filter has a triangular shape, and unity response
at its centre. Its edges coincide with the adjacent filters' central frequencies. The
central frequencies are linearly spaced on the mel frequency scale, which results in an
exponential interval between the filter centres onto the linear scale, through (2.12).
is transformed back into the time lag domain using the discrete cosine trans-
form (DCT), where the DCT of a discrete signal $x$ with length $T$ is defined
as follows:
$$\mathrm{DCT}_x(i) = \sum_{n=1}^{T} x(n)\cos\!\left[\frac{\pi}{T}\, i\left(n - \frac{1}{2}\right)\right]. \qquad (2.13)$$
Fig. 2.5. MFCC computation steps. The magnitude spectrum $|S_{n_0}(k)|$ of each
frame $s_{n_0}(n)$ is filtered through the mel filter bank. The squared outputs of each
filter are summed over the filter frequency range, yielding a coefficient $X_{n_0}(k_{\mathrm{mel}})$ for
each filter. The vector made of the logarithm of these coefficients is mapped back
to the time domain using the discrete cosine transform.
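A minimal sketch of this pipeline for a single frame is given below. It assumes a triangular mel filter bank built from the usual mel mapping $\mathrm{mel}(f) = 2595\log_{10}(1 + f/700)$ and NumPy/SciPy routines; it is only an illustrative approximation of the steps in Fig. 2.5, not a reference implementation from this book.

```python
import numpy as np
from scipy.fft import dct

def mel(f):      return 2595.0 * np.log10(1.0 + f / 700.0)    # Hz -> mel (assumed mapping)
def inv_mel(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)  # mel -> Hz

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with unity peak, centres evenly spaced on the mel scale."""
    edges_hz = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fb[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_coeffs=13):
    spec = np.abs(np.fft.rfft(frame))                    # magnitude spectrum of the frame
    # Weight the power spectrum by each triangular filter and sum (one common variant)
    energies = mel_filterbank(n_filters, len(frame), fs) @ (spec ** 2)
    # Log of the filter energies, mapped back with the DCT; keep the first coefficients
    return dct(np.log(energies + 1e-12), norm='ortho')[:n_coeffs]

fs = 16000
frame = np.hamming(400) * np.sin(2 * np.pi * 260 * np.arange(400) / fs)
print(mfcc_frame(frame, fs))
```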
$$\mathcal{U}_B(a) = \frac{1}{\mu(B)}\,\mathbb{1}_B(a), \qquad (2.14)$$
where $\mu(B)$ is called the Lebesgue measure of $B$ (it measures the volume of $B$) and
$\mathbb{1}_B$ is the indicator function which satisfies $\mathbb{1}_B(a) = 1$ if $a \in B$ and $\mathbb{1}_B(a) = 0$
otherwise. Uniform pdfs are often met in audio processing problems, and they
are used to model the lack of precise information about a parameter.
Another important pdf in engineering problems is the Gaussian, also called the
normal pdf,
$$\mathcal{N}(\mathbf{a};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d_a/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{a}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{a}-\boldsymbol{\mu})\right),$$
where $\boldsymbol{\mu}$ is the mean vector and $\boldsymbol{\Sigma}$ is the covariance matrix, which is sym-
metric. The vector $\boldsymbol{\mu}$ has the same size as $\mathbf{a}$, and the matrix $\boldsymbol{\Sigma}$ is square, its
size in both dimensions being that of $\mathbf{a}$. An exact definition of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ is
given in (2.19) and (2.20). An important special case is when $a$ is scalar. In
this case, the Gaussian pdf is written as
$$\mathcal{N}(a;m,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(a-m)^2}{2\sigma^2}\right),$$
The precise definition of random variables is beyond the scope of this chapter;
see [548] for a more precise introduction.
where $m$ is the mean and $\sigma^2$ is the variance. There exist many other standard
continuous pdf shapes such as the gamma, inverse gamma, Cauchy, Laplace,
Dirichlet, etc. [544]. When the random variable $\mathbf{a}$ is a vector with dimension
$d_a$, $\mathbf{a} = [a_1,\ldots,a_{d_a}]^\top$, then the pdf $p(\mathbf{a})$ is in fact a joint pdf $p(a_1,\ldots,a_{d_a})$.
This is a function of dimension $d_a$ and it gives us the probability of observing
jointly $a_1,\ldots,a_{d_a}$.
Two very important properties of pdfs and discrete variable distributions
are that they are always positive and their sum (or integral) equals one:
$$\int_{\mathcal{A}} p(\mathbf{a})\,\mathrm{d}\mathbf{a} = 1 \quad \text{(continuous case)}, \qquad \sum_{\mathbf{a}\in\mathcal{A}} P(\mathbf{a}) = 1 \quad \text{(discrete case)}.$$
For any continuous pdf $p(\mathbf{a})$, it is generally possible to define the mean vector
$\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ as follows:
$$\boldsymbol{\mu} = \mathbb{E}_{p(\mathbf{a})}[\mathbf{a}] \quad\text{and}\quad \boldsymbol{\Sigma} = \mathbb{E}_{p(\mathbf{a})}\big[(\mathbf{a}-\boldsymbol{\mu})(\mathbf{a}-\boldsymbol{\mu})^\top\big] \quad \text{(continuous case)}. \qquad (2.25)$$
In the case where the random variable $\mathbf{a}$ is a vector of dimension $d_a$, its order-$r$
moment $M_{\mathbf{a}}^{(r)}$ is an $r$-dimensional tensor with size $d_a \times d_a \times \ldots \times d_a$.
Finally, distributions generally have modes, that is, local maxima: their lo-
cations indicate in which regions of the space $\mathcal{A}$ the random variable $\mathbf{a}$ is more
likely to appear. Figure 2.6 summarizes graphically the concepts introduced
in this subsection.
[Figure 2.6: one-dimensional Gaussian (left) and two-dimensional Gaussian (right).]
which can be interpreted as the pdf of one of the random variables irrespective
of the value of the other one; see Fig. 2.6, right. Using marginal pdfs, it is
possible to decompose the joint pdf $p(a_1,a_2)$ as follows:
$$p(a_1, a_2) = p(a_1 \mid a_2)\, p(a_2),$$
where $p(a_1\mid a_2)$ is the conditional pdf of $a_1$, and should be read as 'the pdf of
$a_1$ conditional on $a_2$'. Its interpretation is simple: imagine that $a_2$ has some
fixed, non-random value denoted $a_2^{\mathrm{fixed}}$. Then $a_1$ is still a random variable,
and its pdf is $p(a_1, a_2 = a_2^{\mathrm{fixed}})$ up to a normalizing constant. This constant is
necessary because the integral of $p(a_1, a_2 = a_2^{\mathrm{fixed}})$ with respect to (w.r.t.) $a_1$
does not equal one anymore. The notation $p(a_1\mid a_2)$ should be understood as
$p(a_1\mid a_2^{\mathrm{fixed}})$, that is, $(1/C)\,p(a_1, a_2 = a_2^{\mathrm{fixed}})$ where the normalizing constant $C$ is
$p(a_2^{\mathrm{fixed}})$ from (2.29).
Finally, two random variables $a_1$ and $a_2$ are independent if and only if
$p(a_1,a_2) = p(a_1)\,p(a_2)$ or, equivalently, $p(a_1\mid a_2) = p(a_1)$ or $p(a_2\mid a_1) = p(a_2)$.
This means that the knowledge of $a_1$ provides no information about $a_2$, and
vice versa.
Random variables are extremely useful in signal processing because they can
model the lack of certainty about a physical phenomenon. Imagine we have a
good model for some process: for example, a 'pure sine' acoustic waveform gen-
erated by an electronic instrument. A model for the pressure signal recorded
is given by the following discrete time sine model:
$$x(n) = \alpha \sin(2\pi k_0 n + \phi_0), \qquad n = 1,\ldots,T, \qquad (2.30)$$
where $k_0$ is the unknown sine waveform frequency, $\phi_0$ is the initial phase, and
$\alpha$ is the signal amplitude. It is clear that the recorded pressure signal will not
fit exactly the model (2.30), and that it will deviate from it. As these deviations
may have various causes (air temperature/pressure inhomogeneity, non purely
sinusoidal loudspeaker behaviour when emitting the sound, digital to analog
conversion artifacts, etc.), it is unrealistic to model them deterministically,
and a random model can be used. A possible such model is
$$x(n) = \alpha \sin(2\pi k_0 n + \phi_0) + e(n), \qquad n = 1,\ldots,T, \qquad (2.31)$$
where e(n) is a so-called additive random noise with a given pdf. Including
this noise in the model is aimed at modelling the deviations of the recorded
data from the model in (2.30). In the general case, it is assumed that e(n) is
a stationary white noise, or independent identically distributed (i.i.d.) noise,
that is, the noise at any time $n_1$ is statistically independent of the noise at
any time $n_2$, and their pdfs are equal. More precisely, the joint pdf of the
noise samples equals the product of the pdfs of each sample: writing the
noise samples as a vector $\mathbf{e} = [e(1),\ldots,e(T)]^\top$ for $n = 1,\ldots,T$, we have
$p(\mathbf{e}) = p(e(1))\,p(e(2))\cdots p(e(T))$, with $p(e(1)) = p(e(2)) = \ldots = p(e(T))$.
Finally, since the noise is aimed at modelling small deviations around the sine
model, it is assumed to be zero-mean. Equation (2.31) defines a probabilistic
signal model and directly yields the likelihood function.
In (2.31), it is assumed that the recorded signal $\mathbf{x}$ follows a sine model with
additive zero-mean white noise. In practice, this model is interesting in the
sense that it relates the recorded signal $x(n)$, $n = 1,\ldots,T$ to the parameters
$k_0$, $\phi_0$, and $\alpha$. In the following, we denote by $\theta$ the set of unknown parameters,
i.e., $\theta = [k_0, \phi_0, \alpha]$. From the probabilistic model defined above, and given a
recorded signal $\mathbf{x}$, we see that some values of $\theta$ are more likely than others:
for example, if the sine wave is generated with frequency 440 Hz, finding $k_0 =$
440 Hz is very likely. It is also likely that the loudspeaker which emits the
sound adds partials (that is, additional sine waves with lower amplitudes)
at frequencies 880 Hz, 1320 Hz, etc. These are also likely, though to a lesser
extent than 440 Hz. Conversely, assume $\theta$ is given; then the signal $x(n)$,
$n = 1,\ldots,T$ can be seen as a random vector denoted by $\mathbf{x} = [x(1),\ldots,x(T)]^\top$.
It admits a joint pdf $p(\mathbf{x}\mid\theta) = p(x(1), x(2), \ldots, x(T)\mid\theta)$, conditional on $\theta$: by
changing $\theta$, the signal pdf $p(\mathbf{x}\mid\theta)$ is changed. In the sine example presented
above, assuming the noise is Gaussian, the covariance matrix of $\mathbf{e}$ is diagonal
of size $T$, with the variance $\sigma^2$ of each $e(n)$, $n = 1,\ldots,T$ on its diagonal. This
pdf is
$$p(\mathbf{x}\mid\theta) = \frac{1}{(2\pi\sigma^2)^{T/2}} \exp\!\left(-\frac{1}{2\sigma^2}\,\big\|\mathbf{x} - \mathbf{f}(\theta)\big\|^2\right), \qquad (2.32)$$
where $\mathbf{f}(\theta)$ is the model given in (2.30) written in vector form, i.e., $\mathbf{f}(\theta) =$
$[\alpha\sin(2\pi k_0 1 + \phi_0), \ldots, \alpha\sin(2\pi k_0 T + \phi_0)]^\top$.
The mathematical object $p(\mathbf{x}\mid\theta)$ admits two interpretations: first, when
read as a pdf of $\mathbf{x}$ for a given $\theta$, $p(\mathbf{x}\mid\theta)$ is called the conditional pdf of $\mathbf{x}$,
conditioned on $\theta$. Second, when seen as a function of $\theta$ for given $\mathbf{x}$, $p(\mathbf{x}\mid\theta)$
is called the parameter likelihood function and it is defined over the space of
all possible values of $\theta$ denoted $\Theta$. For example, $\Theta = [0, 1/2] \times [0, 2\pi] \times [0, \infty)$
in the sine example (2.30). Note that the function $p(\mathbf{x}\mid\theta)$ is not a pdf of $\theta$,
because its integral w.r.t. $\theta$ over $\Theta$ may not equal one.
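To illustrate how such a likelihood surface can be explored, the following sketch (an assumption-laden example: known $\phi_0$ and $\sigma^2$, normalized frequency $k_0 = 0.3$ and amplitude $\alpha = 2$ as in Fig. 2.7) evaluates the Gaussian log-likelihood of the sine-plus-noise model on a grid.

```python
import numpy as np

T, sigma2, phi0 = 200, 4.0, 0.0
n = np.arange(1, T + 1)
rng = np.random.default_rng(0)
# Synthetic observation: sine of normalized frequency 0.3, amplitude 2, plus noise
x = 2.0 * np.sin(2 * np.pi * 0.3 * n + phi0) + rng.normal(0.0, np.sqrt(sigma2), T)

def log_likelihood(k0, alpha):
    """Gaussian log-likelihood of the sine-plus-noise model, phi0 and sigma2 known."""
    f = alpha * np.sin(2 * np.pi * k0 * n + phi0)
    return -T / 2 * np.log(2 * np.pi * sigma2) - np.sum((x - f) ** 2) / (2 * sigma2)

k_grid = np.linspace(0.05, 0.45, 401)
a_grid = np.linspace(0.5, 4.0, 36)
L = np.array([[log_likelihood(k, a) for a in a_grid] for k in k_grid])
i, j = np.unravel_index(np.argmax(L), L.shape)
print("likelihood peak near k0 =", k_grid[i], ", alpha =", a_grid[j])  # expect ~0.3 and ~2
```

The very sharp, narrow peak around the true frequency is exactly the behaviour described in the caption of Fig. 2.7.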
Fig. 2.7. Likelihood $p(\mathbf{x}\mid k_0, \phi_0, \alpha)$ in (2.32) as a function of the frequency $k_0$ and
the amplitude $\alpha$ for a sine wave with frequency $k_0 = 0.3$ and amplitude $\alpha = 2$. The
sine wave is corrupted with a white Gaussian zero-mean additive noise with variance
$\sigma^2 = 4$. Both this variance and the initial phase $\phi_0$ are assumed to be known. The
likelihood shows a very sharp peak, which leaves no doubt about the parameter
value, but which may be hard to localize.
$$\mathrm{Bias}(\hat{\theta}) = \mathbb{E}_{p(\mathbf{x})}\big[\hat{\theta}\big] - \theta_{\mathrm{gt}}. \qquad (2.35)$$
Note that the bias characterizes the estimator (e.g., the ML estimator applied
to the model (2.31) in the sine wave example) and not an estimate $\hat{\theta}$. An
estimator is said to be unbiased whenever $\mathrm{Bias}(\hat{\theta}) = 0$, and this is, of course,
a very important property.
The estimator covariance matrix $\mathrm{Var}(\hat{\theta})$ is defined for both biased and
unbiased estimators (in the latter case, it is also referred to as the mean
square error):
$$\mathrm{Var}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\hat{\theta} - \mathbb{E}[\hat{\theta}])^\top\big]. \qquad (2.36)$$
^The definition of a lower bound for scalar numbers has to be adapted to matrices:
a matrix $\mathbf{A}$ is a lower bound for the covariance matrix $\boldsymbol{\Sigma}$ if and only if the matrix
$\boldsymbol{\Sigma} - \mathbf{A}$ is a definite non-negative matrix (where $\mathbf{A}$ and $\boldsymbol{\Sigma}$ have the same size);
see [295].
typically because we have longer records ($T \to \infty$), we can study the estima-
tor asymptotic properties. In particular, the estimator is said to be consistent
whenever the variance $\mathrm{Var}(\hat{\theta})$ tends to zero.
ML estimators are asymptotically unbiased (in some cases, they are even
unbiased for any number T of data); they are consistent; and they reach
asymptotically the Cramer-Rao bound. Because of these excellent properties,
maximum likelihood approaches are widely used in signal processing. When
a good model is defined, the implementation difficulty consists mainly of the
optimization problem (2.33), which can usually be solved by the expectation-
maximization algorithm.
Gaussian mixture models are quite important in audio processing. For ex-
ample in speech processing the data considered are generally sets of cepstral
coefficients. However, this algorithm is not restricted to cepstral data. In gen-
eral, GMMs are used to estimate the pdf of a set of data. This is because they
have two key properties: 1) given that there are enough Gaussians in (2.37),
GMMs can approximate any pdf (versatility), and 2) finding their parameters
is easy thanks to the EM algorithm.
Consider the set of data $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_m\}$, where each individual datum
is a vector in $\mathbb{R}^{d_x}$. In the speech processing example mentioned above, each
datum $\mathbf{x}_i$ ($i = 1,\ldots,m$) is made of the first $d_x$ cepstral coefficients of an
audio signal frame. Gaussian mixture modelling consists of fitting a mixture
pdf made of $J$ Gaussians on each datum in $\mathbf{X}$. The mixture pdf is
$$p\big(\mathbf{x}_i \mid \{\rho_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1,\ldots,J}\big) = \sum_{j=1}^{J} \rho_j\, \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \qquad (2.37)$$
and, assuming the data are independent, the likelihood of the whole data set is
$$p\big(\mathbf{X} \mid \{\rho_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1,\ldots,J}\big) = \prod_{i=1}^{m} p\big(\mathbf{x}_i \mid \{\rho_j, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\}_{j=1,\ldots,J}\big). \qquad (2.38)$$
Fig. 2.8. Mixture of five Gaussians. The five Gaussians composing the mixture
are represented in dotted lines, whereas the full mixture pdf is represented in solid
line (with offset +1.1 for better visibility). The mixture coefficients are $\rho_1 = \ldots =
\rho_5 = 1/5$. The means of the Gaussians are $[-6, -3, -0.2, 2, 5]$ and the variances are
$[1, 0.16, 0.64, 1, 3.24]$. Any random variable distributed according to this mixture pdf
may be generated either directly from the mixture pdf, or by first selecting randomly
one of the five Gaussians, and then generating the variable from it. For example,
the dot at abscissa 1.8 can either result from direct sampling from the mixture in
solid line, or by first selecting randomly one of the Gaussians (here, the Gaussian
whose mean equals 2), and sampling from this Gaussian.
Algorithm 2.1 is quite simple. The two steps used to compute $\theta^{(l+1)}$ from
$\theta^{(l)}$ are
- The expectation step: compute $Q_{\mathbf{X}}(\theta\mid\theta^{(l)})$ using (2.41).
- The maximization step: maximize $Q_{\mathbf{X}}(\theta\mid\theta^{(l)})$ with respect to $\theta$. This yields
$\theta^{(l+1)}$.
The principle of this algorithm [471] is that we want to maximize the log-
likelihood $\log p(\mathbf{X}, \mathbf{Z}\mid\theta)$ with respect to $\theta$ without knowing the latent variables
$\mathbf{Z}$. The expectation in (2.41) permits us to get rid of the latent variables, so
as to perform the maximization. However, since this expectation is computed
for the parameter value $\theta^{(l)}$, which is not the 'true' value, the expectation in
(2.41) is not the 'true' one. The iterations in Algorithm 2.1 ensure that $\theta^{(l)}$
becomes closer to the 'true' value, yielding a better approximation of the true
expectation in (2.41), and thus a better ML estimate at each iteration.
where $P(z_i = j \mid \mathbf{x}_i, \theta^{(l)})$ is computed using Bayes's rule; see (2.51) in Sec-
tion 2.2.3, p. 31. Moreover, $Q_{\mathbf{X}}(\theta\mid\theta^{(l)})$ in (2.43) can be maximized analyt-
ically. Defining $\bar{n}_j^{(l)} = \sum_{i=1}^{m} P(z_i = j \mid \mathbf{x}_i, \theta^{(l)})$, the update equations
are [38]
$$\rho_j^{(l+1)} = \frac{1}{m}\,\bar{n}_j^{(l)}, \qquad (2.44)$$
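A compact sketch of the EM iteration for a one-dimensional Gaussian mixture is given below. It follows the expectation/maximization steps described above, but it is a simplified illustrative stand-in (synthetic data, two components, ad hoc initialization), not Algorithm 2.1 itself.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data drawn from two Gaussians
X = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(2.0, 0.5, 200)])

J = 2
rho = np.full(J, 1.0 / J)                  # mixture weights rho_j
mu = rng.choice(X, J)                      # initial means
var = np.full(J, np.var(X))                # initial variances

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: posterior probability that datum i was generated by Gaussian j
    resp = np.stack([rho[j] * gauss(X, mu[j], var[j]) for j in range(J)])
    resp /= resp.sum(axis=0, keepdims=True)
    # M-step: re-estimate weights, means and variances from these responsibilities
    Nj = resp.sum(axis=1)
    rho = Nj / len(X)
    mu = (resp @ X) / Nj
    var = np.array([(resp[j] * (X - mu[j]) ** 2).sum() / Nj[j] for j in range(J)])

print("weights:", rho, "means:", mu, "variances:", var)
```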
where $L_{\mathbf{X}}(\theta)$ is defined in (2.34) and $\Omega(\theta)$ is aimed at lowering the likelihood
in parts of the space which are to be avoided. In the Gaussian mixture
example where $J > m$, the penalty term can be set so as to disable variance
parameters that are too small; see [91]. The EM algorithm can be adapted in
order to address penalized likelihood problems [251].
An important application of penalized likelihood concerns the model se-
lection problem^ where one tries to estimate the probabilistic model that best
fits some data. A typical example is that of autoregressive models.
Autoregressive Models
Autoregressive models form a major signal processing tool. They are used
for spectrum estimation, coding, or noise reduction. An autoregressive (AR)
model (also called a linear prediction model) expresses a signal $x(n)$ at time
$n$ as a linear combination of its previous values:
$$x(n) = \mathbf{a}^\top \mathbf{x}_{n-1:n-p} + e(n),$$
where $\mathbf{x}_{n-1:n-p} = [x(n-1), \ldots, x(n-p)]^\top$ and assuming, e.g., $\mathbf{x}_{n-1:n-p} =$
$[0, 0, \ldots, 0]^\top$ for $n < p$. Here, we denote by $\mathbf{x}$ the vector made of signal sam-
ples $x(n)$, $n = 1,\ldots,T$. Maximizing $L_{\mathbf{x}}(\mathbf{a},p)$ yields uninteresting solutions,
typically p becomes very large. This problem can be overcome by maximizing
instead the penalized log-likelihood (2.47). Choosing $\Omega(\mathbf{a},p) = p$ and $\lambda = 1$
leads to Akaike's penalized log-likelihood, which leads to reliable model order
estimation [252]:
$$\ell^{\mathrm{pen}}_{\mathbf{x}}(\mathbf{a},p) = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{T}\big\|x(n) - \mathbf{a}^\top\mathbf{x}_{n-1:n-p}\big\|^2 - p. \qquad (2.50)$$
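The sketch below illustrates this kind of model order selection on synthetic data: AR coefficients are fitted by least squares for each candidate order $p$, and the order maximizing an Akaike-style penalized log-likelihood is kept. It is a hedged approximation in the spirit of (2.50), not the chapter's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
x = np.zeros(T)
for n in range(2, T):                        # simulate a true AR(2) process
    x[n] = 1.5 * x[n - 1] - 0.7 * x[n - 2] + rng.normal()

def fit_ar(x, p):
    """Least-squares estimate of AR(p) coefficients and residual variance."""
    y = x[p:]
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ a
    return a, np.mean(resid ** 2)

def penalized_loglik(x, p):
    a, s2 = fit_ar(x, p)
    N = len(x) - p
    # Gaussian log-likelihood of the residuals minus the Akaike-style penalty (lambda = 1)
    return -N / 2 * (np.log(2 * np.pi * s2) + 1.0) - p

orders = range(1, 11)
scores = [penalized_loglik(x, p) for p in orders]
print("selected order:", orders[int(np.argmax(scores))])   # typically p = 2 here
```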
that the parameter is well defined insofar as the model is correct, but in real
cases, the model is always an approximation of the real world. In Bayesian
approaches, it is assumed instead that the unknown parameter is a random
variable, characterized by a pdf. This pdf is stated before any data are col-
lected and is called the parameter prior distribution,
$$p(a_2 \mid a_1) = \frac{p(a_1 \mid a_2)\, p(a_2)}{\int_{\mathcal{A}_2} p(a_1 \mid a_2)\, p(a_2)\, \mathrm{d}a_2}, \qquad (2.51)$$
which enables us to 'reverse the conditioning'. Note that the denominator in
(2.51) equals $\int_{\mathcal{A}_2} p(a_1, a_2)\, \mathrm{d}a_2 = p(a_1)$, but we keep the integral form in order
not to be confused with a possible prior over $a_1$ which would also be denoted
$p(a_1)$. Bayes's rule can be applied straightforwardly to Bayesian parameter
estimation. Assume we want to learn the value of some parameter $\theta \in \Theta$ from
a set of data $\mathbf{X}$. We already have the likelihood $p(\mathbf{X}\mid\theta)$, and, being Bayesian,
a parameter prior $p(\theta)$ is selected. Using Bayes's rule,
$$p(\theta\mid\mathbf{X}) = \frac{p(\mathbf{X}\mid\theta)\, p(\theta)}{\int_{\Theta} p(\mathbf{X}\mid\theta)\, p(\theta)\, \mathrm{d}\theta}.$$
In (2.47), the penalty term is introduced in the log-likelihood. Going back to
the likelihood via the exponential function, we can rewrite it as the product
$p(\mathbf{X}\mid\theta)\exp\big(-\lambda\Omega(\theta)\big)$, and it appears that the likelihood has been multiplied
by a term that only depends on $\theta$. Assuming that the following integral is finite,
$$\int_{\Theta} \exp\big(-\lambda\Omega(\theta)\big)\, \mathrm{d}\theta = C, \qquad (2.57)$$
the penalty term can be interpreted as defining a prior pdf over $\theta$,
$$p(\theta) = \frac{1}{C}\exp\big(-\lambda\Omega(\theta)\big). \qquad (2.58)$$
Bayes's theory provides a general framework for statistical inference. The main
concept is that of posterior distributions, which result from both the likelihood
and the prior. However, computing estimates such as (2.54) and (2.55) is
a difficult problem in the general case. For example, MMSE estimates are
obtained by the following integral:
$$\hat{\theta} = \int_{\Theta} \theta\, p(\theta\mid\mathbf{X})\, \mathrm{d}\theta. \qquad (2.59)$$
^^It is worth mentioning that the product of two Gaussians is Gaussian; thus the
posterior pdf is also Gaussian in this example, which explains why its mean value
coincides with its maximum.
$$\hat{I}[\mathrm{h}] = \frac{1}{N}\sum_{i=1}^{N} \mathrm{h}(i/N)\, \pi(i/N). \qquad (2.61)$$
The limit of this approach is that the grid size increases exponentially with
the dimension $d_\theta$ of $\Theta$: assuming 100 grid points are used in each dimension
of $\Theta$ of dimension $d_\theta = 50$, the grid size is $100^{50} = 10^{100}$; this is out of reach
of today's computers.
However, another numerical computation technique can be implemented.
Assume random samples $\theta^{(i)}$, $i = 1,\ldots,N$ are available, where each sam-
ple $\theta^{(i)}$ is distributed according to $\pi(\theta)$ (this is denoted $\theta^{(i)} \sim \pi(\theta)$ in the
following). Then, from the law of large numbers, the Monte Carlo estimate
$$\hat{I}_N[\mathrm{h}] = \frac{1}{N}\sum_{i=1}^{N} \mathrm{h}\big(\theta^{(i)}\big) \longrightarrow I[\mathrm{h}] \quad \text{as } N \to \infty,$$
whose accuracy can be assessed through the variance
$$\frac{1}{N}\sum_{i=1}^{N} \Big[\mathrm{h}\big(\theta^{(i)}\big) - \hat{I}_N[\mathrm{h}]\Big]^2. \qquad (2.63)$$
Monte Carlo methods may also be used to compute other kinds of esti-
mates, inside or outside of the Bayesian framework. For example, Monte Carlo
optimization methods may be used to compute maximum likelihood or MAP
estimates; see [544].
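The contrast between grid-based and Monte Carlo approximation can be seen in the following one-dimensional sketch (a toy example with a Gaussian target, not taken from the chapter).

```python
import numpy as np

rng = np.random.default_rng(3)
h = lambda theta: theta ** 2                     # test function h
mu, sigma = 1.0, 0.5                             # target pi(theta): Gaussian N(1, 0.25)

# Grid approximation, feasible here only because theta is one-dimensional
grid = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 1000)
pdf = np.exp(-(grid - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
grid_est = np.trapz(h(grid) * pdf, grid)

# Monte Carlo estimate: average of h over N samples drawn from pi
samples = rng.normal(mu, sigma, 10_000)
mc_est = np.mean(h(samples))

print(grid_est, mc_est, "exact:", mu ** 2 + sigma ** 2)   # E[theta^2] = mu^2 + sigma^2
```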
We have assumed so far that Monte Carlo samples are available. The real
difficulty is actually in generating these samples. When $\pi(\theta)$ is a standard pdf
The principle of MCMC algorithms is as follows: given some pdf $\pi(\theta)$ we want
to sample from, a chain of samples is generated iteratively; see Algorithm 2.2.
The chain is statistically fully determined by the pdf of the initial sample $\theta^{(0)}$,
denoted $\pi_0(\theta)$, and a so-called Markov kernel $\mathcal{K}(\theta\mid\theta')$, which is a pdf w.r.t. $\theta$
(for fixed $\theta'$). Provided the kernel $\mathcal{K}(\theta\mid\theta')$ satisfies some properties [544], the
pdf of each sample $\theta^{(i)}$ slowly converges to the target pdf $\pi(\theta)$ as $i$ increases. In
particular, the kernel must be built so that $\pi(\theta)$ is the invariance distribution
of $\mathcal{K}(\theta\mid\theta')$, namely,
$$\int_{\Theta} \mathcal{K}(\theta\mid\theta')\,\pi(\theta')\, \mathrm{d}\theta' = \pi(\theta) \quad \text{for all } \theta \in \Theta. \qquad (2.64)$$
that we can sample easily from each of the conditional pdfs $p(\theta_1\mid\theta_2,\ldots,\theta_{d_\theta})$,
$p(\theta_2\mid\theta_1,\theta_3,\ldots,\theta_{d_\theta})$, $\ldots$, $p(\theta_{d_\theta}\mid\theta_1,\ldots,\theta_{d_\theta-1})$. Typically, this situation hap-
pens when some of the conditionals are Gaussian and the others are, for
example, gamma distributions. The Gibbs sampler consists of sampling one
component of $\theta$ at a time from the conditional posteriors, as presented in
Algorithm 2.3.
The Gibbs sampler is quite simple; however, it requires the ability to sam-
ple from the conditional pdfs. This is sometimes not possible to implement,
and we can use instead the Metropolis-Hastings (MH) algorithm.
With probability $\alpha_{\mathrm{MH}}(\theta^*,\theta^{(i-1)})$, accept the candidate, i.e., set $\theta^{(i)} = \theta^*$.
Otherwise (that is, with probability $1 - \alpha_{\mathrm{MH}}(\theta^*,\theta^{(i-1)})$), reject the candidate,
i.e., set $\theta^{(i)} = \theta^{(i-1)}$.
^^The Dirac delta function can be viewed as the derivative of the step function
at point 0, where the step function equals zero over $]-\infty, 0[$ and one over $]0, \infty[$.
An important property is that $\delta_u(v) = 0$ whenever $u \neq v$ and, for a function $\mathrm{h}(v)$,
we have $\int \mathrm{h}(v)\,\delta_u(v)\,\mathrm{d}v = \mathrm{h}(u)$. When used in a probabilistic context, writing that $a$ has
distribution $\delta_u(a)$ means that $a = u$ deterministically.
Fig. 2.9. Typical Markov chains produced by different Markov kernels for the
Metropolis-Hastings algorithm. The parameter sampled is a frequency having the
true value 0.25. a) Local, Gaussian random walk proposal pdf. b) Global, indepen-
dent proposal pdf. c) Mixture of a local and a global proposal. The local proposal
fails to explore the frequencies in a short time, whereas the global proposal keeps at
the same frequency for many iterations. The mixed proposal performs well.
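A minimal random-walk Metropolis-Hastings sampler for a one-dimensional target is sketched below. It uses only a local Gaussian proposal, whereas Fig. 2.9 also considers global and mixed proposals; the target pdf here is an arbitrary illustrative choice, not one from this chapter.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_target(theta):
    """Unnormalized log of a target pdf with two Gaussian-shaped modes."""
    return np.logaddexp(-(theta + 2.0) ** 2 / 0.5, -(theta - 3.0) ** 2 / 2.0)

def metropolis_hastings(n_iter=20_000, step=1.0, theta0=0.0):
    chain = np.empty(n_iter)
    theta = theta0
    for i in range(n_iter):
        cand = theta + step * rng.normal()            # local Gaussian random-walk proposal
        # Symmetric proposal: the acceptance ratio reduces to pi(cand) / pi(theta)
        if np.log(rng.uniform()) < log_target(cand) - log_target(theta):
            theta = cand                               # accept the candidate
        chain[i] = theta                               # otherwise keep the previous sample
    return chain

chain = metropolis_hastings()
print("target mean estimated from the chain:", chain[5000:].mean())  # 5000 burn-in samples discarded
```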
An Example: Bayesian Estimation of Sinusoids in Noise
$$\mathbf{x} = \mathbf{D}(\mathbf{k})\,\mathbf{a} + \mathbf{e}, \qquad (2.69)$$
The amplitudes follow a Gaussian prior pdf with mean $\mathbf{0}$ and covariance
$\sigma^2\boldsymbol{\Sigma}$, which is proportional to the additive noise variance (this is to adjust
the noise 'amplitude' to the sinusoids' average amplitude):
$$p(\mathbf{a}) = \mathcal{N}\big(\mathbf{a}; [0,0]^\top, \sigma^2\boldsymbol{\Sigma}\big), \qquad (2.70)$$
$$p(\sigma^2) = \mathcal{IG}(\sigma^2; \nu_0, \nu_1) = \frac{\nu_1^{\nu_0}}{\Gamma(\nu_0)}\,(\sigma^2)^{-\nu_0-1}\exp\!\left(-\frac{\nu_1}{\sigma^2}\right), \qquad (2.72)$$
where $\Gamma(\cdot)$ is the Gamma function [8]. This choice has two main justifi-
cations: first, for small $\nu_0$ and $\nu_1$, this density favors small values of $\sigma^2$.
Second, this pdf is called a conjugate prior because the posterior distribu-
tion can be calculated in closed form. In practice, we may choose $\nu_0 \ll 1$
and $\nu_1 \ll 1$; the precise selected values have little
influence on the estimation results.
Using these priors and the likelihood, it is possible to write the para-
meters posterior. Moreover, we can express the two conditional posteriors
$p(\mathbf{k}, \mathbf{a}, \sigma^2 \mid \gamma^2, \mathbf{x})$ and $p(\gamma^2 \mid \mathbf{k}, \mathbf{a}, \sigma^2, \mathbf{x})$. The former can be further decomposed
into the product $p(\mathbf{k}\mid\gamma^2,\mathbf{x})\,p(\sigma^2\mid\mathbf{k},\gamma^2,\mathbf{x})\,p(\mathbf{a}\mid\mathbf{k},\sigma^2,\gamma^2,\mathbf{x})$, with
$$p(\sigma^2\mid\mathbf{k},\gamma^2,\mathbf{x}) = \mathcal{IG}\!\left(\sigma^2;\ \nu_0 + \frac{T}{2},\ \nu_1 + \frac{\mathbf{x}^\top\mathbf{P}(\mathbf{k},\gamma^2)\,\mathbf{x}}{2}\right), \qquad (2.74)$$
where $\mathbf{P}(\mathbf{k},\gamma^2) = \mathbf{I}_T - \mathbf{D}(\mathbf{k})\,\mathbf{S}(\mathbf{k},\gamma^2)\,\mathbf{D}(\mathbf{k})^\top$ and $\mathbf{S}(\mathbf{k},\gamma^2) = \frac{\gamma^2}{1+\gamma^2}\big[\mathbf{D}(\mathbf{k})^\top\mathbf{D}(\mathbf{k})\big]^{-1}$.
The hyperparameter conditional posterior is the inverted gamma pdf
$$p(\gamma^2\mid\mathbf{k},\mathbf{a},\sigma^2,\mathbf{x}) = \mathcal{IG}\!\left(\gamma^2;\ \nu_3 + 1,\ \frac{\mathbf{a}^\top\mathbf{D}(\mathbf{k})^\top\mathbf{D}(\mathbf{k})\,\mathbf{a}}{2\sigma^2} + \nu_4\right), \qquad (2.76)$$
where the parameters are chosen as $\nu_3 = 2$ and $\nu_4 = 20$ to ensure that the
prior over $\gamma$ has an infinite variance and a vague shape. Following [21], we can
now write the MCMC algorithm that mixes a Gibbs sampler with one local
and one global MH kernel.
1. Initialization:
   Sample $\gamma^{2(0)} \sim \mathcal{IG}(\gamma^2; \nu_3, \nu_4)$.
   For $m = 1,\ldots,M$, sample $k_m^{(0)} \sim q_{\mathrm{g}}(k_m\mid\mathbf{x})$.
2. Iterations: For $i = 1, 2, \ldots, N$ do
   For $m = 1,\ldots,M$ do
   - Frequency MH step:
     With probability $\beta$, sample a candidate using a local proposal $k_m^* \sim q_{\mathrm{l}}\big(k_m \mid k_m^{(i-1)}\big)$.
     Otherwise, sample a candidate frequency using the global proposal $k_m^* \sim q_{\mathrm{g}}(k_m\mid\mathbf{x})$.
Fig. 2.10. The frequency $\mathbf{k}$ and amplitude $\mathbf{a}$ vectors and the noise variance $\sigma^2$
sampled by Algorithm 2.5 for $m = 1$ and $M = 12$. As can be seen, convergence
is reached after 200 iterations. The true parameter values are $k_1 = 0.12$, $a_1 = 1$,
$a_2 = 0$, and $\sigma^2 = 2$.
2.3.6 Importance Sampling and Sequential Importance Sampling
$$\hat{I}_N[\mathrm{h}] = \frac{1}{N}\sum_{i=1}^{N} \tilde{w}^{(i)}\, \mathrm{h}\big(\theta^{(i)}\big) \approx I[\mathrm{h}], \quad \text{with} \quad \tilde{w}^{(i)} = \frac{\pi\big(\theta^{(i)}\big)}{q\big(\theta^{(i)}\big)}, \qquad (2.78)$$
where $\tilde{w}^{(i)}$ is the importance weight of sample $\theta^{(i)}$, and is aimed at correcting
the discrepancy between $q(\theta)$ and $\pi(\theta)$. A key remark is that the variance of
the estimator (2.78) strongly depends on the importance pdf $q(\theta)$ selected. It
can be demonstrated that, for a given number of Monte Carlo samples, the
Particle Filtering
Fig. 2.11. Graphical illustration of the dynamic model in (2.79) and (2.80). In this
model, only $\mathbf{x}_n$ is observed. The state parameter vector $\theta_n$ is hidden for $n = 1, 2, \ldots$;
$\theta_n$ follows a Markov evolution, hence the name hidden Markov model.
It is now possible to sample the state at time $n$ using $q_n(\theta_n\mid\theta_{n-1})$ and com-
pute sequentially the weight as the ratio $p(\theta_{0:n}\mid\mathbf{x}_{1:n})/q_n(\theta_{0:n})$. These elements
lead to the particle filter presented in Algorithm 2.6 below.
The particle trajectories are updated using the sequential importance pdf:
- For $i = 1,\ldots,N$, sample the new state at time $n$ for particle $i$,
$$\theta_n^{(i)} \sim q_n\big(\theta_n \mid \theta_{n-1}^{(i)}\big), \qquad (2.83)$$
and update the importance weight of particle $i$ accordingly, dividing by the proposal $q_n\big(\theta_n^{(i)} \mid \theta_{n-1}^{(i)}\big)$.
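The following sketch shows the propagate/weight/resample loop of a bootstrap particle filter on a scalar linear-Gaussian state-space model. The model, the proposal (the prior kernel), and all numerical values are illustrative assumptions rather than the book's Algorithm 2.6.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 50, 500
# Scalar state-space model: theta_n = 0.9*theta_{n-1} + v_n,  x_n = theta_n + w_n
theta_true = np.zeros(T)
x = np.zeros(T)
for n in range(1, T):
    theta_true[n] = 0.9 * theta_true[n - 1] + rng.normal(0, 1)
    x[n] = theta_true[n] + rng.normal(0, 1)

particles = rng.normal(0, 1, N)                          # theta_0^{(i)} drawn from the prior
estimates = []
for n in range(1, T):
    particles = 0.9 * particles + rng.normal(0, 1, N)    # propagate with the prior kernel
    logw = -0.5 * (x[n] - particles) ** 2                # weight by the likelihood p(x_n | theta_n)
    w = np.exp(logw - logw.max()); w /= w.sum()
    estimates.append(np.sum(w * particles))              # weighted (MMSE-type) estimate of theta_n
    idx = rng.choice(N, N, p=w)                          # resample to limit weight degeneracy
    particles = particles[idx]

print("final estimate vs. truth:", estimates[-1], theta_true[-1])
```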
2.4.1 K-Means
The $K$-means algorithm is an unsupervised classification method. The set
of data $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_m\}$ in $\mathcal{X}$ is provided. Given the number of classes,
denoted $K$, the algorithm assigns a label in $\mathcal{Y} = \{1,\ldots,K\}$ to each $\mathbf{x}_i$. The
algorithm alternates label assignment and class-centre updates, using a distance
measure $d(\cdot,\cdot)$ in $\mathcal{X}$, as in the sketch below.
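This is a minimal sketch of the standard K-means iteration with a Euclidean distance $d(\cdot,\cdot)$ (illustrative only; initialization and stopping rules are simplified).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Standard K-means: alternate label assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Assign each datum to the nearest centroid (Euclidean distance d)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of the data assigned to it
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```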
3. Keep the $d_{\mathrm{pca}}$ columns of $\mathbf{S}$ that correspond to the $d_{\mathrm{pca}}$ largest eigenvalues
in $\boldsymbol{\Lambda}$. They are stored in a matrix $\mathbf{S}_{\mathrm{pca}}$ with size $d_x \times d_{\mathrm{pca}}$, and the
corresponding eigenvalues are in the square matrix $\boldsymbol{\Lambda}_{\mathrm{pca}}$ with size $d_{\mathrm{pca}}$.
4. Compute the lower dimensional data for $i = 1,\ldots,m$:
$$\boldsymbol{\zeta}_i = \mathbf{S}_{\mathrm{pca}}^\top(\mathbf{x}_i - \boldsymbol{\mu}_{\mathbf{x}}). \qquad (2.88)$$
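The steps above translate directly into the following sketch (a row-wise data convention and NumPy's eigendecomposition are assumed; the $1/m$ covariance normalization follows the footnote below).

```python
import numpy as np

def pca(X, d_pca):
    """Project data (rows of X) onto the d_pca principal directions."""
    mu = X.mean(axis=0)                              # empirical mean
    C = (X - mu).T @ (X - mu) / len(X)               # empirical covariance, 1/m normalization
    eigvals, eigvecs = np.linalg.eigh(C)             # eigendecomposition of the covariance
    order = np.argsort(eigvals)[::-1][:d_pca]        # keep the d_pca largest eigenvalues
    S_pca = eigvecs[:, order]                        # size d_x x d_pca
    Z = (X - mu) @ S_pca                             # lower-dimensional data, as in (2.88)
    return Z, S_pca, eigvals[order]

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])
Z, S_pca, lam = pca(X, d_pca=1)
print(lam)      # variance captured by the leading component
```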
The matrix diagonalization in (2.87) is almost always possible^^ whenever
$m$ is larger than $d_x$. The transform in (2.88) reduces the dimension of $\mathbf{x}_i$
^^The empirical covariance matrix in (2.86) is computed with normalization term
1/m. The corresponding empirical covariance matrix is known to be a biased esti-
mate of the true covariance matrix. The unbiased empirical covariance matrix should
be computed with normalization term $1/(m-1)$.
^^ Details about eigenvalues, eigenvectors, and matrix diagonalization may be
found in any textbook about matrices; see, e.g., [295].
^^A property is said to be almost true (here, the property is 'diagonalizing the
covariance matrix is possible') if it is true with probability one. Surprisingly, even
though it is true with probability one, it is still possible that the property is false,
but this is quite unlikely; see [548] for further details. Here, the property would be
false if at least one datum was a linear combination of others.
Fig. 2.12. Standard loss functions to be used in learning problems. The quadratic
loss $c(\mathbf{x},y;\mathrm{F}(\mathbf{x})) = (y - \mathrm{F}(\mathbf{x}))^2$ is used in so-called least squares methods, whereas
the hinge losses are used in support vector machines. The 0-1 loss $c(\mathbf{x},y;\mathrm{F}(\mathbf{x})) =
1 - \delta_y(\mathrm{F}(\mathbf{x}))$ is often used in Bayesian classification problems.
From loss functions, we can define the risk R [F] of a classification function
F as the expected loss over all possible pairs (x, y), namely
where p(x, y) is the 'pdf' of the data and labels. Of course, this 'pdf' is un-
known in practice, and the risk cannot be computed directly. Assume however
that R [F] of any classification function F could be computed: the optimal clas-
sification function could be found by minimizing the risk w.r.t. F over a set
of functions denoted ℱ. This approach is not really possible, but the risk can
still be estimated from the training set. Define the empirical risk
\[
R_{\mathrm{emp}}^{(\mathbf{X},\mathbf{Y})}[F] = \frac{1}{m} \sum_{i=1}^{m} c(\mathbf{x}_i, y_i; F(\mathbf{x}_i)).
\]  (2.90)
Fig. 2.13. Supervised classification into two classes with 2-dimensional data. In the training set (X, Y), data with label y = 1 are represented with dots, whereas data with label y = −1 are represented with squares. The dotted line is a classification function F such that R_emp^{(X,Y)}[F] = 0. Though it achieves zero empirical risk, F is not a good classification function, as it makes an error for a new datum which is not in the training set (circle at the bottom, with the true label y = 1).
2.4.4 Regularization
Minimizing the empirical risk alone can lead to useless functions F, as illustrated in Fig. 2.13. This problem can be overcome by minimizing an objective function that includes both the empirical risk R_emp^{(X,Y)}[F] and a term Ω(F) that penalizes unwanted solutions. Define the regularized risk
\[
R^{\mathrm{reg}}[F] = R_{\mathrm{emp}}^{(\mathbf{X},\mathbf{Y})}[F] + \lambda\, \Omega(F),
\]  (2.91)
where λ is the regularization tuning parameter. The optimal classification function F_(X,Y) is found by solving
Support vector machines (SVMs) are specific instances of the above regular-
ization scheme. In SVMs,
\[
k(\mathbf{x}, \mathbf{x}') = \exp\!\Big( -\tfrac{1}{2\sigma^2}\, \|\mathbf{x} - \mathbf{x}'\|^2 \Big).
\]  (2.94)
In SVMs, the hinge loss c_hinge(x, y; F(x)) (see Fig. 2.12) is chosen and the penalty term Ω(F) is the squared norm induced by the dot product in ℋ, i.e., Ω(F) = ||f||²_ℋ. For any x, from the reproducing property, F(x) = ⟨k(x, ·), f(·)⟩_ℋ + b, which means that classifying a datum x is an affine operation (that is, a linear + constant operation) in terms of elements of ℋ, because the dot product ⟨f(·), f′(·)⟩_ℋ is linear w.r.t. both f(·) and f′(·), and a non-linear operation in terms of elements of 𝒳. In SVMs, the regularized risk (2.92) becomes
\[
R^{\mathrm{reg}}[F] = \frac{1}{m} \sum_{i=1}^{m} c_{\mathrm{hinge}}\big(\mathbf{x}_i, y_i; \langle k(\mathbf{x}_i, \cdot), f(\cdot)\rangle_{\mathcal{H}} + b\big) + \lambda\, \|f\|_{\mathcal{H}}^2 .
\]  (2.95)
Minimize
\[
\tfrac{1}{2}\|f\|_{\mathcal{H}}^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i \quad \text{with respect to } f, \rho, \xi_i, b,
\]  (2.96a)
with
\[
y_i\big( \langle k(\mathbf{x}_i, \cdot), f(\cdot)\rangle_{\mathcal{H}} + b \big) \ge \rho - \xi_i, \quad \text{for all } i = 1, \ldots, m,
\]  (2.96b)
and
\[
\xi_i \ge 0, \ \text{for all } i = 1, \ldots, m, \qquad \rho \ge 0,
\]  (2.96c)
where the slack variables ξ_i (i = 1, . . . , m) are used to implement the hinge loss [568]: they are non-zero only if y_i(⟨k(x_i, ·), f(·)⟩_ℋ + b) < ρ and, in this case, they induce a linear cost in the objective function ½||f||²_ℋ − νρ + (1/m) Σ_i ξ_i.
Introducing Lagrange multipliers α_1, . . . , α_m, the optimization in (2.96) can be turned into the equivalent quadratic convex problem with linear constraints (dual problem): Maximize
\[
-\frac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{with respect to } \alpha_i \ (i = 1, \ldots, m),
\]  (2.97a)
with
The set of Lagrange multipliers α_i, i = 1, . . . , m that solve (2.97) leads to the optimal classification function F_(X,Y), which is written for all x ∈ 𝒳 as
\[
F_{(\mathbf{X},\mathbf{Y})}(\mathbf{x}) = \sum_{i=1}^{m} y_i \alpha_i\, k(\mathbf{x}_i, \mathbf{x}) + b.
\]  (2.98)
The classification of a new datum x into one of the two classes {−1; +1} is performed by assigning to x the class label y = sign(F_(X,Y)(x)), where F_(X,Y) is given in (2.98). The support vector machine admits a geometrical interpretation; see Fig. 2.14.
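Once the multipliers α_i and the offset b have been obtained, evaluating (2.98) is straightforward. The sketch below assumes that α and b are already available from solving the dual (2.97), for instance with an off-the-shelf quadratic programming solver; the Gaussian kernel of (2.94) is used, and the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel of (2.94)."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x, X_train, y_train, alpha, b, sigma=1.0):
    """Evaluate F(x) = sum_i y_i alpha_i k(x_i, x) + b as in (2.98)
    and return the class label sign(F(x))."""
    F = sum(y_i * a_i * gaussian_kernel(x_i, x, sigma)
            for x_i, y_i, a_i in zip(X_train, y_train, alpha)) + b
    return np.sign(F)
```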
Fig. 2.14. Geometrical interpretation of the ν-soft margin support vector machine in the reproducing kernel Hilbert space ℋ. Each element of ℋ is a function which is plotted here as a point in an infinite-dimensional space (here, ℋ is represented as a two-dimensional space). The classification function f is the vector orthogonal to the hyperplane with equation ⟨f, g⟩_ℋ + b = 0. This hyperplane separates the data from the two classes with maximum margin, where the dots represent the functions k(x_i, ·) with label y_i = 1, and the squares represent the functions k(x_j, ·) with label y_j = −1. The margin width is ρ/||f||_ℋ. As this is the soft margin SVM, some training vectors are allowed to be located inside the margin. The vectors located on the margin hyperplanes (dotted lines) and inside the margin are called the support vectors.
In the sine signals example, the training set is cut into X_(1) and X_(−1), containing respectively the signals with frequency k_(1) and k_(−1). The posterior of k_(1) (respectively k_(−1)) is computed independently for each class. Once the posterior is computed, it is possible to implement the classification of x via the computation of the class likelihood
Hidden Markov models (HMMs) are widely used in speech recognition mainly
because they offer a robust pattern recognition dynamic scheme. HMMs in-
clude continuous state space systems such as (2.79)-(2.80): the parameter vector θ_n lives in a continuous state space Θ; its evolution follows a so-called Markov process (in the sense that the pdf of θ_n only depends on the pdf of θ_{n−1} at each time n = 1, 2, . . .); and the state cannot be observed directly (it is hidden). In the speech and audio processing literature, HMMs mostly refer to finite, discrete state space dynamic models; that is, θ_n is a discrete random variable in a finite space with E possible state values, namely Θ = {e_1, . . . , e_E}; see Fig. 2.11. A finite, discrete state space HMM is governed by
- the state transition probabilities P(θ_n = e_i | θ_{n−1} = e_j) for (i, j) = 1, . . . , E; this is the discrete equivalent of (2.79);
- the state likelihoods p(x_n | θ_n = e_i), i = 1, . . . , E, equivalent to (2.80);
- and the initial probabilities P(θ_1).
Similar to the particle filtering problem, the issue here is to estimate the sequence of states over time. Typically, in speech recognition the observations are MFCCs extracted from frames of the speech signal, and the state is
the phoneme pronounced by the speaker. Likelihood functions are typically
Gaussian mixtures over the MFCCs (there is one GMM for each possible
state). Transition probabilities and GMM parameters are learned from a large
database and from the speaker's voice.^^
The aim of the Viterbi algorithm is the estimation of the sequence of states θ_{1:T} from time 1 to time T by maximum a posteriori. In other words, given a sequence of observations x_{1:T}, the Viterbi algorithm finds the sequence of states θ_{1:T} such that
1. Initialization: for i = 1, . . . , E, set w_1(e_i) = p(x_1 | θ_1 = e_i) P(θ_1 = e_i).
2. Iterations: for n = 2, . . . , T, for i = 1, . . . , E, compute
3. Termination: compute θ̂_T = arg max_{e_i} w_T(e_i).
4. State sequence backtracking
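A minimal implementation of the four steps above is sketched below. It works in the log domain for numerical robustness (whereas the algorithm above manipulates probabilities directly) and assumes that the transition matrix, initial probabilities, and per-frame state log-likelihoods are already available, for instance from GMM observation models.

```python
import numpy as np

def viterbi(x_loglik, logA, logpi):
    """Minimal Viterbi decoder for a finite-state HMM.

    x_loglik : (T, E) array of log state likelihoods log p(x_n | theta_n = e_i).
    logA     : (E, E) log transition matrix, logA[j, i] = log P(theta_n = e_i | theta_{n-1} = e_j).
    logpi    : (E,) log initial probabilities log P(theta_1 = e_i).
    Returns the MAP state sequence as an array of state indices.
    """
    T, E = x_loglik.shape
    w = np.zeros((T, E))                 # w_n(e_i): best log score ending in state i
    back = np.zeros((T, E), dtype=int)   # backpointers for the backtracking step
    w[0] = logpi + x_loglik[0]           # 1. initialization
    for n in range(1, T):                # 2. iterations
        scores = w[n - 1][:, None] + logA          # (from_state, to_state)
        back[n] = np.argmax(scores, axis=0)
        w[n] = x_loglik[n] + np.max(scores, axis=0)
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(w[-1])        # 3. termination
    for n in range(T - 2, -1, -1):       # 4. state sequence backtracking
        states[n] = back[n + 1, states[n + 1]]
    return states
```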
3.1 Introduction
Musical signals are, strictly speaking, acoustic signals where some aesthet-
ically relevant information is conveyed through propagating pressure waves.
Although the human auditory system exhibits a remarkable ability to interpret
and understand these sound waves, these types of signals cannot be processed
as such by computers. Obviously, the signals have to be converted into digital
form, and this first implies sampling and quantization. In time-domain dig-
ital formats, such as the Pulse Code Modulation (PCM), or newer formats such as the one-bit oversampled bitstreams used in the Super Audio CD, audio signals can be stored, edited, and played back. However, many current signal
processing techniques aim at extracting some musically relevant high-level in-
formation in (optimally) an unsupervised manner, and most of these are not
directly applicable in the above-mentioned time domain. Among such seman-
tic analysis tasks, let us mention segmentation, where one wants to break
down a complex sound into coherent sound objects; classification, where one
wants to relate these sound objects to putative sound sources; and transcrip-
tion, where one wants to retrieve the individual notes and their timings from
the audio signals. For such algorithms, it is often desirable to transform the
time-domain signals into other, better suited representations. Indeed, accord-
ing to the Merriam-Webster dictionary,^ to 'represent' primarily means 'to
bring clearly before the mind'.
Among alternate representations, the most popular is undoubtedly the
time-frequency representation, whose visual counterpart (usually without
phase information) is the widely used 'spectrogram' (see Chapter 2). Here,
at least visually, higher-level features such as note onset, fundamental fre-
quency or formants can be distinguished and estimated. However, a major
Fig. 3.1. Although equivalent, these four representations of the same signal (impulse
response of an open tube, kindly provided by M. Castellengo) highlight different
features of the signal (for the sake of clarity, only the magnitude of complex values
has been displayed).
a few specific situations, and use either the discrete or the continuous setting. We use the following notation: we reserve the letter n to denote discrete time variables (i.e., we denote discrete time signals as x(n), n ∈ ℤ), and the letter t for continuous time signals x(t), t ∈ ℝ.
where the phase φ_m of the m-th partial is sometimes written as a primitive of a smooth time-dependent frequency f_m(t):
\[
\varphi_m(n) = \int_0^{n/k_s} 2\pi f_m(t)\, dt + \varphi_m(0),
\]
k_s being the sampling frequency and ε(n) representing the error of the model.
Here, it is implicitly assumed that the amplitude and frequency parameters (a_m, f_m) for each partial sinusoid evolve slowly over time, in such a way that their values can be estimated frame by frame (a typical frame size is 23 ms, representing 1024 samples at a 44.1 kHz sampling rate).
The selection of the partials is made first by peak-picking the magni-
tude of the short-time Fourier transform of the signal. Chaining the obtained
peaks into partials (i.e., curves n ↦ f_m(n) for all m in a joint time-frequency
domain) is a non-trivial task which has received significant attention since
the early contribution of [449], including the hidden Markov chain approach
of [139], or linear prediction [383]. Whatever the chosen approach and al-
gorithm for partial chaining, the underlying idea is to exploit the supposed
smoothness of the time-dependent frequencies fm {t), and chain the peaks that
are close to each other in the frequency domain. Note that, for resynthesis using (3.1), the requirement of phase continuity at frame boundaries implies some non-trivial interpolation scheme for the partials' frequencies f_m(t) (for instance a cubic spline interpolation).
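The peak-picking and chaining procedure can be sketched very compactly with a greedy nearest-frequency rule, as below; the thresholds, the maximum frequency jump, and the function name are illustrative assumptions, and the published chaining methods cited above are considerably more refined.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def track_partials(x, fs=44100, n_fft=1024, max_jump_hz=40.0):
    """Naive spectral peak picking and nearest-frequency chaining of partials.

    Returns a list of partial tracks, each a list of (frame, frequency, magnitude).
    """
    f, t, X = stft(x, fs=fs, nperseg=n_fft)
    mag = np.abs(X)
    tracks = []                                   # partial tracks built so far
    for n in range(mag.shape[1]):
        peaks, _ = find_peaks(mag[:, n], height=np.max(mag[:, n]) * 0.05)
        for p in peaks:
            freq = f[p]
            # chain the peak to the closest track that was active in the previous frame
            best = None
            for tr in tracks:
                last_frame, last_freq, _ = tr[-1]
                if last_frame == n - 1 and abs(last_freq - freq) < max_jump_hz:
                    if best is None or abs(last_freq - freq) < abs(best[-1][1] - freq):
                        best = tr
            if best is not None:
                best.append((n, freq, mag[p, n]))
            else:
                tracks.append([(n, freq, mag[p, n])])   # start a new partial
    return tracks
```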
Fig. 3.2. Local spectrum peaks, chained into partials (left). Partials grouped into
locally harmonic structures (right): partials corresponding to two locally harmonic
sources are represented by full lines and dashed lines respectively.
The partials may in turn be used for various purposes, including harmonic
source separation and possibly transcription, as shown in [652]. There, distances between time-dependent amplitudes a_m(t) and frequencies f_m(t) are proposed, together with a measure of harmonic concordance between partials.
A perceptual distance between partials involving these three distances is then
built, and numerically optimized for grouping partials into locally harmonic
sources. An elementary example of such grouping is presented in Fig. 3.2.
This model has been refined by Serra in the spectral model synthesis (SMS)
model [576], where the residual of a sinusoidal model is also taken into account
as a so-called stochastic part. This modification has made it possible to perform
high-quality processing of general musical signals [649] (see also [408]) that
have a more complex behavior than speech signals. Applications that make
use of SMS range from audio effects (morphing, time-scaling, etc.) and source
separation to sound analysis (transcription).
One of the main limitations of the SMS approach is the lack of an explicit
model for the residual part. This results in having to keep this residual in the
time domain, hence requiring a very large number of parameters. As a first-
order approximation, this residual can be modelled as filtered white noise, the
parameters of the filter being estimated by the energy in perceptual frequency
bands [219] or through classical autoregressive (AR) or autoregressive mov-
ing average (ARMA) methods [472] (see Chapter 2). This solution is used in
the Harmonics and Individual Lines plus Noise (HILN) coder [532]. With fur-
ther improvements such as grouping harmonically related components, HILN
achieves fair sound quality at bit rates as low as 6 kbit/s.
However, in all the above-described modelling of the residual, there is
always the implicit assumption that the parameters evolve slowly over the
analysis frame. Obviously, this does not always hold; for instance it does
not hold at the sharp note attacks of many percussive instruments. These
fast-varying features are in certain cases characterized by sudden bursts of
noise, and/or fast changes of the local spectral content. These components
A nice feature of the approach above is the fact that given estimates for
parameters of the model, the latter may be used for signal synthesis. When
synthesis is not necessary, it is no longer necessary to start from the signal
waveform, and it may be easier to start from other representations of the
signal, for example a short-time Fourier spectrum. This approach has been
taken, among others, in [644], [645], where a new stochastic model for musical
instruments local spectra was introduced. The model parameters are first
estimated on a training set, and may then be used for transcription. Since only
spectra (and not signals) are modelled in this approach, separation may only
be performed by an appropriate post-processing, e.g. local Wiener filtering.
We refrain from discussing this approach in more detail here, and refer to
Vincent [644] for a thorough description.
(I being either the real line ℝ or some subset), with inner product
\[
\langle x, y\rangle = \sum_{n=-\infty}^{+\infty} x(n)\, y(n)^* ;
\]
The (finite-dimensional) space ℂ^N, with inner product
\[
\langle x, y\rangle = \sum_{n=0}^{N-1} x(n)\, y(n)^* .
\]
Waveform Bases
As mentioned above, the signal space may be finite dimensional (for exam-
ple finite-length discrete signals or finite-length band limited continuous time
signals), or infinite dimensional (for example finite or infinite support discrete
or continuous time signals), but the general framework remains the same. We
shall limit ourselves to spaces of finite-energy signals, i.e., L² spaces in the
continuous-time case, and spaces of square-summable sequences in the discrete
and finite cases, with norms and inner products as above.
Remark 2. The reader who is not interested in the mathematical details may
simply remember that in the case of discrete time, finite-length signals (i.e., a
finite-dimensional signal space), expanding a signal onto an orthonormal basis
is nothing but applying a unitary matrix to the signal vector (a simple example
is the DFT matrix). The inverse operation, i.e., reconstructing a signal from
the coefficients of a basis expansion, is also a matrix-vector multiplication,
using the Hermitian conjugate (i.e., the complex conjugate of the transpose) of
the transform matrix.
Among the orthonormal bases, wavelet and local cosine (with extensions
such as the modified discrete cosine transform, or MDCT) bases have been
particularly popular. The main difference between these two systems is the fact
that the time and frequency resolution of local cosine waveforms is uniform along the time and frequency axes, while wavelets offer finer time resolution
(and thus broader frequency resolution) at high frequencies.
Wavelets
Wavelets (see for example [113] and [429] for thorough reviews) are generated from a single 'atom' ψ by regular translations and dilations. In the continuous-time setting, a deep result by Mallat and Meyer states that it is possible to construct (continuous-time) functions ψ ∈ L²(ℝ) such that, introducing the corresponding translations and dilations ψ_{jn} of ψ defined by
\[
\psi_{jn}(t) = 2^{-j/2}\, \psi(2^{-j} t - n), \qquad t \in \mathbb{R},
\]
the family {ψ_{jn}, j, n ∈ ℤ} is an orthonormal basis of L²(ℝ). Therefore, any signal x ∈ L²(ℝ) may be expanded in a unique way as
\[
x(t) = \sum_{j=-\infty}^{+\infty} \sum_{n=-\infty}^{+\infty} \langle x, \psi_{jn}\rangle\, \psi_{jn}(t).
\]
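In the discrete, finite-length case, such an orthonormal wavelet expansion and its inverse are readily available in standard libraries. The short sketch below assumes the PyWavelets package and a Daubechies-6 wavelet; it simply checks the perfect-reconstruction property of the basis expansion.

```python
import numpy as np
import pywt

# Expand a discrete signal on an orthonormal Daubechies-6 wavelet basis and
# reconstruct it from the coefficients (discrete counterpart of the expansion above).
x = np.random.randn(1024)
coeffs = pywt.wavedec(x, 'db6', level=5)     # [approximation, detail_5, ..., detail_1]
x_rec = pywt.waverec(coeffs, 'db6')
print(np.allclose(x, x_rec[:len(x)]))        # perfect reconstruction up to numerical precision
```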
Fig. 3.3. Dyadic tree of scale-time indices for a wavelet basis ψ_{jn}.
Local cosines
Fig. 3.4. Comparing wavelets and local cosines. Daubechies 6 (top left) and
Vaidyanathan (bottom left) wavelets for three different values of the scale-time in-
dex (i, n), and narrow (top right) and wide (bottom right) local cosine atoms for
two different values of the time-frequency index (n, k).
three wavelets with different scales and position, using Daubechies wavelets of
order 6 (top) and Vaidyanathan wavelets (bottom).^ This illustrates a general
property of wavelet bases: the smoothness of the wavelet increases with the
size of its support.
On the right are represented local cosine atoms corresponding to two dif-
ferent window sizes (narrow window on the top plot, wide window on the
bottom plot) for two different locations and frequencies. In both cases, the
window function is compactly supported, which results in a poor frequency
localization for its Fourier transform.
freedom in the construction of the basis, and such non-orthonormal bases may
be constructed from 'nicer' time-frequency atoms (for example, the atoms can
be made smoother, or more symmetric, or better time localized). Another vari-
ant is obtained by assuming that the waveform basis is obtained from two or
more time-frequency atoms rather than a single one. Again, one obtains in
such situations 'nicer' time-frequency atoms. Examples of such a strategy are
provided by the multiwavelets, or the more general multiple bases (see for
example [17], [545]).
\[
A\, \|x\|^2 \;\le\; \sum_{i \in \mathcal{I}} \big| \langle x, \varphi_i \rangle \big|^2 \;\le\; B\, \|x\|^2 .
\]  (3.7)
This implies in particular that the frame is complete in ℋ (as defined above), and therefore that any x ∈ ℋ may be expanded as in (3.2). The main difference is the fact that the frame is generally not exact, which means that the waveforms {φ_i, i ∈ ℐ} form a redundant system in ℋ (one sometimes speaks of
^Notice that contrary to a common usage in the signal processing literature, the
term 'frame' does not represent a time interval, but rather a family of vectors in
a vector space. Since this terminology is also standard in mathematics, we shall
nevertheless use it here.
^While Parseval's formula (3.5) expresses energy conservation, such a weak Par-
seval's formula expresses energy equivalence, i.e., the fact that the energy of the
sequence of coefficients is controlled by the energy of the signal.
which may be interpreted as an inversion of the transform x → {⟨x, φ_i⟩, i ∈ ℐ}. In both situations, the coefficients ⟨x, ψ_i⟩ or ⟨x, φ_i⟩ provide an alternate representation of the signal that often proves useful for several signal analysis tasks.
The most classical choices for waveform frames are provided by the so-
called Gabor frames and wavelet frames, whose basic theory is discussed in detail in [113] (see also [184] and [261]). We shall limit our discussion here to
Gabor frames and generalizations. The reader more interested in wavelet and
multiresolution frames is invited to refer to the vast literature on the subject
(see for example [89], [113] for tutorials).
Gabor frames, which actually correspond to sampled versions of the short-
time Fourier transform described in Chapter 2, have been fairly popular in the
musical signal representation community. Gabor frames are generated from a
unique window function g by time and frequency translations. In the discrete
time ℓ²(ℤ) setting, the corresponding discrete Gabor functions g_{nk} (where n and k control time and frequency respectively) read
\[
g_{nk}(l) = e^{2i\pi k k_s (l - n n_s)}\, g(l - n n_s).
\]  (3.10)
Here, n_s and k_s represent time and frequency sampling rates, respectively. It may be shown that under some mild assumptions, for any window g ∈ ℓ²(ℤ), the family {g_{nk}, n, k ∈ ℤ} is a frame of ℓ²(ℤ) as soon as the product n_s k_s is small enough. The latter essentially controls the redundancy of the Gabor frame: the smaller n_s k_s, the closer the atoms. For example, increasing n_s reduces time redundancy, which may be compensated for by decreasing k_s.
Many examples of signal processing applications of Gabor frames may be
found in the literature; see e.g. [185].
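To make (3.10) concrete, the following sketch generates discrete Gabor atoms and computes the frame coefficients ⟨x, g_{nk}⟩ on a regular time-frequency grid. The circular time translation, the zero-padding of the window, and the choice of k_s as a fraction of the sampling rate are simplifying assumptions; practical Gabor analysis is normally implemented with FFT-based routines.

```python
import numpy as np

def gabor_atom(L, n, k, n_s, k_s, g):
    """Discrete Gabor atom g_{nk}(l) = exp(2i*pi*k*k_s*(l - n*n_s)) g(l - n*n_s), cf. (3.10).

    L : signal length; g : window samples (len(g) <= L assumed);
    n_s : time step in samples; k_s : frequency step in cycles per sample.
    The window is zero-padded to length L and translated circularly.
    """
    l = np.arange(L)
    shifted = np.roll(np.pad(g, (0, L - len(g))), n * n_s)
    return np.exp(2j * np.pi * k * k_s * (l - n * n_s)) * shifted

def gabor_coefficients(x, n_s, k_s, g):
    """Frame coefficients <x, g_{nk}> on a regular grid (conjugate-linear in the atom)."""
    L = len(x)
    N, K = L // n_s, int(round(1.0 / k_s))
    return np.array([[np.vdot(gabor_atom(L, n, k, n_s, k_s, g), x)
                      for k in range(K)] for n in range(N)])
```

Choosing a small product n_s k_s in this sketch directly produces the denser, more redundant grids discussed above.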
Gabor frames are often well adapted to musical signal processing, as they
provide 'direct access' to time and frequency variables simultaneously. Also,
using Gabor frames rather than the corresponding orthonormal bases^ allows
^A 'no-go theorem' known as Balian-Low phenomenon states that there cannot
exist 'nice' orthonormal bases of Gabor functions [113].
one to use 'nicer' (i.e., smoother and better localized) windows, and keep
translation invariance, which is a very important feature. On the other hand,
Gabor frames are made out of fixed resolution waveforms, which can therefore
not be adapted to the features of the signal. Hence, it is difficult to find in
such schemes a representation that would be well adapted to both transients
and partials of sound signals.
A good illustration of this fact is provided by the two images in Fig. 3.5
below, in which are displayed the Gabor frame expansions of a short piece
of guitar signal using two different windows. It clearly appears that a Gabor
frame generated using a narrow window (left image) is able to capture accu-
rately the transient parts of the signal, while a Gabor frame generated using
a wide window is much more precise for capturing partials. Neither of them
is able to do a good job for both types of components.
Fig. 3.5. Grey level images of two different Gabor frame representations of a short piece of guitar sound. Left: narrow Hanning window. Right: wide Hanning window.
From the above example, the following question arises naturally: Is there
a way of decomposing a signal into 'layers' that could be adequately repre-
sented by an appropriate waveform system (Gabor frame or other)? A very
interesting outcome of the frame theory is the fact that given a pair (or a
larger, finite family) of frames in a given signal space, their union is still a
frame of the same space, and thus suitable for expanding signals. Frames gen-
erated as unions of Gabor frames are called multiple Gabor frames. Multiple
Gabor frames have been considered in the literature for various purposes,
including source separation in signals and images, or musical signal process-
ing; see for example [150], [312], [690]. Examples discussed in these references
were generally based on a family of Gabor frames with identical windows at
different time scales. The goal in such situations is to be able to represent
Fig. 3.6. Multilayered decomposition of a synthetic signal, obtained using the Time-
Frequency Jigsaw Puzzle technique, described in Section 3.3.2 below. From top to
bottom: original signal, tonal layer, and transient layer.
Remark 5. The need for adaptivity: In some sense, such multiple Gabor frames
provide a way to retain the best of the Gabor and multiresolution worlds. How-
ever, achieving such a program turns out to be quite difficult for practical
purposes, mainly because one has to deal with the extra redundancy introduced by the use of several frames together. Indeed, given R Gabor frames {g_{nk}^{(r)}, (n, k) ∈ ℐ, r = 1, . . . , R}, there is an infinite number of ways to expand any x ∈ ℋ as
Dictionaries
By 'waveform dictionary', one generally means a family of waveforms which
is more redundant than a frame. By definition, a dictionary in some signal space ℋ is a complete family of elements of ℋ, that is, a family such that any signal x ∈ ℋ admits an expansion as a linear combination of elements of the dictionary. In infinite-dimensional signal spaces, dictionaries may even
inequality of (3.7) to be satisfied. However, when it comes to practical situa-
tions, i.e., finite-dimensional signal spaces, the dictionaries which are generally
considered in the literature are also frames, so that the distinction between
'frame methods' and 'dictionary methods' refers to the techniques that are used
to find the expansion of signals with respect to such systems rather than the
intrinsic properties of these systems. Therefore, we shall address 'dictionary
techniques' in Section 3.3.2 below.
where the indices i_1, . . . , i_J have been chosen so that the absolute values of the corresponding coefficients α_{i_j} are sorted in decreasing order.
Fig. 3.7. Two sample musical signals: a castanet signal (top) and an organ signal
(bottom).
10" 10"
10" 10
-20 20
rS
10"
10"
I
10"'
-20 20
Fig. 3.8. pdf of various representations of the two sample signals of Fig. 3.7: ln(P(α)) vs α, for the castanet (solid line) and organ (dashed line). Top: time samples and Fourier coefficients; bottom: wavelet and MDCT coefficients.
The matching pursuit (MP) approach [430] is in the class of so-called 'greedy' algorithms, i.e., according to the Wikipedia encyclopedia,^10 an algorithm which follows the problem-solving meta-heuristic of making the locally optimum choice at each stage with the hope of finding the global optimum. MP is an iterative procedure that aims at approximating a signal through a weighted sum of atoms such as in (3.12), where the atoms belong to a given redundant dictionary 𝒟.
The basic principle of MP is as follows:
and approximated by the first term of the right-hand side of the above equa-
tion, provided the residual rj^i is small enough (in norm).
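A plain matching pursuit iteration can be sketched as follows, assuming the dictionary is stored as a matrix whose columns are unit-norm atoms; the stopping rule (a fixed number of iterations here) and the function name are illustrative choices.

```python
import numpy as np

def matching_pursuit(x, D, n_iter=50):
    """Plain matching pursuit over a dictionary D of unit-norm atoms.

    x : signal vector of length L;  D : (L, P) matrix whose columns are the atoms.
    Returns the list of (atom index, coefficient) pairs and the final residual.
    """
    residual = x.astype(float).copy()
    decomposition = []
    for _ in range(n_iter):
        corr = D.T @ residual                 # inner products <r_j, phi_p>
        p = np.argmax(np.abs(corr))           # greedy choice: locally optimal atom
        a = corr[p]
        decomposition.append((p, a))
        residual = residual - a * D[:, p]     # r_{j+1} = r_j - <r_j, phi_p> phi_p
    return decomposition, residual
```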
^10 See http://en.wikipedia.org/wiki.
In such a way a subset of each frame is selected, and the signal may
be iteratively projected orthogonally onto the corresponding subspace of the
signal space (as in orthogonal matching pursuit). At each iteration, a residual
signal is produced, and processed in a similar way. The iteration stops when
the precision is considered satisfactory. So far, no known proof exists for the
convergence of this method. However, the convergence is quite fast in practice:
less than 20 iterations are needed to achieve 300 dB signal-to-noise ratio.
In addition, TFJP provides decompositions of signals into 'layers' as follows: assuming that two frames G^(1) = {g_λ^(1)} and G^(2) = {g_δ^(2)} are considered, the algorithm provides an expansion of any signal x in the form
where x^(1) (x^(2)) is the 'component' (termed 'layer') of the signal which has been 'identified' by the frame G^(1) (G^(2)), and Λ and Δ are (small) subsets of the global index set (in general a subset of ℤ²).
The simplest instance of this method is based on a pair of two Gabor
frames, with significantly different window sizes; say, for audio signals, 5 ms
and 45 ms. The corresponding Gabor atoms, when used in the framework of
the T F J P method, identify nicely partials and transients. An illustration of
such a strategy on a simple synthetic signal may be found in Fig. 3.6 above.
As long as the signal can be correctly modelled as a superposition of partials
(with slowly varying amplitude and frequency) and transients, T F J P is able
to identify and separate them. In the presence of more complex phenomena,
the method should be refined, and should include different types of atoms (for
example chirps). However, as is well known from matching pursuit approaches,
enlarging the dictionary of atoms does not necessarily improve the accuracy
of the identification of signal components: the more redundant the dictionary,
the larger the ambiguity of the selection.
The basic principle of T F J P (see [312, variant 2]) is presented in Algo-
rithm 3.2 below.
Select supertiles for which window #2 yields the smallest entropy. Reconstruct the corresponding contribution x_i^(2) to layer 2.
Set r_{i+1}(t) = r_{i+1/2}(t) − x_i^(2)(t).
3. Reconstruct layers 1 and 2 by summing up the contributions x_i^(1) and x_i^(2), respectively.
Fig. 3.9. Subtree (in black) of the dyadic tree of Fig. 3.3 (suppressed edges appear
as dotted lines).
Fig. 3.10. MDCT coefficient domain, and corresponding tubes of significant MDCT
coefficients.
a_i = ⟨x, ψ_i⟩,  b_j = ⟨x, u_j⟩,
the latter being processed further to get estimates for the coefficients α_λ and β_δ. This type of approach, which was taken in Berger et al. [34] and
Daudet et al. [115], does not yield structured approximations, the coefficients
being processed individually. In the above-mentioned references, an iterative
approach was chosen, in which the MDCT layer was estimated first and re-
moved from the signal prior to the estimation of the wavelet layer. The diffi-
culty is then to provide a prior estimate for the number of MDCT coefficients
to retain for estimating the tonal layer. To this end, a transientness index was
proposed, based on entropic measures [468]. The latter actually provides an
estimate for the proportion of wavelet versus MDCT coefficients present in
the signal. Using this ingredient, the (unstructured) hybrid model estimation
procedure presented in Algorithm 3.3 is obtained.
3. Subtract the tonal estimate from the signal to get the non-tonal estimate x_nton(t) = x(t) − x_ton(t).
4. Compute the wavelet coefficients of the non-tonal estimate a_n = ⟨x_nton, ψ_n⟩; select the J_w largest (in magnitude) ones a_{n_1}, . . . , a_{n_{J_w}} and construct the transient estimate
\[
x_{\mathrm{trans}}(t) = \sum_{j=1}^{J_w} a_{n_j}\, \psi_{n_j}(t).
\]
5. Subtract the transient estimate from the signal to get the residual estimate x_res(t) = x_nton(t) − x_trans(t).
Berger et al. [34] also suggested a greedy approach in which several passes of this two-step procedure are expected to yield more precise estimates for the layers x^(1) and x^(2). More precisely, a first estimate of the tonal layer is obtained by picking the largest coefficients of an MDCT expansion. This estimate is subtracted from the signal, and a first estimate of the transient layer is obtained from the largest wavelet coefficients of this residual. The tonal estimate is then updated by picking the largest coefficients of the MDCT expansion of this 'second-order residual', and so on. The difficulty of such
approaches is mainly in answering the question 'how many large coefficients
should one keep at each step?' The transientness index alluded to above could
perhaps be used at this point, but to our knowledge this has not been done
up to now.
To estimate structured significance maps, coefficients have to be processed
jointly rather than individually. In [115], a functional on the space of connected
subtrees of the wavelet tree is proposed. Numerical optimization of this func-
tional yields estimates for significance trees A of wavelet coefficients, and thus
and the non-tonal layer reads
the probability that a node of the tree belongs to the significance tree A,
assuming that its parent belongs to A. The corresponding observed wavelet
and the residual reads
\[
g_{s,n,k}(l) = e^{2i\pi k k_s (l - n n_s)}\, g_s(l - n n_s),
\]
the window g_s being a rescaled copy of g, at scale s, and the indices {s, n, k} belonging to some fixed index set. A 'harmonic atom' is defined as a group of M harmonically related atoms:
for choosing the best atom at every iteration, is (approximately) the sum of the squared inner products of the individual atoms, i.e.,
\[
\sum_{m=1}^{M} \big| \langle x, g_{s,n,k_m} \rangle \big|^2 .
\]
Fig. 3.14. Result of the meta-molecular matching pursuit (MSP) algorithm on three
notes played by a clarinet. The molecules are shown as grey rectangles superimposed
on the spectrogram. Black boxes show the detected meta-molecules, which in this
very simple case correspond to the notes.
3.4 Conclusion
Acknowledgements
This work was supported in part by the European Union's Human Potential
Programme, under contract HPRN-CT-2002-00285 (HASSIP), and the Math-
STIC program of the French Centre National de la Recherche Scientifique,
in the framework of the project 'Approximations parcimonieuses structurees
pour le traitement de signaux audio'.
Part II
Beat Tracking and Musical Metre Analysis

Stephen Hainsworth
4.1 Introduction
Imagine you are sitting in a bar and your favourite song is played on the
jukebox. It is quite possible that you might start tapping your foot in time
to the music. This is the essence of beat tracking and it is a quite automatic
and subconscious task for most humans. Unfortunately, the same is not true
for computers; replicating this process algorithmically has been an active area
of research for well over twenty years, with reasonable success achieved only
recently.
Before progressing further, it would be useful to define beat tracking
clearly. This involves estimating the possibly time-varying tempo and the
locations of each beat. In engineering terms, this is the frequency and phase
of a time-varying signal, the phase of which is zero at a beat location (i.e.,
where one would tap one's foot). When musical audio signals are used as an
input, the aim of 'beat-tracking' algorithms is to estimate a set of beat times
from this audio which would match those given by a trained human musician.
In the case where a notated score of the music exists, the musician is used as
a proxy for it (hopefully the musician's set of beats would align with those in
the score). Where no score exists, the musician's training must be accepted
to return a metre equivalent to how the music would be notated. Note that
this implies that it is the intended rather than the perceived beat structure
that is the focus here.
Beat tracking as just described is not the only task possible. Some algo-
rithms attempt only tempo analysis, i.e., finding the average tempo of the sam-
ple; others attempt to find the phase of the beat process and hence produce a
'tapping signal'. Meanwhile, some methods also attempt a full rhythmic tran-
scription and attempt to assign detected note onsets to musically relevant
locations in a temporally quantized representation. This is often considered
in terms of the score which a musician would be able to read in order to
recreate the musical example [352]. MIDI signals are also commonly used as
Table 4.1. Summary of beat-tracking methods. Key for Input column: A = audio, M = MIDI, and S = symbolic.

Approach                Author and year [Ref]              Input   Causal
1) rule-based           Steedman 1977 [607]                S
                        Longuet-Higgins & Lee 1982 [418]   S
                        Povel & Essens 1985 [529]          S
                        Parncutt 1994 [497]                S
                        Temperley & Sleator 1999 [622]     M
                        Eck 2000 [165]                     S
2) autocorrelative      Brown 1993 [55]                    S
                        Tzanetakis et al. 2001 [632]       A
                        Foote & Uchihashi 2001 [194]       A
                        Mayor 2001 [445]                   A
                        Paulus & Klapuri 2002 [503]        A
                        Alonso et al. 2003 [15]            A
                        Davies & Plumbley 2004 [118]       A       X
3) oscillating filters  Large 1994 [390]                   M       X
                        McAuley 1995 [450]                 M       X
                        Scheirer 1998 [564]                A       X
                        Toiviainen 1998 [626]              M       X
                        Eck 2001 [166]                     A
4) histogramming        Gouyon et al. 2001 [245]           A
                        Seppanen 2001 [573]                A       X
                        Wang & Vilermo 2001 [661]          A
                        Uhle & Herre 2003 [635]            A
                        Jensen & Andersen 2003 [318]       A       X
5) multiple agent       Allen & Dannenberg 1990 [14]       M
                        Rosenthal 1992 [546]               M
                        Goto et al. 1994 [221]             A       X
                        Dixon 2001 [148]                   A/M
6) probabilistic        Laroche 2001 [392]                 A
                        Cemgil et al. 2000 [75], [76]      M
                        Raphael 2001 [537]                 A/M
                        Sethares et al. 2004 [577]         A
                        Hainsworth & Macleod 2003 [266]    A
                        Klapuri 2003 [349]                 A       X
                        Lam & Godsill 2003 [386]           A
                        Takeda et al. 2004 [617]           M
                        Lang & de Freitas 2004 [387]       A
[Fig. 4.1: a notated score excerpt with its tatum, beat, and bar levels, together with a set of expressive onset timings.]
However it is notated, the rate at which beats occur defines the tempo of the
music [404].
At a lower level than the beat is the tatum, which is defined to be the
shortest commonly occurring time interval. This is often defined by the 1/8th notes (quavers) or 1/16th notes (semiquavers). Conversely, the main metrical
level above the beat is that of the bar or measure. This is related to the rate of
harmonic change within the piece, usually to a repeated pattern of emphasis
and also notational convention. Fig. 4.1 gives a diagrammatic representation
of the above discussion. Included is a set of expressive timings for the score
given. While obvious, it should also be noted that onsets do not necessarily
fall on beats and that beats do not necessarily have onsets associated with
them.
From here, metrical levels below the beat, including the tatum level, will
be termed the sub-beat structure, while the converse (bar levels, etc.) will be
labelled the super-beat structure. In between the tatum and beat, there may
be intermediary levels, usually related by multiples of two or three (compound
time divides the beat into three sub-beats, for instance). The same applies be-
tween the beat and bar levels. Gouyon [242] gives a comprehensive discussion
of the semantics behind the words used to describe rhythm, pointing out many
of the dualities and discrepancies of terminology. One point he raises is that
the terms beat or pulse are commonly used to describe both an individual
element in a series and the series as a whole.
An interesting point is raised by Honing [294], who discusses the duality
between tempo variations and timing: the crux of the problem is that a series of
expressively timed notes can be represented either as timing deviations around
a fixed tempo, as a rapidly varying tempo, or as any intermediate pairing. This
is a fundamental problem in rhythm perception and most algorithms arrive at
an answer which lies between the extremes by applying a degree of smoothing
to the processes; this usually means that the estimated tempo change over an
analysis segment is constrained by the algorithm and any additional error in
expected timing of onsets is modelled as a timing deviation.
This leads to the concept of quantization, which is the process of assess-
ing with which score location an expressively timed onset should be associ-
ated. Here, score location refers to the timing position the onset would take
instruments have a transient onset which has much in common with percussive
sounds. Percussive sounds are usually characterized by significant increases in
signal energy (a 'transient') and methods for detecting this type of musical
sound are relatively well developed. Harmonic change with little associated
energy variation is much harder to reliably detect and has received less atten-
tion in the literature. Two recent studies of onset detection are Bello et al. [30]
and Collins [95].
While the discussion below assumes that a hard detection decision is made
as to whether an onset is present at a given location, the beat trackers dis-
cussed below which work on continuous detection functions also need to trans-
form the raw audio into something more amenable. They also process the
signal in ways similar to those described below but do not perform the step
of making hard onset detection decisions, instead leaving this to the later
beat-tracking process. The hard-decision onset detection method yields a set
of discrete onset times, whereas the latter method results in a continuous
function from which beat tracking is performed.
Transient events, such as drum sounds or the start of notes with a signifi-
cant energy change (e.g. piano, guitar), are easily detected by examining the
signal envelope. A typical approach, which is an adaptation of methods used
by a variety of other researchers [148], [392], [564], proceeds as follows: An
energy envelope function E^ (t) is formed by summing the power of frequency
components in the spectrogram for each time slice over the range required:
\[
E_j(n) = \sum_{k \in \mathcal{K}_j} \big| \mathrm{STFT}_w(n, k) \big|^2 ,
\]  (4.1)
the gradient of E_j(n), and peaks in this function are detected. The linear regression fits a line Y_i = a + bX_i + e_i to a set of N data pairs; we are only interested in the estimate of b, which is given by b̂ = (Σ_{i=1}^{N} X_i Y_i − N X̄ Ȳ) / (Σ_{i=1}^{N} X_i² − N X̄²), where X̄ and Ȳ denote the averages of X and Y, respectively. In the case here, X is the equi-spaced set of time indices n in E_j(n) and Y is the corresponding E_j. In the case where N = 3, this reduces to
\[
D_j(n) = \frac{E_j(n+1) - E_j(n-1)}{2}.
\]  (4.2)
It should be noted that the commonly used technique of differencing the signal, where D_j(n) = E_j(n) − E_j(n−1), is simply linear regression with N set to 2. The linear regression approach, like that of Klapuri [347], aims to detect the start of the transient, rather than the moment it reaches its peak power.
Dj(n) is often called a detection function [30] and is a transformed and
reduced signal representation. Subsequent processing needs to detect the on-
sets contained within this. This is usually done by simply selecting maxima
in Dj(n) and discarding peaks which do not pass a series of tests. Low-energy
peaks should be ignored (for instance by testing if they are less than two times
the local 1.5-second average of E_j), and peaks can also be ignored if there is a higher-energy peak in the local vicinity^ by using Dixon's timing criterion
[148]. Thresholds and constants are usually heuristically determined and de-
signed to give reasonable performance with a large range of styles. Figure 4.2
shows an example of a peak extraction method. When several sub-bands j
are involved, the functions Dj(n) can be combined by half-wave rectifying
and across-band summing before the peak-picking process [37], [347].
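The energy-based detection function (4.1)-(4.2) and the simple peak-picking rules just described can be sketched as follows. The STFT parameters, the frequency band, and the exact thresholds (twice the 1.5-second local average, as mentioned above) are illustrative choices, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import stft

def onset_times(x, fs=44100, n_fft=1024, hop=512, band=(30, 5000)):
    """Transient onset detection: band-limited energy envelope E_j(n) of (4.1),
    3-point regression slope D_j(n) of (4.2), then simple peak picking."""
    f, t, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    k = (f >= band[0]) & (f <= band[1])
    E = np.sum(np.abs(X[k]) ** 2, axis=0)            # energy envelope E_j(n)
    D = np.zeros_like(E)
    D[1:-1] = (E[2:] - E[:-2]) / 2.0                 # D_j(n) = (E_j(n+1) - E_j(n-1)) / 2
    onsets = []
    win = int(1.5 * fs / hop)                        # 1.5 s local average, in frames
    for n in range(1, len(D) - 1):
        if D[n] > D[n - 1] and D[n] >= D[n + 1]:     # local maximum of the detection function
            local = E[max(0, n - win):n + win + 1].mean()
            if E[n] > 2.0 * local:                   # discard low-energy peaks
                onsets.append(t[n])
    return np.array(onsets)
```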
^This is similar to the psychoacoustic masking thresholds found for humans [475],
[694].
Fig. 4.2. Example of onset detection for transient events. The upper plot shows the
energy-based detection function, Ej(n); also shown are horizontal lines giving the
1.5 s local average of the energy function and x's showing the detected onsets. The
lower plot shows the gradient function Dj (n) from which peaks are found.
\[
d(k) = \log\!\left( \frac{\big| \mathrm{STFT}_w(n, k) \big|}{\big| \mathrm{STFT}_w(n-1, k) \big|} \right),
\]  (4.3)
\[
D_j(n) = \sum_{k \in \mathcal{K},\, d(k) > 0} d(k),
\]  (4.4)
where STFT_w(n, k) is the STFT computed with window w. The measure emphasizes positive energy change between successive frames and 𝒦 defines the spectral range over which the distance is evaluated (30 Hz to 5 kHz is suggested as it represents the majority of clear harmonic information in the spectrum).
trum). Another advantage of this method is that it also takes into account
any transient energy which happens to be present as a useful aid.
A window length of about 90 ms is sufficient to give good spectral reso-
lution. To overcome frame to frame variation, histogramming of five frames
(weighted backwards and forwards with a triangular function) before and after
the potential change point was used and also a very short frame hop length
(namely, 87.5% overlap) was chosen to increase time resolution.
Fig. 4.3. Example of the output from the MKL harmonic change detection measure
for an excerpt of Byrd's 4-Part Mass. Onsets were missed at 1 s and 5.9 s while the
onset at 8.1 s is mis-estimated and should occur about 0.1 s later.
the beat and higher metrical levels from lists of onset times in a monophonic
melody. The rules were never implemented by the authors in the original pa-
per for more than five-bar examples, though there have since been several
papers by Lee which are summarized by Desain and Honing [141].
Parncutt [497] developed a detailed model for salience or phenomenal ac-
cent, as he termed it, and used this to inform a beat induction algorithm. Also,
he modelled medium tempo preference explicitly and combined these two in
a model to predict the tactus for a series of repeated rhythms played at dif-
ferent speeds. Comparison to human preferences was good. Parncutt's focus
was similar to that of Povel and Essens [529], while Eck [165] also produced
a rule-based model which he compared to Povel and Essens and others.
Temperley and Sleator [622] also used a series of rules to parse MIDI
streams for beat structure. They quoted Lerdahl and Jackendoff's generative
theory of tonal music (GTTM) [404] as the starting point of their analysis,
using the GTTM event rule (align beats with event onsets) and length rule
(longer notes aligned with strong beats). Other rules such as regularity and a number based on harmonic content were also brought into play. The aim was
to produce a full beat structure from the expressive MIDI input, and a good
amount of success was achieved.^
This was extended to be time varying, hence producing their 'beat spectro-
gram', which was a plot of the local tempo hypothesis versus time.
Other autocorrelation approaches include Mayor [445], who presented a
somewhat heuristic approach to audio beat tracking: a simple multiple hy-
pothesis algorithm was maintained which operated on his so-called BPM
spectrogram, BPM referring to beats per minute. Also Paulus and Klapuri's
method [503] for audio beat analysis utilized an autocorrelation-like function
(based on de Cheveigne's fundamental frequency estimation algorithm [135]),
which was then Fourier transformed to find the tatum. Higher-level metri-
cal structures were inferred with probability distributions based on accent
information derived using the tatum level. This was then used as part of an
algorithm to measure the similarity of acoustic rhythmic patterns. Brown [55]
used her narrowed autocorrelation method to examine the pulse in musical
scores. Davies and Plumbley [118] and Alonso et al. [15], [16] have also pro-
duced autocorrelation-based beat trackers.
The observed signal is a set of impulses s{n) = 1 when there is an onset event
and s{n) = 0 otherwise. The oscillator is given by
where o{n) defines an output waveform with pulses at beat locations with
width tuned by a; see Fig. 4.4. The phase is given by
where η_1 and η_2 are 'coupling strength' parameters. The update equations enable the estimation of the unknown period and phase parameters. Marolt [433], however, points out that oscillators can be relatively slow to converge because they adapt only once per observation.
Fig. 4.4. Example output signals o(t) generated using (4.8) for various values of a and p (left to right: a = 1, p = 10; a = 10, p = 10; a = 1, p = 5).
Large's test data was a series of impulses derived from expressive MIDI
performances and the aim was to track the pulse through the example. An
extra level of complexity which allowed the system to continue following the
beat was to have a second oscillator 180° out of phase which could take over
control from the first if confidence dropped below a certain threshold.
McAuley [450] presented a similar adaptive oscillator model to that of
Large and indeed compared and contrasted the two models. Similarly, Toivi-
ainen [626] extended Large's model to have short- and long-term adaptation
mechanisms. The former was designed to cope with local timing deviations
while the latter followed tempo changes. It was tested on expressive MIDI
bins and count the number of IOIs which fall in each: h(j) = count(i, |o_i − u(j + 0.5)| < 0.5u), where u is the width of a bin. In contrast, Gouyon et al. [247] and Hainsworth [263] treat the IOI data as a set of Dirac delta functions and convolve this with a suitable shape function (e.g. a Gaussian). The resulting function generates a smoothly varying histogram. This is defined as h(j) = Σ_i δ_{o_i} ∗ 𝒩(j), where ∗ denotes convolution and 𝒩(j) is a suitable Gaussian function (low variance is desirable). Peaks can then be identified and the maximum taken as the tempo. Alternatively, Dixon [148] gives pseudocode for an IOI histogram clustering scheme.
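The Gaussian-smoothed IOI histogram can be sketched in a few lines, as below. The use of all forward inter-onset intervals (rather than only successive ones), the grid of candidate periods, the Gaussian width, and the function name are illustrative choices.

```python
import numpy as np

def ioi_tempo(onset_times, sigma=0.01, max_period=2.0):
    """Smooth inter-onset interval histogram h(j): each IOI contributes a narrow
    Gaussian bump, and the highest peak is read off as the dominant beat period."""
    grid = np.arange(0.2, max_period, 0.001)        # candidate periods in seconds
    onset_times = np.sort(onset_times)
    # all pairwise forward intervals shorter than max_period
    iois = [t2 - t1 for i, t1 in enumerate(onset_times)
            for t2 in onset_times[i + 1:] if t2 - t1 < max_period]
    h = np.zeros_like(grid)
    for o in iois:
        h += np.exp(-0.5 * ((grid - o) / sigma) ** 2)   # convolution with a Gaussian bump
    period = grid[np.argmax(h)]                          # peak of the smoothed histogram
    return 60.0 / period                                 # tempo estimate in bpm
```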
Seppanen [572] produced an archetypal histogramming method. After an
onset detection stage, he first extracted tatums via an inter-onset interval his-
togramming method. He then extracted a large number of features (intended
to measure the musical onset salience) with the tatum signal informing the
locations for analysis. These features were then used as the input to an al-
gorithm based on pattern recognition techniques to derive higher metrical
levels including the pulse and bar lines. Seppanen [573] gives further details
of the tatum analysis part of the algorithm. The final thing to note is that
the method was the first to be tested on a statistically significant audio data-
base (around three hundred examples, with an average length of about one
minute).
Gouyon et al. [247] applied a process of onset detection to musical audio
followed by inter-onset interval histogramming to produce a beat spectrum.
The highest peak (which invariably corresponded to the tatum) was then
chosen as the 'tick'. This was then used to attempt drum sound labelling
in audio signals consisting solely of drums [245], to modify the amount of
swing in audio samples [244], and to investigate reliable measures for higher
beat level discrimination (i.e., to determine whether the beat divided into
groups of two or three) [246]. Other histogramming methods include Wang
and Vilermo [661], Uhle and Herre [635], and Jensen and Andersen [318], all
of which present variations on the general approach and use the results for
different applications.
two most notable multiple agent architectures are those of Goto and that of
Dixon.
Goto has produced a number of papers on audio beat tracking of which
[221], [238], [240] are a good summary. His first method centred on a multi-
ple agent architecture where there were fourteen transient onset finders with
slightly varying parameters, each of which fed a pair of tempo hypothesis
agents (one of which was at double the tempo of the other). A manager
then selected the most reliable pulse hypothesis as the tempo at that in-
stant, thereby making the algorithm causal. Expected drum patterns as a
strong prior source of information were used and tempo was tracked at one
sub-beat level (twice the speed) as well as the pulse in order to increase
robustness.
This method worked well for audio signals with drums but failed on other
types of music. Thus, he expanded the original scheme to include chord change
detection [240], each hypothesis maintaining a separate segmentation scheme
and comparing chords before and after a beat boundary.
Dixon [148] has also investigated beat tracking both for MIDI and audio,
with the aim of outputting a sequence of beat times. The algorithm performed
well with a MIDI input, and with the addition of an energy envelope onset
detection algorithm, it could also be used for audio (though with lower perfor-
mance). The approach was based upon maintaining a number of hypotheses
which extended themselves by predicting beat times using the past tempo tra-
jectory, scored themselves on musical salience, and updated the (local) tempo
estimate given the latest observation. The tempo update was a function of
the time coherence of the onset, while the salience measure included pitch
and chord functions where the MIDI data was available. Hypotheses could be
branched if onsets fell inside an outer window of tolerance, the new hypothesis
assuming that the onset was erroneous and maintaining an unadjusted tempo.
Initialization was by analysis of the inter-onset interval histogram. Dixon has
also used his beat tracker to aid the classification of ballroom dance samples
by extracting rhythmic profiles [149].
This section will concentrate on some of the models developed rather than
details of the estimation procedures which are used to evaluate the final an-
swer, as these can often be interchangeable (a point made by Cemgil, who
used a variety of estimation algorithms with the same model [77]).
Again, the various methods can be broken down into two general groups:
those that work with a set of MIDI onsets (or equivalently a set of onsets
extracted from an audio sample) and those that work to directly model a
continuous detection function^ computed from the original signal.
Those who have worked on the problem include Cemgil et al. [77], who worked
with MIDI signals, and Hainsworth [263], who used Cemgil's algorithm as a
starting point for use with audio signals.
The crux of the method is to define a model for the sequential update of
a tempo process. This is evaluated at discrete intervals which correspond to
note onsets. The tempo process has two elements: the first defines the tempo
and phase of the beat process. The second is a random process which proposes
notations for the rhythm given the tempo and phase. A simple example of this
is that, given a tempo, the time between onsets could either be notated as a
quaver or a crotchet, one speeding the tempo up and the other requiring it to
slow down. The probabilistic model will propose both and see which is more
likely, given the past data (and future if allowed).
The model naturally falls into the framework for jump-Markov linear sys-
tems where the basic equations for update of the beat process are given
by
\[
\boldsymbol{\theta}_n = \boldsymbol{\Phi}_n(\gamma_n)\, \boldsymbol{\theta}_{n-1} + \mathbf{v}_n,
\]  (4.12)
\[
s_n = \mathbf{H}_n \boldsymbol{\theta}_n + e_n.
\]  (4.13)
{s_n} is the set of observed onset times, while θ_n is the tempo process at iteration (observed onset) n and can be expanded as
\[
\boldsymbol{\theta}_n = \begin{bmatrix} \rho_n \\ \Delta_n \end{bmatrix}.
\]  (4.14)
ρ_n is the predicted time of the n-th observation s_n, and Δ_n is the beat period in seconds, i.e., Δ_n equals 60 divided by the tempo in beats per minute. Φ_n(γ_n) is the state update matrix, H_n = [1 0] is the observation model matrix, and v_n and e_n are noise terms; these will be described in turn.
The principal problem is one of quantization: deciding to which beat or sub-beat in the score an onset should be assigned. To solve this, the idealized
Fig. 4.5. Figure showing two identical isochronous rhythms. The top rhythm is
much more likely in a musical notation context than the lower.
\[
\boldsymbol{\Phi}_n(\gamma_n) = \begin{bmatrix} 1 & \gamma_n \\ 0 & 1 \end{bmatrix},
\]  (4.15)
\[
\gamma_n = c_n - c_{n-1}.
\]  (4.16)
While the state transition matrix is dependent upon γ_n, this is a difference term between two absolute locations, c_n and c_{n−1}. c_n is the unknown quantized number of beats between the start of the sample and the n-th observed onset. It is this absolute location which is important and the prior on c_n becomes critical in determining the performance characteristics. This can be elucidated by considering a simple isochronous set of onsets: if absolute score location is unimportant, then the model has no way of preferring aligning them to be on the beat over placing them on, say, the first semiquaver of each beat. This is demonstrated in Fig. 4.5. Cemgil [77] broke a single beat into
subdivisions of two and used a prior related to the number of significant digits
in the binary expansion of the quantized location. In MIDI signals there are no
spurious onset observations and the onset times are accurate. In audio signals,
however, the event detection process introduces errors both in localization ac-
curacy and in generating completely spurious events. Thus, Cemgil's prior is
not rich enough; also, it cannot cope with compound time, triplet figures, or
swing. To overcome this, Hainsworth [263] broke down notated beats into 24
sub-beat locations, c^ = {1/24, 2 / 2 4 , . . . , 24/24, 25/24,...}, and a prior was
assigned to the fractional part of c^,
Fig. 4.6. Graphical description of the prior upon c_n. The horizontal axis is the sub-beat location from 1 to 24, while the associated probability p(c_n) is shown on the vertical axis.
Fig. 4.7. Directed acyclic graph of the jump-Markov linear system beat model, with layers for the note lengths, the tempo process, and the observations. The dependence between c_n and γ_n is deterministic, while other dependencies are stochastic.
The tempo process has an initial prior, p(θ_0), associated with it. For the purposes of a general beat-tracking algorithm, it is assumed that the likely tempo range is 60 bpm to 200 bpm and that the prior is uniform within this range.
So far, the model for tempo evolution and proposing a set of onset times has been considered. Finally, the observation model must be specified. s_n is the n-th observed onset time and therefore corresponds to ρ_n in θ_n. Thus, H_n = [1 0]. The state evolution error, v_n, and observation error, e_n, are given suitable distributions; usually, for mathematical convenience, these are zero-mean Gaussians with appropriate covariances [26]. The overall model can
be summarized by a directed acyclic graph (DAG) as shown in Fig. 4.7. It
should be noted that even spurious onsets are assigned a score location.
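As a rough illustration of how (4.12)-(4.16) can be used in practice, the following minimal sketch runs a Kalman-style predict/update step for a single, fixed sequence of quantization hypotheses γ_n. The noise covariances and the onset list are hypothetical placeholders; the systems discussed in this chapter additionally maintain many (c_n, γ_n) hypotheses with particle filters or MCMC rather than committing to one.

```python
import numpy as np

def track_tempo(onsets, gammas, delta0=0.5, q=1e-3, r=1e-3):
    """Kalman filtering of the beat process in (4.12)-(4.16).

    onsets : observed onset times s_n (seconds)
    gammas : assumed quantized beat increments gamma_n = c_n - c_{n-1}
    delta0 : initial beat period in seconds (60 / tempo in bpm)
    q, r   : state and observation noise variances (hand-chosen here)
    """
    H = np.array([[1.0, 0.0]])                  # observation matrix H_n
    theta = np.array([onsets[0], delta0])       # state [rho_n, Delta_n]
    P = np.eye(2)                               # state covariance
    estimates = [theta.copy()]
    for s_n, g in zip(onsets[1:], gammas[1:]):
        Phi = np.array([[1.0, g], [0.0, 1.0]])  # state update matrix (4.15)
        theta = Phi @ theta                     # predict (4.12)
        P = Phi @ P @ Phi.T + q * np.eye(2)
        y = s_n - H @ theta                     # innovation w.r.t. (4.13)
        S = H @ P @ H.T + r
        K = P @ H.T / S                         # Kalman gain
        theta = theta + (K * y).ravel()
        P = (np.eye(2) - K @ H) @ P
        estimates.append(theta.copy())
    return np.array(estimates)

# Hypothetical isochronous example at 120 bpm (beat period 0.5 s):
onsets = np.arange(10) * 0.5 + 0.01 * np.random.randn(10)
print(track_tempo(onsets, gammas=np.ones(10)))
```

In a full tracker, the second column of the returned state (the beat period Δ_n) would be converted back to a tempo estimate, and competing γ_n hypotheses would be scored by their innovation likelihoods.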
When working with real-world audio signals, more information than just
the onset times can be extracted from the signal, and this can aid the analysis
of the rhythm. The most obvious example is the amplitude of onsets while
others include a measure of chordal change and other 'salience' features as
postulated by Parncutt [497] and Lerdahl and Jackendoff [404]. Hainsworth
[263] utilized these in his model as a separate jump-Markov linear system for
amplitude and a zero-order Markov model for salience (here, the salience is
only a function of the current state and has no sequential dependency). There
has been little research into appropriate measures of salience for extracting
accents in music; other than the papers mentioned above, Seppanen [573] and
Klapuri [349] also proposed features which perform this function.
Given the above system, various estimation procedures exist. Cemgil [77]
described the implementation of MCMC methods as well as particle filters to
estimate the maximum a posteriori (MAP) estimate for the rhythm process,
while Hainsworth [263] utilized particle filters to find the MAP estimate for the
posterior of interest, given by p(c_{1:n}, θ_{1:n}, a_{1:n} | s_{1:n}, â_{1:n}, S_{1:n}), where a_{1:n} was
the underlying amplitude process observed as â_{1:n}, and S_{1:n} was the observed
set of saliences. Full details can be found in either of the publications.
Other similar methods include an earlier approach of Cemgil's [79] in which
what he termed the 'tempogram' (which convolved a Gaussian function with
the onset time vector and then used a localized tempo basis function^ to
extract a measure of tempo strength over time) was tracked with a Kalman
filter [41] to find the path of maximum tempo smoothness.
Raphael's methods [537] were based around hidden Markov models where
a triple-layered dependency structure was used: quantized beat locations in-
formed a tempo process which in turn informed an observation layer. The
Markov transition probabilities between states were learned from training data, and
the rhythmic parse was then evaluated in a sequential manner to decide which was the
most likely tempo/beat hypothesis. This was tested on both MIDI and au-
dio (after onset detection) and success was good on the limited number of
examples, though manual correction from time to time was permitted.
Laroche [392], [393] used a maximum likelihood framework to search for
the set of tempo parameters which best fit an audio data sample. The in-
put was processed by typical energy envelope difference methods to extract
a list of onset times. Inter-onset times (which are phase independent) were
then used to provide likelihoods for the 2-D search space with discretized
tempo and swing as the two axes. This algorithm has been included in com-
mercially available Creative sound modules for several years. Lang and de
Freitas [387] presented a very similar algorithm to that of Laroche but used
a continuous signal representation and a slightly more complex estimation
procedure.
^The tempo basis function ψ(t; τ, ω) was defined as a set of weighted Dirac impulses placed at a delay of τ and spaced with a frequency (and hence tempo) given by ω.
The second approach to tracking the beat with stochastic models uses a detec-
tion function and attempts to model this directly instead of extracting onsets
first. As such it must have all the elements of the above models, including a
tempo process and a model for the likelihood of an onset being present at any
given beat or sub-beat location; however, it must also have a model for the
signal itself and what is expected at an onset and between these.
Hainsworth [263] proposed a method using particle filters whereby the
tempo was modelled as a constant velocity process similar to the one described
above and which proposed onsets in a generative manner at likely sub-beat
locations. The signal detection function modelled was a differenced energy
waveform, utilizing high-frequency information, very similar to the detection
function shown in the lower plot of Fig. 4.2.
Onset locations can clearly be seen in this signal representation, and on
close examination all onsets have a very similar evolution in time which can
be well modelled by a hidden Markov model (HMM; see Chapter 2 for a
definition). This is performing the task of onset detection. The model used is
shown in Fig. 4.8 with each state having a different output distribution (also
termed the likelihood). For mathematical convenience, these are Gaussians with
differing means and variances, but sufficiently separated so that the output
distribution of state S_1 does not significantly overlap with that of S_0 or S_2,
etc. This defines a generative model for the signal; by generative, it is meant
that by using a random number generator and the specified distributions,
a process with the same statistical properties as the original signal can be
generated.
A naive scheme simply generates proposals from the prior distributions,
but the Viterbi algorithm (see [654] and Chapter 2) can be used to find the
best path through the HMM and also its probability, which simplifies the
calculation needed once an onset is hypothesized. The model worked well
on the small number of examples tried but required the expected sub-beat
structure to be specified by hand for robust performance.
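To make the decoding step concrete, here is a minimal Viterbi sketch for a left-to-right HMM whose states emit Gaussian-distributed detection-function values. The three states, their means, variances, and the transition matrix are hypothetical placeholders, not the configuration used in the system described above (Fig. 4.8 uses considerably more states).

```python
import numpy as np
from scipy.stats import norm

def viterbi_gaussian(obs, means, stds, trans, init):
    """Most likely state path for an HMM with Gaussian output densities."""
    S, T = len(means), len(obs)
    logp = np.array([[norm.logpdf(o, means[s], stds[s]) for o in obs]
                     for s in range(S)])
    delta = np.full((S, T), -np.inf)
    psi = np.zeros((S, T), dtype=int)
    delta[:, 0] = np.log(init + 1e-12) + logp[:, 0]
    for t in range(1, T):
        for s in range(S):
            scores = delta[:, t - 1] + np.log(trans[:, s] + 1e-12)
            psi[s, t] = np.argmax(scores)
            delta[s, t] = scores[psi[s, t]] + logp[s, t]
    path = [int(np.argmax(delta[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[path[-1], t]))
    return path[::-1], float(np.max(delta[:, -1]))

# Hypothetical 3-state onset model: quiet -> attack -> decay.
means, stds = [0.0, 1.0, 0.4], [0.1, 0.2, 0.15]
trans = np.array([[0.90, 0.10, 0.00],
                  [0.00, 0.60, 0.40],
                  [0.30, 0.00, 0.70]])
obs = [0.02, 0.05, 0.9, 1.1, 0.5, 0.3, 0.05]
print(viterbi_gaussian(obs, means, stds, trans, init=np.array([0.9, 0.05, 0.05])))
```

The returned log-probability of the best path is what can be reused when scoring hypothesized onsets, which is the simplification mentioned in the text.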
In comparison, Sethares et al. [577] proposed four filtered signals (time do-
main energy, spectral centroid, spectral dispersion, and one looking at group
delay) which were then simply modelled as Gaussian noise with a higher vari-
ance at beat locations compared to between them. Looking back at Fig. 4.2,
it can clearly be seen where the variance of the generative noise process used
to model the signal would be higher. A model similar to those above was used
and a particle filter environment chosen for the estimation procedure. The
Fig. 4.8. HMM for the beat-tracking algorithm with Viterbi decoding included. States
S_5, S_6, and S_7 are functionally equivalent to S_4, and S_8 is equivalent to S_0. The
null state, S_9, has no observation associated with it, therefore making transition to
it highly unattractive.
model did not explicitly include a model for sub-beats but seemed to function
well on the data presented.
A somewhat different method for tracking the beat through music was
presented by Klapuri et al. [348], [349]. A four-dimensional observation vector
(as a function of time) was generated by applying a similar method to that
of Scheirer [564] to generate resonator outputs but using different frequency
bands and a different method for extracting the energy signal which also cap-
tures harmonic onsets. A measure of salience, dependent upon the normalized
instantaneous energies of the comb-filter resonators, was also attached to this.
A problem with Scheirer's method was that it was prone to switch be-
tween different tempo hypotheses (usually doubling or halving), and Klapuri
addressed this using an HMM to impose some smoothness to the tempo evo-
lution. He proposed a joint density for the estimation of the period-lengths
of the tatum, tactus, and measure level processes, applying a combination of
sensible priors and dependencies learned from data. The phase of the tatum
and tactus pulse were estimated to maximize the observed salience at beats. In
estimating the phase of the super-beat (measure) structure, a key assumption
made was the expectation of two simple beat patterns which occur frequently
in so-called 4/4 time. While this should considerably aid performance with
music in this time signature, performance in the super-beat estimation was
degraded for examples with a ternary metre (e.g. 3/4). Nevertheless, the algo-
rithm was tested on a significant database and was successful. A comparison
is presented below.
4.11.1 Tests
There has been a move in recent years towards testing algorithms with a large
database of audio samples collated from all genres and usually from standard,
commercially available sources. This was begun by Seppanen [572] with a
database of 330 audio samples, while Klapuri [349] used 478. A comparison
was also undertaken by Gouyon et al. [248] into tempo induction from audio
signals using a large dataset of 3199 examples from three databases and is
currently the most extensive.
The comparison below used a hand-labelled database of 222 samples of
around one minute divided into six categories: rock/pop, dance, jazz, folk,
classical, and choral. The tempos were limited to the range 60-200 bpm with
the exception of the choral samples. Several examples exhibited significant ru-
bato, 8 had a rallentando (slowing down), and 4 had a sudden tempo change.
Forty-two also had varying amounts of swing added. Full details of the data-
base can be found in [263].
Another problem is how to evaluate the performance of a beat-tracking
algorithm. As of this writing, no study has yet made a serious attempt to no-
tate the complete rhythm and idealized score locations of every onset present
in the audio sample;^ rather the assessment has been limited to 'tapping in
time' to the sample and producing an output of beat times that agrees with
those of trained human musicians.
Klapuri [349] gives two criteria, which are adopted here, to judge the per-
formance of an algorithm on a particular example. The first is 'continuous
length' (C-L), by which it is meant the longest continually correctly tracked
segment, expressed as a percentage of the whole. Thus, a single error in the
middle of a piece gives a C-L result of 50%. Another, looser criterion is sim-
ply the total percentage of the whole which is correctly tracked (defined as
'TOT' from now on). Here, both are expressed as percentages of the manually
detected beats which are correctly tracked, rather than of the time stretches
these represent. Using Klapuri's definitions once again, a beat is determined
to be correctly tracked if the phase is within 15% and the tempo period is
correct to within 10%.
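As an illustration of these two criteria, the following sketch scores a list of estimated beat times against annotated beats, using the 15% phase and 10% period tolerances quoted above. The matching logic is a simplified assumption, not the exact scoring code used in the cited studies.

```python
import numpy as np

def beat_scores(est, ref, phase_tol=0.15, period_tol=0.10):
    """Return (continuous-length, total) percentages of correctly tracked beats."""
    est, ref = np.asarray(est), np.asarray(ref)
    correct = np.zeros(len(ref), dtype=bool)
    for i in range(1, len(ref) - 1):
        period = ref[i + 1] - ref[i]
        j = int(np.argmin(np.abs(est - ref[i])))       # nearest estimated beat
        phase_ok = abs(est[j] - ref[i]) <= phase_tol * period
        if 0 < j < len(est) - 1:
            est_period = (est[j + 1] - est[j - 1]) / 2.0
            period_ok = abs(est_period - period) <= period_tol * period
        else:
            period_ok = False
        correct[i] = phase_ok and period_ok
    tot = 100.0 * correct.sum() / len(ref)
    best = run = 0                                     # longest correct run (C-L)
    for c in correct:
        run = run + 1 if c else 0
        best = max(best, run)
    return 100.0 * best / len(ref), tot

ref = np.arange(0, 20, 0.5)                    # annotated beats at 120 bpm
est = ref + 0.02 * np.random.randn(len(ref))   # hypothetical tracker output
print(beat_scores(est, ref))
```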
Here, the trackers^ of Scheirer [564], Klapuri [349], and Hainsworth [263]
are compared and the results are shown in Table 4.2. The columns under 'Raw'
are base results according to the above criteria; however, it is sometimes found
that the beat tracker tracks something which is not the predefined beat but is
a plausible alternative. Usually, this is half the correct tempo (in the case of
fast samples) or double (for particularly slow examples). When swing is en-
countered, it is occasionally possible for the trackers to even track at one and a
half times the tempo (i.e., tracking three to every two correct beats). Doubling
or halving of tempo is psychologically plausible and hence acceptable; however
the errors encountered with swing are not. The second set of columns com-
pares results once doubling and halving of tempo are allowed. Performance
on individual genres is shown graphically in Fig. 4.9 for Hainsworth's and
Klapuri's algorithms.
(Each panel plots the total percentage correct against song index, with separate
curves for the rock/pop, dance, jazz, folk, classical, and choral genres.)
Fig. 4.9. Graphical display of the results for Hainsworth's (top) and Klapuri's (bottom)
beat trackers. The solid line is the raw result while the dashed line is the 'allowed'
result. Note that ordering is strictly by performance for each genre under any
particular criterion.
Table 4.2. Comparison of results on the database. The three beat trackers use
audio data as inputs.
                 Raw                 Allowed
             C-L (%)  TOT (%)    C-L (%)  TOT (%)
Hainsworth     45.1     52.3       65.5     80.4
Scheirer       23.8     38.9       29.8     48.5
Klapuri        55.9     61.4       71.2     80.9
It can be seen that Klapuri's model performs the best in terms of raw
results and continuous tracking, while the performance of Hainsworth when
considering total number of beats with allowed tempo mistakes is about equiv-
alent. Klapuri's method performs better than Hainsworth's with rock/pop and
dance, though it fails somewhat with jazz. Hainsworth's outperforms Klapuri's
on choral music, probably because of the onset detection algorithm used by
Hainsworth (described above in Section 4.4), which gives superior performance
for these choral samples.
Both Klapuri's and Hainsworth's models significantly outperform Scheirer's.
Klapuri [349] compared his model to Scheirer's and also Dixon's [148] mod-
ified MIDI beat-tracker. Seppanen [572] reported that his program was less
successful than Scheirer's, tested on a large database that was a subset of
Klapuri's. Also, on the related issue of tempo induction, the comparison by
Gouyon et al. [243] showed that Klapuri's method performed the best at this
task.
Finally, performance of one of the stochastic models which uses a sig-
nal representation is shown on a single example in Fig. 4.10. This shows
Hainsworth's second stochastic model (described above in Section 4.10.2) with
a swing example. The model is very successful at extracting onsets and is
good at tempo tracking. The limitation is that the expected sub-beat struc-
ture has to be specified in advance. Thus, the model cannot be considered
pan-genre.
4.12 Conclusions
Fig. 4.10. Output of Hainsworth's second stochastic beat tracker (see Sec-
tion 4.10.2) for a swing example, a) shows tracked tempo (dashed) and hand-labelled
tempo (solid); b) shows the onset detection process for the first 10 seconds with solid
vertical lines denoting detected beats and dashed vertical lines showing the detected
swung quavers.
and classical music (which is prone to radical rhythmic evolution and also
has fewer easily extractable beat cues). Classical music particularly seems
to require pitch analysis in order to extract reliable beat cues. Thus, while
the aim is obviously to have a generic beat tracker which works equally well
with all genres, it is likely that in the short term, style-specific cues will have
to be added. Klapuri [353] and Goto [221] both apply knowledge of typical
drum patterns in popular music to their algorithms. Dixon [149] goes a step
further and uses rhythmic energy patterns extracted from audio samples to
aid classification of ballroom dance examples, a process which could easily be
reversed to aid beat tracking.
In addition to better modelling specific styles and the rhythmic expecta-
tions therein, the second area for expansion is to look at better signal rep-
resentations for extracting the cues needed to perform beat tracking. Rock
and pop music, with its drum-heavy style, is easily processed using energy
measures; classical music is much harder to process and only relatively re-
cently have methods been applied to extract note changes where there is little
transient energy. These will need to be improved.
5
Unpitched Percussion Transcription
Derry FitzGerald and Jouni Paulus
5.1 Introduction
Fig. 5.1. Example waveforms. The images on the top row are the time domain
waveforms of a kick drum, a snare drum, and a crash cymbal, from left to right. The
lower row contains the corresponding spectrograms in the same order. The sound
samples are from the RWC Musical Instrument Sound Database [230].
as an impulse function and so a broad range of frequencies will occur in the im-
pact. As a result, all possible modes of vibration of the plate or membrane are
excited simultaneously, and the narrower the frequency band associated with
a given mode, the longer it sounds. The interested reader is referred to [193]
for a mathematical discussion of the properties of ideal membranes and plates.
Many of the membranophones used in a standard rock/pop drum kit can
be tuned by adjusting the tension of the membrane. In conjunction with the
different sizes available for each drum type, this results in considerable varia-
tions in the timbre obtained within a given drum type. Nevertheless, it can be
noted that the membranophones have most of their spectral energy contained
in the lower regions of the frequency spectrum, typically below 1000 Hz, with
the snare usually containing more high-frequency energy than other mem-
branophones. Also, in the context of a given drum kit, the kick drum will
have a lower spectral centroid than the other membranophones. It can also
be noted that idiophones consisting of a metal plate will typically have their
spectral energy spread out more evenly across the frequency spectrum than
the membranophones, resulting in more high-frequency content.
Examples of three different drum instruments' time domain waveforms and
spectrograms are shown in Fig. 5.1. A kick drum is purely a membranophone,
containing a lot of low-frequency energy. A snare drum is also a membra-
nophone, but it has a snare belt attached below the lower membrane. When
the drum is hit, the lower membrane interacts with the snare belt, resulting in
a distinct sound also containing high-frequency energy. This can be observed
or do not have a constant time difference, it is good to limit the minimum and
maximum length of the segments. This guarantees that each of them contains
enough information for extracting relevant features. For example, good initial
guesses for the minimum and maximum lengths could be 50 ms and 200 ms,
respectively. A window function can be used in connection to this. However,
a traditional Hamming or Hanning window, for example, is not appropriate
since it smooths out the informative attack part at the beginning of the seg-
ment. A half-Hanning window which starts from a unity value and decays
to zero at the end of the segment is more suitable, but often windowing is
omitted completely, assuming that events decay to small amplitude naturally
and the signal does not contain sustained sounds at all.
γ_1 = μ_3 / σ^3,   (5.4)
γ_2 = μ_4 / σ^4,   (5.5)

where μ_3 and μ_4 denote the third and fourth central moments of the spectrum and σ its standard deviation, so that γ_1 and γ_2 are the spectral skewness and kurtosis, respectively.
The smaller the kurtosis, the flatter the spectrum. The quantities (5.2)-(5.5)
can also be calculated using a logarithmic frequency scale, as suggested in the
MPEG-7 standard [307].
In comparison with the spectral features, relatively few time-domain fea-
tures have been used in percussive sound classification. Instead, temporal
evolution of the sound is often modelled using differentials of spectral features
extracted in short frames over the segment. Among the features that can be
computed in the time domain, the two most commonly used are temporal
centroid and zero crossing rate. The temporal centroid, a direct analogue to
the spectral centroid, describes the temporal balancing point of the sound
event energy by

TC = Σ_t t E(t) / Σ_t E(t),   (5.6)

where E(t) denotes the root-mean-square (RMS) level of the signal in a frame
at time t, and the summation is done over a fixed-length segment starting
at the onset of the sound event. The feature enables discrimination between
short, transient-like sounds and longer ringing sounds. The zero crossing rate
describes how frequently the signal changes its sign. It correlates with the
spectral centroid and the perceived brightness of the signal. Usually, noise-
like sounds tend to have a larger zero crossing rate than more clearly pitched
or periodic sounds [250].
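Both of these time-domain features are straightforward to compute from a segmented sound event. The sketch below is a minimal illustration; the frame length and the test signal are arbitrary choices rather than values prescribed in the text.

```python
import numpy as np

def temporal_centroid(x, fs, frame=512):
    """Balancing point (in seconds) of the frame-wise RMS energy of a segment."""
    n_frames = len(x) // frame
    rms = np.array([np.sqrt(np.mean(x[i * frame:(i + 1) * frame] ** 2))
                    for i in range(n_frames)])
    t = (np.arange(n_frames) + 0.5) * frame / fs   # frame centre times
    return float(np.sum(t * rms) / (np.sum(rms) + 1e-12))

def zero_crossing_rate(x):
    """Average number of sign changes per sample."""
    s = np.sign(x)
    return float(np.sum(np.abs(np.diff(s))) / (2.0 * len(x)))

# Hypothetical example: a decaying 200 Hz tone sampled at 44.1 kHz.
fs = 44100
t = np.arange(int(0.3 * fs)) / fs
x = np.exp(-8 * t) * np.sin(2 * np.pi * 200 * t)
print(temporal_centroid(x, fs), zero_crossing_rate(x))
```

A noise burst run through the same functions would give a clearly higher zero crossing rate and a shorter temporal centroid, which is exactly the discrimination described above.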
The feature set is generally selected through trial and error, though some au-
tomatic feature selection algorithms have also been evaluated by Herrera et
al. [287]. It was noticed that in most cases, using a feature set that has been
chosen via some feature selection method yielded better results than using all
the available features. Also, a dimension reduction method such as principal
component analysis can be applied to the set of extracted features prior to
classification. For a more detailed description of feature selection methods and
possible transformations, refer to Chapter 2 and Chapter 6.
The extracted features are then used to recognize the percussive sounds in
each segment. There are at least two different ways to do this. The first is
to try to detect the presence of a given drum, even if other drums occur at
the same time, and the other is to attempt to recognize drum combinations
directly. For example, if an input signal consists of snare and hi-hat sounds,
the first approach will attempt to recognize the presence of both instruments
independently from each other, while the latter will attempt to recognize
whether 'snare', 'hi-hat', or 'snare + hi-hat' has occurred, treating sound
combinations as unitary entities.
A problem that arises when recognizing drum combinations instead of
individual drums is that the number of possible combinations can be very
large. Given M different drum types which may all occur independently, there
are 2^M possible combinations of them. That is, the number of combinations
increases rapidly as a function of M, and it becomes difficult to cover them all.
In practice, however, only a small subset of these combinations are found in
real signals. Figure 5.2 illustrates the relative occurrence frequencies of the ten
most common drum event combinations in a popular music database. These
contribute 95% of the drum sound events in the analysed data. When focusing
on the transcription of the drums commonly used in Western popular music,
the number of possible sound types M has usually been limited to the range
of two to eight. Some systems have concentrated on transcribing only the
kick and snare drum occurrences [235], [250], [221], [693], [608], [683], whereas
some others have extended the instrument set with hi-hats or cymbals [620],
[505], or added even further classes such as tom-toms and various percussion
instruments [506], [209].
Classification algorithms can be roughly divided into three different cate-
gories:
decision tree methods,
instance-based methods, and
statistical modelling methods.
With the exception of the work by Herrera et al. [287], there has not been an
extensive comparison of different classification methods as applied to percussive
sounds. Also, the experiments done in [287] concentrated on the classification
of isolated percussive sounds.
(Horizontal axis, instrument combinations from left to right: H, BH, HS, S, B, BC,
BHS, C, T, BCH.)
Fig. 5.2. The relative frequencies of the ten most frequently occurring drum sound
combinations in the RWC Popular Music Database [229]. These combinations con-
tribute 95% of all drum sound events present in the database. A total of five drum
classes were used in the calculations, and they are denoted as follows: H is hi-hat, B is
kick drum, S is snare drum, C is cymbal, and T is tom-tom.
^Tablas consist of a metallic bass drum and a wooden treble drum. Different
hand strokes on these drums produce different sounds.
Even though isolated percussive sounds can be identified quite reliably [285],
real-world recordings are not as easy to analyse. This is due to other simul-
taneously occurring interfering sounds, both other drums and melodic instru-
ments, as well as the fact that drum sounds can vary between occurrences,
depending on how and where they are struck. As a consequence, it is difficult
to construct general acoustic models that would be applicable to any data
and still discriminate reliably between different instruments.
A way to overcome this problem is to train the models with data that is as
similar as possible to the target mixture signals. However, this is not possible
if the exact properties of the target signals are not known in advance or they
vary within the material. Model adaptation has been proposed to alleviate
this problem. In this approach, the idea is to adapt general models to the
mixture signal at hand, instead of using fixed models for each and every
target signal. To date, only three event-based drum transcription systems
have been proposed that take this approach [693], [561], [683].
The earliest percussion transcription system utilizing model adaptation
was that of Zils et al. [693], which used an analysis-by-synthesis approach.
Initially, simple synthetic percussion sounds Zi{n) were generated from low-
pass and bandpass-filtered impulses. These represented very simple approxi-
mations to kick drums and snares respectively, and were then adapted to the
target signal to obtain more accurate models. The algorithm operated with
the following steps:
1. Calculate the correlation function between a synthetic sound event z_i(n) and
   the polyphonic input signal y(n),

   r_i(τ) = Σ_{n=0}^{N_i − 1} z_i(n) y(n + τ),   (5.7)

   where N_i is the number of samples in the sound i, and r_i(τ) is defined for
   τ ∈ [0, N_y − N_i], where N_y is the number of samples in y(n).
z_i(n) ← (1/2) [ z_i(n) + y(τ* + n) ],   n = 0, …, N_i − 1,   (5.8)

where τ* denotes a correlation peak location found in the previous step.
X = Σ_{j=1}^{J} Y_j = Σ_{j=1}^{J} b_j g_j^T.   (5.11)
Fig. 5.3. Magnitude spectrogram of a drum loop containing snare and kick drum.
Fig. 5.4. Basis functions recovered from the spectrogram in Fig. 5.3. From top
to bottom, they are the kick drum amplitude basis function, the snare drum
amplitude basis function, the kick drum frequency basis function, and the snare drum
frequency basis function.
incorrectly detected drums. Using this measure, an overall success rate of 90%
was achieved. However, more effective means of incorporating prior knowledge
were subsequently developed, and are discussed in the following subsections.
B_pr = [ b_pr,1, b_pr,2, …, b_pr,J ],   (5.12)
X ≈ B_pr G,   (5.13)
G = B_pr^+ X,   (5.14)
Ḡ = W G,   (5.15)

where W is the unmixing matrix obtained from ICA and Ḡ contains the independent
amplitude basis functions. This results in amplitude basis functions
which are generally associated with a single source, though there will still be
some small traces of the other sources. Improved estimates of the frequency
basis functions can then be obtained from

B = X Ḡ^+.   (5.16)

In this case, the use of the pseudoinverse is justified in that the rows of Ḡ
are orthogonal and do not share any information, and the pseudoinverse can
be calculated as Ḡ^+ = Ḡ^T (Ḡ Ḡ^T)^{−1}. The overall procedure can be viewed as
a form of model adaptation such as is described in Section 5.2.4.
Figure 5.5 shows a set of priors for snare, kick drum, and hi-hat, respec-
tively. These priors were obtained by performing ISA on a large number of
isolated samples of each drum type and retaining the first frequency basis
function from each sample. The priors shown then represent the average of
all the frequency basis functions obtained for a given drum type. Priors could
be obtained in a similar way using some other matrix-factorization technique
such as NMF (see Chapter 9 for further details). It can be seen that the priors
for both kick drum and snare have most of their energy in the lower regions
of the spectrum, though the snare does contain more high-frequency informa-
tion, which is consistent with the properties of membranophones, while the
hi-hat has its frequency content spread out over a wide range of the spectrum.
The use of prior subspaces offers several advantages for percussion in-
struments. First, the number of basis functions is now set to the number of
Fig. 5.5. Prior subspaces for snare, kick drum, and hi-hat.
prior subspaces used. Second, the use of prior subspaces alleviates the bias
towards sounds of high energy inherent in blind decomposition methods, al-
lowing the recovery of lower-energy sources such as hi-hats. Thus, the use of
prior subspaces can be seen to go some distance towards overcoming some of
the problems associated with the use of blind separation techniques, and so
is more suitable for the purposes of percussion transcription.
A drum transcription system using PSA was described in [188]. Again,
the system only transcribed signals containing snare, kick drum, and hi-hats
without the presence of any other instruments. Prior subspaces were generated
for each of the three drum types, and PSA performed on the input signals.
Once good estimates of the amplitude basis functions had been recovered,
onset detection was carried out on these envelopes to determine when each
drum type was played. To overcome the source-ordering problem inherent in
the use of ICA, it was again assumed that the kick drum had a lower spectral
centroid than the snare, and that hi-hats occurred more frequently than the
snare. When tested on the same material as used with sub-band ISA (see
p. 145 for details), a success rate of 93% was achieved.
More recently, an improved formulation of PSA has been proposed for the
purposes of drum transcription [505], based on using an NMF algorithm with
a priori fixed frequency basis functions B_pr. The NMF algorithm estimates
non-negative amplitude basis functions G so that the reconstruction error of
the model (5.13) is minimized.
This offers a number of advantages over the original formulation of PSA.
First, the non-negative nature of NMF is more in keeping with the data being
analysed, in that the spectrogram is non-negative, and so a decomposition
that reflects this is likely to give more realistic results. Second, keeping Bpr
fixed eliminates the permutation ambiguities inherent in the original PSA
algorithm. This allows the elimination of the assumptions necessary to identify
the sources after separation, and permits the algorithm to function in a wider
range of circumstances.
The NMF-based algorithm estimates G by first initializing all its elements
to a unity value and then iteratively updating the matrix using the rule
G ← G .× [ B_pr^T (X ./ (B_pr G)) ] ./ [ B_pr^T 1 ],   (5.17)

where .× and ./ denote element-wise multiplication and division, and 1 is an all-ones matrix of the same size as X.
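A minimal sketch of this fixed-basis multiplicative update is given below. The spectrogram, the prior basis matrix, and the iteration count are hypothetical; a practical system would follow this step with onset detection on the rows of G, as described in the surrounding text.

```python
import numpy as np

def nmf_fixed_basis(X, B_pr, n_iter=100, eps=1e-12):
    """Estimate non-negative gains G with the frequency basis B_pr held fixed,
    using the multiplicative update in (5.17)."""
    G = np.ones((B_pr.shape[1], X.shape[1]))   # initialize all gains to unity
    ones = np.ones_like(X)
    for _ in range(n_iter):
        ratio = X / (B_pr @ G + eps)           # element-wise X ./ (B_pr G)
        G *= (B_pr.T @ ratio) / (B_pr.T @ ones + eps)
    return G

# Hypothetical toy example: two fixed spectral templates, synthetic mixture.
rng = np.random.default_rng(0)
B_pr = np.abs(rng.normal(size=(257, 2)))       # e.g. kick and snare templates
true_G = np.abs(rng.normal(size=(2, 100)))
X = B_pr @ true_G                              # synthetic magnitude spectrogram
G_est = nmf_fixed_basis(X, B_pr)
print(float(np.mean(np.abs(B_pr @ G_est - X))))  # residual reconstruction error
```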
calculated from the same set of unprocessed samples. For the NMF-based
system, onset-detection thresholds were obtained by analysing a set of training
signals and setting the threshold to a value minimizing the number of detection
errors. For PSA, the source-labelling rules and fixed threshold values from the
original publication were used. The SVM-based classifier was trained so that
when analysing the dry (wet) signals, the training features were also extracted
from the dry (wet) signals. This may have given the SVM method a slight
advantage compared to the other systems.
The NMF-based system performed best of the methods, with the dry mix
material having a hit rate of 96% compared to the 87% of the SVM method
and 67% of PSA. The performance gap became smaller with the production-
grade mixes, but the NMF-based method still had a hit rate of 94% compared
to the 92% of the SVM method and 63% of PSA [505].
these drums, and then onset detection on the hi-hat subspace recovered from
the PSD-normalized spectrogram. As the hi-hat subspace no longer under-
goes ICA with the other drums, the algorithm loses the ability to distinguish
between a snare on its own and a snare and hi-hat occurring simultaneously.
Fortunately, in many cases these drums do occur simultaneously and so this
results in only a small reduction in the efficiency of the algorithm. When tested
on a database of 20 excerpts from pop and rock songs taken from commercial
CDs, an overall success rate of 83% was achieved.
Attempts to extend the basic PSA method to include other drums such
as tom-toms and cymbals met with mixed success. Extensive testing with
synthetic signals revealed that this was due to the fact that when the main
regions of energy of different sources overlap, as is often the case with drums
such as snares and tom-toms, then the sources will not be separated correctly
[186].
As traces of both snare and tom-toms will occur in the idiophone envelope,
an amplitude envelope for snare/tom-toms is obtained by masking kick drum
events in the original spectrogram, and multiplying the resulting spectrogram
by a snare frequency subspace, again using (5.18). ICA is then performed to
separate the snare/tom-tom amplitude envelope and the idiophone amplitude
envelope. Onset detection on the resulting independent idiophone envelope
then yields the idiophone events.
Grouping is then carried out on the idiophones. If two large groups occurred
that did not overlap in time, then both hi-hat and ride cymbal were assumed
to be present; otherwise all events were allocated to the same drum. The
justification for this is detailed in [190]. Unfortunately, though the algorithm
distinguished between ride cymbal and hi-hats, it did not identify which was
which. When tested on a database of 25 drum loops, a success rate of 90%
was obtained using the same measure as sub-band ISA.
Dittmar et al. described a system which attempted to transcribe drums in
the presence of pitched instruments [147]. To enable recovery of low-energy
sources such as hi-hats and ride cymbals, the high-frequency content of the
signal was boosted in energy. A magnitude spectrogram of the processed sig-
nal was obtained, and then differentiated in time. This suppressed some of
the effects of the sustained pitched instruments present in the signal, be-
cause their amplitudes are more constant on a frame-by-frame basis than
that of transient noise, and so when differentiated will have a smaller rate of
change.
Onset detection was then carried out and the frame of the difference spec-
trogram at each onset time extracted. As the extracted frames contain many
repeated drum events, PCA was used to create a low-dimensional represen-
tation of the events. J frequency components were retained and non-negative
ICA [526] performed on these components to yield B, a set of independent
basis functions which characterized the percussion sources present in the
signal. The amplitude envelopes associated with the sources were obtained
from
G = B^T X,   (5.19)

where G are the recovered amplitude envelopes, and X is the original spectrogram.
A set of differentiated amplitude envelopes was then recovered from

G′ = B^T X′,   (5.20)

where G′ are the differentiated amplitude envelopes and X′ is the differentiated
spectrogram. The correlation between G and G′ was used to eliminate recovered
sources associated with harmonic sounds, as sustained harmonic sources will
tend to have lower correlation than percussive sources.
Σ_{k=1}^{K} ( [b]_k + ε ) / ( [b_pr,i]_k + ε ),   (5.21)
sequences of length N is called the N-gram model. These can be used to assess
the likelihoods of different event sequences in percussion transcription, or to
predict the next event.
If the events in the sequence are drawn from a dictionary of size D, there
are D^N probabilities that need to be estimated for an N-gram of length N.
This imposes requirements on the size of the training data set in order that the
resulting N-grams do not contain too many zero-probability entries. Usually,
such entries cannot be completely avoided, so methods for reducing their effect
have been developed. The zero probabilities can either be smoothed (given a
non-zero value) with a discounting method like Witten-Bell discounting [673]
or Good-Turing discounting [218], [90], or the required probability can be
estimated from lower-order N-grams with the back-off method suggested by
Katz [336] or with the deleted interpolation algorithm by Jelinek and Mercer
[317]. The interested reader is referred to the cited publications for details.
The sound event N-grams in music analysis are directly analogous to the
word N-grams in speech recognition. Moreover, just as the words in speech are
constructed from individual letters or phonemes, the mixture-events in percussive
tracks may consist of multiple concurrent sounds from different instruments.
The main difference between the two is that in speech recognition, the order
of individual letters is important and the letters in consecutive words rarely
have any direct dependence, whereas in musicological N-grams, the mixture-
events consist of co-occurring sound events which individually exhibit dependencies
on the same sound event in the neighbouring mixtures. This observation
can be utilized to construct N-gram models for individual instruments,
as suggested by Paulus and Klapuri [506].
When the set of possible instruments {u_1, u_2, …, u_M} is defined, with the
restriction that each instrument can occur only once in each mixture-event,
the problem of estimating mixture-event N-grams can be converted into the
problem of estimating N-grams for individual instruments. In other words,
at each time instant k the instrument u_i has the possibility of being present
in the mixture w_k or not. By using this assumption, the probability estimate
from (5.23) becomes
p(w_k | w_{k−N+1:k−1}) = Π_{u_i ∈ w_k} p(u_i | w_{k−N+1:k−1}).   (5.25)
Since each instrument model has only a binary dictionary, this leads to a total of M·2^N
probabilities to be estimated. This alleviates the zero-frequency problem significantly:
the sharply concentrated prior distribution of different mixture-events (see
Fig. 5.2) means that some of them occur too rarely for reliable probability
estimation, even in a large training set.
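The following sketch illustrates the idea of estimating a separate N-gram for each instrument from binary presence sequences on a tatum grid. The training pattern, the instrument set, and the add-one smoothing are illustrative assumptions rather than the estimation details used in [506].

```python
from collections import defaultdict
from itertools import product

def train_binary_ngrams(sequences, n=3):
    """Per-instrument N-gram probabilities over binary presence sequences.

    sequences : dict mapping instrument name -> list of 0/1 values,
                one value per tatum-grid position.
    Returns p[instr][context][symbol] with add-one smoothing.
    """
    model = {}
    for instr, seq in sequences.items():
        counts = defaultdict(lambda: defaultdict(int))
        for k in range(n - 1, len(seq)):
            context = tuple(seq[k - n + 1:k])
            counts[context][seq[k]] += 1
        probs = {}
        for context in product((0, 1), repeat=n - 1):
            total = sum(counts[context].values()) + 2   # add-one smoothing
            probs[context] = {s: (counts[context][s] + 1) / total for s in (0, 1)}
        model[instr] = probs
    return model

# Hypothetical one-measure rock pattern repeated, on an 8-tatum grid.
pattern = {
    'kick':  [1, 0, 0, 0, 1, 0, 0, 0] * 8,
    'snare': [0, 0, 1, 0, 0, 0, 1, 0] * 8,
    'hihat': [1, 1, 1, 1, 1, 1, 1, 1] * 8,
}
model = train_binary_ngrams(pattern, n=3)
print(model['kick'][(0, 0)])   # P(kick present | two preceding grid points empty)
```

Combining such per-instrument probabilities as in (5.25), possibly weighted by mixture-event priors, then gives a likelihood for a whole mixture-event.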
The main problem with individual instrument N-grams is that they lack
information about simultaneously occurring events. Each N-gram 'observes'
only the presence of its own instrument without any knowledge of the other
co-occurring instruments. As a result, the model may give overly optimistic
or pessimistic probabilities to different mixture-events. One possible way to
address this problem is to use the prior probabilities of the mixture-events in
connection with (5.25), as proposed in [506].
Musicological prediction can also be done using simpler modelling. For
example, if two occurrences of the same event type took place with time
interval t_A, its occurrence can be predicted again after another interval of t_A.
A system relying on this type of modelling was proposed by Sillanpaa et al.
in [590].
The fact that N-grams only use the directly preceding events to predict the
next event is a minor drawback, considering their usage in music or percussive
sound analysis. In particular, the percussive content of music generally
exhibits repeating patterns. Even though the patterns contain the same sequential
data within a musical piece, they tend to vary between pieces. As
a result, temporal prediction operating on immediately preceding events may
not be the most efficient way to model repeating rhythmical patterns.
Based on the above observation, Paulus and Klapuri proposed the use of
periodic N-grams [506] where, instead of using the directly preceding N − 1
events, the idea is to take earlier events separated by an interval L. That is,
when predicting the event at temporal location k, instead of using the events
at locations k − N + 1, k − N + 2, …, k − 1, use the events at the locations
k − (N − 1)L, k − (N − 2)L, …, k − L. The N-gram model of (5.23) is then
reformulated accordingly.
Fig. 5.6. The idea of the normal and periodic N-grams illustrated. Time is quantized
to a tatum grid, each box representing a segment between two grid points. Time
flows from left to right continuing on the next row, each row being one musical
measure. The letters represent the drum instruments played at the corresponding
time instants (B is kick drum, S is snare drum, H is hi-hat, T is tom-tom, and C
is cymbal). The horizontal arrow represents a normal trigram prediction, and the
vertical arrow represents a periodic trigram prediction. The measure length here is
eight tatum periods and L = 8 is set accordingly.
In this formulation, p(M) is the likelihood of the mapping, and p(q | n, A) is the probability
of the label q being present at the temporal location n ∈ {0, …, A − 1} when
the length of the musical measure is A. The total likelihood is calculated over
the whole signal containing all the events z.
The system was evaluated with acoustic signals synthesized from a com-
mercial MIDI database comprising a wide variety of different percussive tracks
[305]. The synthesis was done by using sampled speech sounds and the sounds
of tapping different objects in an office environment. There were fifteen samples
for each sound type, and each synthesized hit was randomly selected
from this set to produce realistic acoustic variation to the synthesis result.
^In the general case, K does not need to be equal to the number of available labels.
The overall error rate of the system was 34%. Error analysis revealed that
there were large differences in performance between different genres, and the
genres with simpler rhythmic patterns were labelled more accurately.
It has been established that musicological modelling is useful in the con-
text of percussive sound transcription. However, the low-level analysis has to
be done sufficiently accurately before the musicological modelling can really
improve the results obtained. Further, the methods required to combine low-
level acoustic recognition with the high-level modelling still need development.
As percussive patterns tend to be different in different time signatures, styles,
and genres, specific models for each of these could be developed.
5.5 Conclusions
An overview of the current state of the art in unpitched percussion transcrip-
tion has been presented. This encompassed both event-based and separation-
based systems, as well as efforts to include high-level language modelling to
improve system performance. As can be seen, there has been considerable
effort expended on tackling the problem of percussion transcription in the
past few years, and a summary of the important systems to date is presented
in Table 5.1.
At present, the best performance has been obtained on systems that focus
on a reduced number of drums: snare, kick drum, and hi-hats in the drums-
only case, and snare and kick drum in the presence of pitched instruments.
This is unsurprising in that the complexity of the problem is greatly reduced
by limiting the number of target instruments. Nonetheless, these systems do
deal with the most commonly occurring drums, and so represent a good start-
ing point for further improvements.
As noted above, many of the systems do not take into account the pre-
dictability of percussion patterns within a given piece of music. However, it
has been established that the use of musicological modelling does consider-
ably improve the performance of a system using only low-level processing. In
particular, it should be feasible to integrate musicological modelling to many
of the separation-based models.
There has also been a trend towards adaptive systems that take into
account the characteristics of the signals being analysed when attempting
transcription, both in event-based and separation-based systems. This is an
attempt to overcome the large variances in the sounds obtained from a given
drum type such as a snare drum. For example, drums in a disco-style genre
have a totally different sound to those in a heavy metal-style piece. These
adaptive systems are to be encouraged, as a system that can be tailored to
suit individual signals is more likely to produce a successful transcription than
a system which makes use of general models.
Table 5.1. Summary of percussion transcription systems. The column Classes con-
tains the number of percussion classes covered by the system. Method describes
the overall approach, E for event-based systems, S for separation-based systems,
M for systems including musicological modelling, and A for systems using adaptive
modelling. An X in Drums only indicates that a system operates on signals con-
taining drums only. Mono/Poly shows whether the system can detect two or more
simultaneous sounds.
solved, and it is hoped that this chapter reflects only the beginning of the
study of unpitched percussion transcription.
5.6 Acknowledgements
Derry FitzGerald was supported in this work by the Irish Research Council
for Science, Engineering and Technology.
6
Automatic Classification of Pitched Musical Instrument Sounds
Perfecto Herrera-Boyer, Anssi Klapuri, and Manuel Davy
6.1 Introduction
This chapter discusses the problem of automatically identifying the musical
instrument played in a given sound excerpt. Most of the research until now
has been carried out using isolated sounds, but there is also an increasing
amount of work dealing with instrument-labelling in more complex music sig-
nals, such as monotimbral phrases, duets, or even richer polyphonies. We first
describe basic concepts related to acoustics, musical instruments, and percep-
tion, insofar as they are relevant for dealing with the present problem. Then,
we present a practical approach to this problem, with a special emphasis on
methodological issues. Acoustic features, or descriptors, as will be argued, are
a keystone for the problem and therefore we devote a long section to some of
the most useful ones, and we discuss strategies for selecting the best features
when large sets of them are available. Several techniques for automatic classi-
fication, complementing those explained in Chapter 2, are described. Once the
reader has been introduced to all the necessary tools, a review of the most rele-
vant instrument classification systems is presented, including approaches that
deal with continuous musical recordings. In the closing section, we summarize
the main conclusions and topics for future research.
Fig. 6.1. Diagram of the different operations involved in setting up an automatic
classification system for musical instrument sounds. The training set is described by
extracting an initial set of features that is refined by means of feature transformation
and selection algorithms. The resulting set of selected features is used to train and
validate (i.e., fine-tune) the classifier. Validation can be done using a different set
of sounds (not shown here) or different partitions of the training set. When fine-
tuning is finished, the classifier is tested with a test set of sounds in order to assess
its expected performance. At the right side of the diagram, using dotted elements,
the automatic classification of an unlabelled (i.e., previously unseen) sound file is
illustrated.
higher auditory centres [260]. In the case of pitched musical instruments, the
relative strengths of the overtone partials (see Fig. 6.2) determine, to a cer-
tain extent, timbre sensations and identification. It seems t h a t , for sustained
sounds, the steady segment provides much more information t h a n the attack,
though the latter cannot be completely neglected [270]. Timbre discrimina-
tion experiments, where sounds are altered in subtle or dramatic ways and
the listeners indicate whether two different versions sound the same, have
provided cues concerning the relevant features for sound classification. Grey
and Moorer [254] found that microvariations in amplitude and frequency are
usually of little importance, and that the frequency and amplitude envelopes
can be smoothed and approximated with line segments without being noticed
by the listeners. Changes in temporal parameters (i.e., attack time, modula-
tions) may have a dramatic impact on the discrimination of timbres [83], [560],
Fig. 6.2. Example spectra from three different instruments. From top to bottom:
a clarinet (playing the note C4), a violin (C6), and a guitar (G5). In the clarinet
sound, note the predominance of the odd partials. In the guitar sound, note
the existence of plucking noise as energy unevenly distributed below 700 Hz.
though probably not in deciding the instrument name. The above facts were
also supported by McAdams et al. [446], who, additionally, found that the
spectral envelope shape and the spectral flux (time-variation of the spectrum;
see Section 6.3.2) were the most salient physical parameters affecting tim-
bre discrimination. Also, it has been noticed that the human sensitivity for
different features depends on the sound source in question.
Very few studies have investigated the human ability to discriminate be-
tween the sounds of different musical instruments. However, some trends can
be identified based on the reviews and experiments by Martin [442] and
Srinivasan et al. [606]. First, humans, even those with musical training, rarely
show performance rates better than 90%. The number of categories in the cited
experiments varied from 9 to 39, and in the most difficult cases the recognition
rate dropped to 40%. Second, confusions between certain instruments are quite
usual, for example, between the French horn and the trombone. Third, the
discrimination performance can be improved by musical instruction and by
exposure to the acoustic material, especially to pairs of sounds from different
instruments. Fourth, instrument families are easier to identify than individ-
ual instruments. Finally, contextual information (i.e., listening to instruments
playing phrases, instead of isolated notes) substantially improves the identifi-
cation performance [340], [57].
Other criteria for elaborating subclasses (beyond those in Table 6.1) can be the playing method,
the shape of the instrument, the relationships of the exciting element to the
resonating element, or the method used to put the exciting element into mo-
tion. For a detailed account on the acoustics of musical instruments, the reader
is referred to Fletcher and Rossing [193] and Rossing [551].
6.2 Methodology
6.2.1 Databases
Databases are one of the crucial elements needed for developing a successful
classification system, as they have to include enough 'representative examples'
in order to grant the generalizability of the models built upon them [112].
Proper data modelling requires the careful preparation of up to three different
and independent data sets (see Fig. 6.1). The first one, usually termed a
training set, is used to build the models, whereas the second one, usually
termed a testing set, is only used to test the model (or the system using it)
and to get an estimate of its efficacy when it will be running in a real-world
system. A third set, usually termed a validation set, is sometimes used during
the design, improvement, and tweaking of a given model. In that case, the
model, as it evolves and improves, is tested using the validation set, and only
when the model preparation phase is finished (i.e., when the performance
improvement on the training data is no longer matched by the performance
improvement on the validation set) is it evaluated against the testing set,
which is kept untouched until then. Of course, the three sets should be sampled
from the same population of sounds.
A testing set that could be shared among research teams would help them
to compare their respective improvements. Unfortunately, most commercial
audio files, MIDI files, and digitized score files cannot be shared. In the
automatic classification of musical instruments, the commercial McGill Uni-
versity Master Samples collection (MUMS) [487] has been frequently used,
though it has not achieved the status of 'reference test set'. More recently,
the University of Iowa sample collection^ and, especially, the RWC database
[230] are attracting the attention of researchers. The latter contains a wide
variety of music files to be used in several music processing problems, and
^http://theremin.music.uiowa.edu/index.html
^http://creativecommons.org
Signal energy or power can be measured at different time scales and used as
an acoustic descriptor. Although these have not shown a high discriminative
power when compared to other features used for musical instrument classifi-
cation, they provide basic information that can be exploited to derive more
complex descriptors, or to filter out potential outliers of a sound collection.
The root mean square (RMS) level of a signal is often used to represent
the perceptual concept of loudness. The RMS level of a discrete time signal
x(n) is calculated as

E(n) = [ (1/N) Σ_{i=0}^{N−1} x(n + i)^2 ]^{1/2},

where n is a discrete time index and N is the size of the analysis frame.
6.3.2 Spectral Features
The spectral flatness measure (SFM) describes how flat (noise-like) the spectrum of a frame is:

SFM(t) = 10 log_10 [ ( Π_{k=1}^{K} |X_t(k)| )^{1/K} / ( (1/K) Σ_{k=1}^{K} |X_t(k)| ) ].   (6.5)

The spectral flux measures the frame-to-frame change of the spectrum and can be computed as

SF(t) = Σ_{k=1}^{K} [ X_t(k) − X_{t−1}(k) ]^2,   (6.6)

where X_t(k) and X_{t−1}(k) are energy-normalized Fourier spectra in the current
frame and in the previous frame, respectively.
Spectral irregularity measures the 'jaggedness' of the spectrum [374], quantifying the amplitude differences between adjacent frequency components (6.7).
The spectral rolloff is defined as the frequency index K_R below which a fixed proportion γ of the total spectral energy is concentrated, i.e. the smallest K_R for which

Σ_{k=1}^{K_R} |X(k)|^2 ≥ γ Σ_{k=1}^{K} |X(k)|^2.   (6.8)
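Frame-wise spectral measures such as the flatness and the rolloff can be computed directly from a magnitude spectrum. The sketch below is a minimal illustration; the window, the FFT length, and the rolloff proportion γ = 0.85 are assumptions, not values prescribed in the text.

```python
import numpy as np

def spectral_features(frame, gamma=0.85):
    """Spectral flatness (dB) and rolloff index of one signal frame."""
    X = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))[1:]   # drop DC bin
    X = X + 1e-12                                                 # avoid log(0)
    flatness = 10 * np.log10(np.exp(np.mean(np.log(X))) / np.mean(X))
    cumulative = np.cumsum(X ** 2)
    rolloff = int(np.searchsorted(cumulative, gamma * cumulative[-1])) + 1
    return flatness, rolloff

# Hypothetical example: white noise should be much flatter than a sine tone.
rng = np.random.default_rng(0)
print(spectral_features(rng.normal(size=2048)))
print(spectral_features(np.sin(2 * np.pi * 440 * np.arange(2048) / 44100)))
```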
ZCR(n) = (1/(2N)) Σ_{i=1}^{N} | sign[x(n + i)] − sign[x(n + i − 1)] |,   (6.9)

where

sign(x) = +1 if x > 0,  0 if x = 0,  −1 if x < 0.   (6.10)
ΔCep_i(t) = [ Cep_i(t + M) − Cep_i(t − M) ] / (2M),   (6.12)

where t is the frame index, Cep_i(t) is the ith coefficient in frame t, and usually
M is 1 or 2. The delta-delta, in turn, can be computed by substituting Cep_i(t)
by ΔCep_i(t) in the above equation.
OER = Σ_{h odd} a^2(h) / Σ_{h even} a^2(h),   (6.13)

T1 = a^2(1) / Σ_{h=1}^{H} a^2(h),   (6.14)

T2 = [ a^2(2) + a^2(3) + a^2(4) ] / Σ_{h=1}^{H} a^2(h),   (6.15)

where a(h) denotes the amplitude of the hth harmonic partial and H is the number of partials.
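Given the amplitudes of the harmonic partials of a note, these harmonic-balance descriptors reduce to a few array operations. The sketch below is only an illustration; the partial amplitudes are hypothetical.

```python
import numpy as np

def harmonic_descriptors(a):
    """Odd-to-even ratio and tristimulus values; a[0] is the amplitude a(1)."""
    e = np.asarray(a, dtype=float) ** 2            # energies of the partials
    total = e.sum()
    oer = e[0::2].sum() / (e[1::2].sum() + 1e-12)  # odd partials / even partials
    t1 = e[0] / total                              # fundamental
    t2 = e[1:4].sum() / total                      # partials 2-4
    t3 = e[4:].sum() / total                       # partials 5 and above
    return oer, (t1, t2, t3)

# Hypothetical clarinet-like spectrum: strong odd partials, weak even ones.
partials = [1.0, 0.05, 0.6, 0.04, 0.4, 0.03, 0.25, 0.02]
print(harmonic_descriptors(partials))
```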
The time dimension is usually less represented in the feature sets proposed for
the automatic classification of musical sounds. The evolution of a given feature
over time can be partially characterized by computing its variance, or the first-
and second-order differences. Apart from these, specialized descriptors, such
as the attack time, the temporal centroid, or the rate and depth of frequency
modulation have proven to be useful for discriminating between instrument
sounds [442], [199], [514], [615], [367].
The term amplitude envelope is generally used to refer to a temporally
smoothed version of the signal level as a function of time. In practice, it can
be calculated by lowpass filtering (with a 30-Hz cut-off frequency) the vector
of RMS levels E(n) of a signal. In the case of analysing isolated notes, once
the envelope is computed, it is possible to segment it into attack, sustain, and
release sections, as shown in Fig. 6.4 (though percussion and plucked string
sounds do not have the sustain part). Specific descriptors for each of these
segments can also be computed.
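As a rough illustration, the following sketch computes frame-wise RMS levels and smooths them with a 30 Hz lowpass filter, as described above; the frame length, hop size, and filter order are assumptions made here for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def amplitude_envelope(x, fs, frame=512, hop=256, cutoff=30.0):
    """Lowpass-filtered vector of frame-wise RMS levels."""
    rms = np.array([np.sqrt(np.mean(x[i:i + frame] ** 2))
                    for i in range(0, len(x) - frame, hop)])
    frame_rate = fs / hop                       # sampling rate of the RMS sequence
    b, a = butter(2, cutoff / (frame_rate / 2.0))
    return filtfilt(b, a, rms)

# Hypothetical plucked-string-like tone at 44.1 kHz.
fs = 44100
t = np.arange(int(0.5 * fs)) / fs
x = np.exp(-6 * t) * np.sin(2 * np.pi * 330 * t)
env = amplitude_envelope(x, fs)
print(len(env), float(env.max()))
```

The resulting envelope is what the attack, sustain, and release segmentation of Fig. 6.4 would operate on.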
(Region labels in the left-hand diagram: strong mid-frequency partials; strong
high-frequency partials.)
Fig. 6.3. Geometric interpretation of the tristimulus. The left figure shows the regions
where, depending on the energy balance, the values of T2 and T3 will be
found. The figure on the right illustrates the temporal evolution (in milliseconds)
of a clarinet note: it starts with a strong fundamental, then high-frequency partials
progressively dominate the sound, and finally, after 60 milliseconds, the high
frequencies start to decay until the end of the sound.
Fig. 6.4. Simplified amplitude envelopes of a guitar tone (above) and a violin tone
(below). Different temporal segments of the tones are indicated in the figure.
The attack time is sometimes also called the 'rise time', and its definition
varies slightly depending on the author. An often-used definition is the time
interval between the point the audio signal reaches 20% of its maximum value
and the point it reaches 80% of its maximum value [511]. Sometimes the
logarithm of the attack time, LAT = log_10(t_80 − t_20), is used instead of the raw value,
where t_20 and t_80 denote the beginning and end of the attack, respectively.
The temporal centroid measures the balancing point of the amplitude envelope
of a sound, and it is calculated as TC = Σ_n n E(n) / Σ_n E(n), where E(n) denotes
the RMS level of the sound at time n, and the summation
extends over a fixed-length segment starting at the onset of the sound.
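Using an amplitude envelope such as the one sketched earlier, the attack-time descriptor can be computed as follows. The 20%/80% thresholds follow the definition above, while the envelope and its frame rate are assumed inputs.

```python
import numpy as np

def log_attack_time(env, frame_rate):
    """Log of the time taken to rise from 20% to 80% of the envelope maximum."""
    peak = env.max()
    t20 = np.argmax(env >= 0.2 * peak) / frame_rate   # first frame above 20%
    t80 = np.argmax(env >= 0.8 * peak) / frame_rate   # first frame above 80%
    return float(np.log10(max(t80 - t20, 1.0 / frame_rate)))

# Hypothetical envelope with a fast attack and slow decay, 172 frames/s.
frame_rate = 172.0
env = np.concatenate([np.linspace(0, 1, 10), np.exp(-np.linspace(0, 5, 160))])
print(log_attack_time(env, frame_rate))
```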
The term vibrato refers to a periodic oscillation of the fundamental fre-
quency of a sound. Vibrato has proven to be quite a useful feature for instru-
ment discrimination, whereas this does not seem to be the case with tremolo^
which refers to a periodic oscillation in amplitude. Vibrato is characteristic
for string instruments, reeds, and the human singing voice. Vibrato can be
described by its rate, which is usually between 4 and 8 Hz, and its depth,
which is usually less than one semitone. Techniques for estimating the rate
and depth of vibrato are described in Chapter 12, Section 12.4.4.
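Although dedicated estimation techniques are deferred to Chapter 12, a rough illustration of how vibrato rate and depth might be measured from a frame-wise FO track is sketched below; the FO tracker itself, the Hann windowing, and the 4-8 Hz search band are assumptions of this example.

```python
import numpy as np

def vibrato_rate_and_depth(f0_track, frame_rate):
    """Rough vibrato estimate from an FO track given in Hz, one value per frame."""
    f0_track = np.asarray(f0_track, dtype=float)
    cents = 1200 * np.log2(f0_track / np.mean(f0_track))   # deviation in cents
    cents = cents - np.mean(cents)
    spectrum = np.abs(np.fft.rfft(cents * np.hanning(len(cents))))
    freqs = np.fft.rfftfreq(len(cents), d=1.0 / frame_rate)
    band = (freqs >= 4) & (freqs <= 8)                      # typical vibrato rate range
    rate = freqs[band][np.argmax(spectrum[band])]
    depth = (cents.max() - cents.min()) / 2                 # half peak-to-peak, in cents
    return rate, depth
```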
Scale Transformation
Arcsine-root [410], which consists of taking the arcsine of the square root
of the initial feature value; this transform spreads the feature values to the
tails of the distribution (that is, preferably away from its central part).
Therefore, this transform is indicated when the features are proportions
such as the ratio of band energy to the total spectral energy.
Apart from these simple transforms, the Box-Cox power transform [47] is a standard tool for increasing the Gaussianity of features, hence it is especially recommended when working with Gaussian mixture classifiers [510]. It can be computed as
$$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\[4pt] \log x & \text{if } \lambda = 0, \end{cases}$$
where λ is a parameter chosen, for example, to maximize the Gaussianity of the transformed values.
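For illustration, both transforms can be applied as follows; using scipy's maximum likelihood estimate of the Box-Cox parameter λ is a convenience choice of this sketch, not something prescribed by the text.

```python
import numpy as np
from scipy.stats import boxcox

def arcsine_root(p):
    """Arcsine of the square root; suited to proportion-valued features in [0, 1]."""
    return np.arcsin(np.sqrt(np.clip(p, 0.0, 1.0)))

# Box-Cox: (x**lam - 1)/lam for lam != 0, log(x) for lam == 0 (x must be positive).
x = np.random.lognormal(size=1000)            # skewed toy feature values
y, lam = boxcox(x)                            # lambda chosen by maximum likelihood
print(lam, y.mean(), y.std())
```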
Projection
Multidimensional Scaling
$$d(o_i, o_j) = \left\{ \sum_{l=1}^{d_x} \big[o_i(l) - o_j(l)\big]^2 \right\}^{1/2}. \tag{6.20}$$
Given these distances, we need to learn a function h that approximately maps the proximities p_ij to the distances d(o_i, o_j), and the location (the coordinates) of the objects o_i, i = 1, . . . , m, in the projection space. This is done by minimizing the stress function
$$S(h, o_1, \ldots, o_m) = \left[ \frac{\sum_{i=1}^{m} \sum_{j>i} \big(h(p_{ij}) - d(o_i, o_j)\big)^2}{\sum_{i=1}^{m} \sum_{j>i} d(o_i, o_j)^2} \right]^{1/2}. \tag{6.21}$$
The between-class scatter matrix is defined as
$$\Sigma_B = \frac{1}{m} \sum_{j=1}^{J} m_j\, (\mu_j - \mu)(\mu_j - \mu)^{\mathsf{T}}, \tag{6.22}$$
and the within-class scatter matrix as
$$\Sigma_W = \frac{1}{m} \sum_{j=1}^{J} \sum_{x \mid y = j} (x - \mu_j)(x - \mu_j)^{\mathsf{T}},$$
where μ_j and m_j denote the mean and the number of training vectors of class j, and μ the global mean.
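A minimal sketch of the quantities in (6.20) and (6.21) is given below; in practice an off-the-shelf MDS implementation would optimize the configuration, and the identity mapping used here as a default for h is only a placeholder.

```python
import numpy as np

def pairwise_distances(O):
    """Euclidean distances d(o_i, o_j), cf. (6.20); O has one object per row."""
    diff = O[:, None, :] - O[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def stress(prox, O, h=lambda p: p):
    """Kruskal-type stress of a configuration O against proximities prox, cf. (6.21)."""
    d = pairwise_distances(O)
    iu = np.triu_indices(len(O), k=1)                 # pairs with j > i
    num = ((h(prox[iu]) - d[iu]) ** 2).sum()
    den = (d[iu] ** 2).sum()
    return np.sqrt(num / den)
```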
Using very large sets of features for building an automatic classification sys-
tem is usually to be discouraged: First, some features can be redundant or
irrelevant; second, the computational cost for using many of them might be
high; and third, some features can be misleading or inconsistent regarding
the task, and consequently the classification errors may increase. In any case,
interpreting a model containing a large set of features can be very difficult or
even impossible. In general, the informal recommendation is to use ten times
fewer features than training instances^ [313]. Selecting features can be done
on a ranking basis (i.e., evaluating one feature after another) or on a best-set
basis (i.e., evaluating subsets of features in a global way) [262].
We list below three different strategies in order to find a near-optimal
number of features for a classification task [42]:
^Recent techniques such as support vector machines (see Chapter 2), however,
are less subject to dimensionality concerns. In any case, including misleading features
lowers the performance.
Embedding makes the feature selection stage intertwined with the classification algorithm, as is the case with decision trees or with discriminant analysis.
Filtering decouples feature selection from the model learning process by
first applying a feature selection over the original feature set, and then
feeding the classification algorithm with the selected features only.
Wrapping uses a feature evaluation step which is intimately connected
with the learning process: it uses the prediction performance of a given
learning algorithm to assess the relative usefulness of subsets of features.
Theoretically, this strategy should be the best one [42], but the price paid
is a high computation time (this is an NP-hard problem).
In addition to selecting the features with respect to a classification algo-
rithm, we must decide on an evaluation criterion. The information gain is the
standard criterion used to build decision trees [467], but it can also be used
to rank the importance of features, outside of the decision tree framework.
In order to characterize the amount of information carried by a feature (e.g.,
the zero crossing rate), we study its influence on the entropy of the full set
of features, via the information gain. Let x denote the vector made of several
examples of a given feature (e.g., the zero crossing rate over several frames)
and let X be the set of all the features, each of which being extracted over sev-
eral frames. The entropy H(a) of a set of random variables a with probability
density function p(a) is defined as^
$$H(\mathbf{a}) = -\int p(\mathbf{a}) \log p(\mathbf{a})\, d\mathbf{a}, \tag{6.25}$$
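The entropy in (6.25) is defined for continuous densities; purely for illustration, the sketch below uses its discrete counterpart on a crudely discretized feature to estimate the information gain of that feature with respect to the class labels. The binning scheme is an assumption of the example.

```python
import numpy as np

def entropy(labels):
    """Discrete entropy (in bits) of a label sequence."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels, n_bins=10):
    """Reduction in class entropy obtained by observing a discretized feature."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    bins = np.digitize(feature, np.histogram_bin_edges(feature, bins=n_bins))
    h_before = entropy(labels)
    h_after = sum((bins == b).mean() * entropy(labels[bins == b])
                  for b in np.unique(bins))
    return h_before - h_after
```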
where c_fc is the average feature-class correlation and c_ff is the average feature-
feature intercorrelation.^ We can interpret the numerator of (6.27) as an in-
dicator of how a set of features is representative of the class, whereas the
denominator indicates how much redundancy there is among the features.
The correlation-based feature selection technique is not a ranking method,
as it works with subsets of features. In practice, subsets are examined using
either backward search, forward search, or best-first search. Backward search
consists of starting from the full feature set and greedily removing one fea-
ture at a time as long as the evaluation does not degrade too much. Forward
search starts from an empty set of features and greedily adds one feature at
a time until no possible single feature addition results in a higher evaluation.
Best first search starts with either no features or all the features and examines
a given number of consecutive expansions or reductions of the existing subset
in order to find local improvements over the current best subset [540], [267].
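As an illustration of the greedy forward search described above, the sketch below uses cross-validated classifier accuracy as the subset evaluation, i.e., a wrapper-style criterion; the scikit-learn estimator and the stopping rule are assumptions of the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, estimator=None, cv=5):
    """Greedily add the feature that most improves cross-validated accuracy."""
    estimator = estimator or KNeighborsClassifier(n_neighbors=5)
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        score, f = max(scores)
        if score <= best_score:              # no single addition helps any more
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score
```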
In the context of classification, given a training set of features grouped
in J classes, Peeters [510] has proposed audio feature selection using inertia
ratio maximization using feature space projection (IRMFSP), which seems to
compare advantageously with other effective algorithms and has also been
used by other researchers in sound classification [416], [180]. IRMFSP selects
first the best features according to the value R[i] for feature #i (where the index i refers to a given type of feature, e.g., the zero crossing rate), defined as
$$R[i] = \frac{\sum_{j=1}^{J} \frac{n_j}{N} \big\|\mu_j[i] - \mu[i]\big\|^2}{\frac{1}{N} \sum_{n=1}^{N} \big\|x_n[i] - \mu[i]\big\|^2}, \tag{6.28}$$
where n_j is the number of training vectors in class j, N the total number of training vectors, μ_j[i] the within-class mean of feature #i, μ[i] its global mean, and x_n[i] its value for the nth training vector.
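A sketch of the ratio (6.28) for a single scalar feature is shown below; the full IRMFSP algorithm also involves a feature space projection step after each selection, as its name suggests, which is omitted here.

```python
import numpy as np

def inertia_ratio(x, y):
    """Between-class over total inertia of a scalar feature x with class labels y, cf. (6.28)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    mu, N = x.mean(), len(x)
    between = sum((np.sum(y == c) / N) * (x[y == c].mean() - mu) ** 2
                  for c in np.unique(y))
    total = ((x - mu) ** 2).mean()
    return between / total
```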
^The term 'correlation' was used by Hall in its general sense, without referring
specifically to the classical correlation [267, p. 51]. However, in his implementation
in the free software Weka [674] (www.cs.waikato.ac.nz/ml/weka), the author used
the classical variance-normalized correlation, removing, however, the mean of the
data before calculating the correlation.
Discriminant analysis (DA) includes several variants of the generic idea of de-
riving a discrimination function (i.e., one that separates two classes of objects)
that is a weighted combination of a subset of the features used to character-
ize a series of observations. As we have seen in previous sections, this idea
can also be used for selecting and projecting features by minimizing the ra-
tio of within-class scatter to the between-class scatter. A thorough formal
Fig. 6.5. An illustration of k-NN classification. The point marked with a star would be classified as belonging to category B when k = 3 (as 2 out of its 3 neighbours are from class B); but in the case of using k = 5, the classification would be A because there are 3 nearest neighbours belonging to this category and only 2 belonging to B.
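The voting rule illustrated in Fig. 6.5 can be written in a few lines; the Euclidean distance and the tie-breaking behaviour are assumptions of this sketch.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points (cf. Fig. 6.5)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```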
Decision trees are pervasively used for different machine learning and classifi-
cation tasks. One of the main reasons for their popularity may lie in the fact
that they produce a simple classification procedure which can be interpreted
and understood as a series of 'if-then' actions. Decision trees are constructed
top-down, beginning with the feature that seems to be the most informative, that is, the one that maximally reduces entropy according to the information
gain measure (6.26). For this feature, several branches are created: one for each
of its possible (discrete) values. In the case of non-discrete valued features,
a procedure for discretization of the value range must be defined (see, for
instance, [183]). The training data are assigned to a descendant node at the
bottom of the branch that corresponds to their value. This process is repeated
recursively starting from each descendant node. An in-depth treatment of de-
cision trees can be found in Mitchell [467] and Duda et al. [161]. Quinlan's ID3
and C4.5 [533] are among the most popular algorithms for building decision
trees.
Decision trees used for the classification of instrument sounds have usually
yielded worse results than other classification methods [288], [510]. On the other hand,
they have provided hints on the nature of the features and values that dis-
criminate among pitched instrument classes [319], [668]. Recent enhancements
to basic decision trees such as AdaBoost [196] or Random Forests [51] may
provide results that are competitive with other cutting-edge classification al-
gorithms.
Fig. 6.6. A diagram of an artificial neural network with 8 input units, a hidden
layer with 3 units, and an output layer with 8 units.
In this section, we review research aimed at developing systems for the auto-
matic classification of isolated sounds of pitched musical instruments. As we
will see, for the more complex tasks of recognizing instruments in solo and
duet phrases or in polyphonic music, the fact that each research team has
been using a different test database makes it unfair and unreliable to make
direct comparisons between the reported classification accuracies. The goal of
this section is, hence, to provide enough information so that the reader can understand the work done and evaluate the main outcomes of the research. We first examine flat systems, which provide instrument labels in a
single classification step. Then we will move to systems that, instead of di-
rectly deciding the instrumental class, proceed from broader categories to the
more specific ones, following a hierarchical classification scheme.
One of the influential ideas of Martin [441], [442], [443] was to use a hierarchical
procedure consisting of (1) initial discrimination between pizzicato (plucked)
and sustained sounds, (2) discrimination between different instrument families
(e.g., strings, woodwind, and brass), and (3) depending on the previous deci-
sions, final classification into specific instrument categories. Other hierarchical
systems have been developed since then by Eronen et al. [176], [174], Agostini
et al. [11], Szczuko et al. [615], and finally by Peeters et al. [515], [510], whose system is probably a fair representative of the current state of the art in instrument classification.
Table 6.3 summarizes some facts about the systems mentioned in this
section. As can be seen, the results using a hierarchical instead of a flat classification scheme have not been conclusive: some authors report moderate
improvements, whereas others report moderate deterioration in error rates.
On the other hand, what is consistent is a trend of increasing difficulty when
the categorization of an instrument goes from the most abstract to the most
specific: the pizzicato versus sustained decision is very easy, the family classi-
fication problem is a bit more difficult, and finally, the specific assignment of
instrument labels still leaves some room for improvement.
Although the use of hierarchies is conceptually appealing, it is worth noting
that errors at each level are 'carried over' in a multiplicative way. Therefore, if
there is an error at the topmost level, it will be propagated to the lower levels
of the hierarchy. One rule of thumb for trying a hierarchical system would be to
look at the between-class confusion matrices and search for a large number of
confusions between instruments from different families. In this case, provided that a very efficient family discriminator is available, these confusions could be reduced by means of a series of hierarchical decisions.
Instead of embedding some taxonomic knowledge into the classifier (i.e.
hardwiring the family and subfamily taxonomy), Agostini [10] and Kitahara
et al. [346] have approached the automatic building of instrument taxonomies
instrument B, and the decisions are then combined by voting [177]. In addition
to improving the classification accuracy under certain circumstances, the use
of pair-wise classifiers makes it possible to find some characteristic features
that are very useful in a given pair-wise discrimination but remain useless in
others.
Exploring new features and increasing their number is another trend ob-
served in this research context. The log-energy within octave sub-bands and
the logarithm of the energy ratio of adjacent sub-bands have been used as
a way to characterize the spectral energy distribution of a sound [177]. Line
spectral frequencies have also been incorporated as a way to model the reso-
nances and peaks of the power spectrum with higher precision and robustness
than linear prediction coefficients (LPC) [375], [87].
Independent subspace analysis (ISA) has been successfully used for per-
cussive sound classification (see Chapter 5) but it has only been tested for
classifying pitched sounds by Eronen [175] and by Vincent and Rodet [647].
The latter represented the short-time spectrum of musical excerpts as a non-
linear weighted sum of typical spectra plus noise, which were learned using
files from a database containing isolated notes and solo recordings. These tem-
plates were then used to determine, with a very high accuracy, the instrument
played in the solos of commercial recordings (even when they were artificially
distorted with reverberation or noise). The authors showed that their model
has some theoretical advantages over methods based on GMMs or on linear
ISA and that it worked successfully even for the classification of instruments
in duets.
A surprising observation is that none of the reviewed systems segments solo phrases into notes. The addition of a reliable onset detector would
allow the inclusion of envelope-related temporal features and the reduction
of computational load by doing the actual classification only once per note
onset. Onset detection is, however, a hard problem in the case of music signals
that do not contain percussion instruments [95].
the spectrum of the second one. Using this approach, harmonic cancellations
and erroneous enhancement of partials should be expected, depending on the
degree of overlap between the spectra of the two sounds.
Livshin and Rodet [416] approached the classification of instruments in
duets by using their real-time solo recognition system (see Table 6.4). In or-
der to detect instruments in duets, the system first estimated the two FOs
using an algorithm by Yeh and Röbel [681], and also computed their corre-
sponding harmonic partials. The fundamental frequencies were quantized to
the nearest musical note, and contiguous frames with the same value were
chunked together. Each chunk was then used twice in a phase-vocoder fil-
tering process of source reduction: in the first pass, all the harmonics of the
estimated fundamental were kept, whereas in the second pass, the sustained
note's harmonics were filtered out and the 'residual' partials were kept. Over-
lapping harmonics of the two notes were not filtered out. Finally, the partials
of the fundamental (intended to correspond to one instrument) and the resid-
ual partials (intended to correspond to the other instrument) were sent to a
classifier in order to generate their corresponding labels.
Kostek et al. [365], [362] proposed the decomposition of duet sounds based
on the modified frequency envelope distribution (FED) analysis, which was
originally described in [370]. The FED algorithm decomposes a signal into a linear expansion of sinusoids with time-varying amplitudes and phases. The
first step of the duet analysis method is the estimation of the FO of the lower-
pitched instrument. The input signal is divided into short overlapping blocks,
and FO is estimated for each block separately to deliver the FO contour. Using
the FO information and the FED algorithm, the time-varying amplitudes and
phases of the first ten harmonics of the sound are estimated and cancelled
from the signal, in order to obtain a residual where the harmonics of the
second sound are analysed. The estimated spectra of the two sounds are then
fed to a neural classifier to recognize the two instruments.
Another approach to analysing duet signals is based on a so-called missing
feature theory that was developed for speech processing and speaker identifica-
tion. The main idea consists of using only the spectro-temporal regions which
are dominated by the target sound, and ignoring those that are dominated by
background noise or interfering tones. This approach is motivated by a model
of auditory perception, proposed by Cooke et al. [99], which postulates a sim-
ilar process in listeners. In polyphonic music, partials of one instrument often
overlap with those of another one. Consequently, the observed amplitudes of
these partials no longer correspond to those of any individual instrument.
Within the missing feature approach, these corrupted or unreliable features
can be excluded from the recognition process. The remaining information is
therefore incomplete, but the hope is that it is still sufficient to enable robust
instrument classification (additionally, it is possible to partially reconstruct
the missing values by exploiting known correlations between the missing and
the reliable values). Eggink and Brown [168], [167] used sub-band energies as
features, although other features could be utilized, too.
6.7 Conclusions
In this chapter we have provided a review of the theoretical, methodological,
and practical issues involved in the automatic classification of pitched musical
instruments. Assigning instrument labels to analysis frames, sounds, or musi-
cal segments requires, first, a solid knowledge of the acoustic features of the
instruments, and of the ways we can exploit signal processing techniques to
convert them into numerical features. Additionally, a systematic methodology
comprising the collection of data sets for training and testing, the selection
and transformation of features, and the comparison of the results obtained
using different approaches, defines the right path for obtaining a robust clas-
sification system.
In the five years from our first review of the field of automatic classification of musical sounds [283] to the present moment, the number of studies has grown to more than twice the number of those published in the 1990s. Some general tendencies can be noted in these recent works: the use of larger and more varied databases, interest in unpitched percussion sounds, improvements in the methodological aspects, and an increasing concern for practical applications and for dealing with truly musical fragments.
Systems dealing with a large number of isolated sounds and pitched instrument classes achieve correct decisions nearly 70% of the time, whereas the simpler decision on the instrument family reaches slightly beyond 80% accuracy. This leaves some room for improvement that could be achieved,
among other options, by carefully looking at the discriminative acoustic and
perceptual features that each class of sounds may have and then devising
feature extractors that capture them properly.
On the other hand, systems dealing with the classification of instruments in musical excerpts are the current hot spot of the field, even though the achievements are still quite modest. For the classification of solo phrases, the achiev-
able performance can be a bit better than that for isolated sounds, but when
duets or more complex combinations are considered, the performance drops
substantially. In those cases, systems become more complex as they rely on
multiple-FO estimation and on incomplete or noisy estimation of spectral and
temporal information. The current approaches try to avoid 'hard' source sep-
aration and exploit contextual or musical knowledge. This makes the problem
more manageable, even though the provided solutions have still been limited
in terms of sound combinations or musical styles.
^See also Chapter 5, p. 137, where the idea of recognizing combinations of sounds
directly is discussed from the viewpoint of percussion transcription.
In this review we have identified several open issues that could provide interesting returns when properly addressed: (1) The need for a reference test collection, containing enough variability in instruments, recording conditions, and performers to be considered an unbiased sample of the real population of instrument sounds, and ensuring that any proposed system can be fairly compared with alternative proposals. Fortunately, RWC is currently a serious candidate that should gain wider acceptance among research groups. (2) The
need to develop better features and instrument-specific features. (3) The need
to investigate possibilities to embed some general knowledge about the task
into the classification system, such as the usual frequency ranges of the instru-
ments, voice leading rules etc., but also very specific knowledge, for example
by crafting ensembles of specialized classifiers. (4) The need to evaluate the ro-
bustness of a system under reverberant, noisy, or other distortion conditions.
(5) Incorporating instruments outside the typical orchestral ones: singing voice
and some electrophones, for instance, the electric guitar, would deserve spe-
cific studies by themselves, given their broad timbral registers. (6) The need
to develop systems dealing with realistic polyphonic music signals.
The automatic classification of pitched sounds of musical instruments has progressed a lot in the past five years. Even though real-time [199], [415] and commercial systems for instrument sound classification have been devised,^^ we expect to find more of them soon, in connection with applied problems posed by personal digital music players. This means that now is the time to ex-
ploit the knowledge we have gained working with isolated sounds, in order to
address the identification of instruments played in polyphonic music. This is,
without doubt, the challenge for the forthcoming years.
Acknowledgements
The writing of this chapter was partially supported by the EU project SIMAC
(Semantic Interaction with Music Audio Contents) EU-FP6-IST-507142. The
first author wishes to acknowledge the input, feedback, and help received
during the preparation of the manuscript from Eduard Aylon, Emilia Gomez,
Fabien Gouyon, Enric Guaus, Eulalia Montalvo, Bee Suan Ong, and Sebastian
Streich.
^^http://www.musclefish.com, http://www.soundfisher.com,
http://cuidadosp.ircam.fr, http://www.audioclas.net,
Part III
7
Multiple FO Estimation Based on Generative Models
Manuel Davy
Western tonal music is highly structured, both along the time axis and along
the frequency axis. The time structure is described in other chapters of this
book (see Chapter 4), and it may be exploited to build efficient beat trackers,
for example. The frequency structure is also quite strong in tonal music. It
has been shown since Helmholtz (and probably before) that an individual
note is composed of one fundamental and several overtone partials [451], [193].
Though acoustic waveforms may vary from one musical instrument to another,
and even from one performance to another with the same instrument, they
can be modelled accurately using a unique mathematical model, with different
parameters.
In addition to a mathematical model that describes the waveform gener-
ation, the frequency structure of music can be used to derive priors over the
model parameter values. Here, we understand frequency structure in terms
of fundamental and partials structure, pitch/FO structure, etc. For example,
assume the instrument playing is a piano; then the note frequencies cannot
be just any frequencies; they have to match the piano key frequencies. Also,
the piano overtone partial frequencies are slightly inharmonic (that is, they
are not integer multiples of the fundamental partial frequency), and their fre-
quencies are described by a specific model [451], [193] which can be used to
build parameter priors. More generally, the structure of tonal music may be
exploited to build a Bayesian model, that is, a mathematical model embedded into a probabilistic framework, which leads to the simplest model that explains a given waveform (see Chapter 2 for an introduction).
The Bayesian setting is quite natural for this problem as it enables
the use of many heuristics within a rigorous framework. Moreover, acoustic
waveform models generally have many parameters, which cannot be accu-
rately estimated without regularizing assumptions, such as parameter priors.
Bayesian models for multiple FO estimation have received, however, relatively
little attention. A possible cause is that such models are complex, and this
makes their use difficult, though achievable, when confronted with real data.
Sometimes they are also computationally heavy. However, such models enable
much more than multiple FO tracking. They do model the acoustic waveform:
the parameters which are estimated from real musical records may be used
for multiple FO estimation, but also for monaural source separation, sound
compression, pitch correction, etc.
In this chapter, we present several approaches to multiple FO estimation
that rely on a generative model of the acoustic waveform. More precisely,
we present a noisy sum-of-sines model which has been studied by many
authors in various contexts. This model and some of its variants are pre-
sented in Section 7.1. Section 7.2 introduces a Bayesian off-line processing
method which requires notewise processing (that is, processing is performed
on a complete waveform section which does not include note changes). In Sec-
tion 7.3, we present the on-line processing model of Cemgil et al. [78], Dubois
and Davy [158], and Vincent and Plumbley [646]. Section 7.4 is devoted to
other on-line multiple FO tracking algorithms that rely on incomplete or in-
direct acoustic waveform modelling: for example, the approach of Thornburg
et al. [625] and Sterian et al. [609] models the time evolution of time-frequency
energy peaks. Dubois and Davy [159] model the signal spectrogram (which
comes down to estimating on-line the acoustic waveform up to the initial phase
parameter, though). Section 7.5 presents some conclusions.
In this section, we first present some simple models which were developed for
single FO acoustic signals. The earliest noisy sum-of-sines acoustic waveform
models were developed for speech synthesis; see e.g. [449]. These models did
not assume, however, frequency relations between the fundamental partial and
overtone partials. Laroche et al. introduce a harmonic plus noise model [394]
which assumes such relations. These models were soon used for music process-
ing; see [575] for a review of early methods.
The frequency structure of tonal music acoustic waveforms has been observed
for many years. As can be seen in Fig. 7.1, these waveforms are almost periodic
and their Fourier transforms reduce to (approximately) sums of sine waves
whose frequencies are multiples of a given frequency. For a perfectly periodic
(infinite length) signal x, with discrete time n = 1, 2 , . . . ,
$$x(n) \approx \sum_{m=1}^{M} a^{\mathrm{s}}_m \sin(2\pi m k_1 n) + a^{\mathrm{c}}_m \cos(2\pi m k_1 n), \tag{7.1}$$
where k_1 is the frequency of the fundamental partial and the components with m ≥ 2 are called overtone partials.^ The model in (7.1) is quite simple, but it is rather theo-
retical. Real signals always include components which cannot be modelled as
individual sines or cosines: for example, a flute player breathing can be heard
in recorded signals, and this is highly non-periodic [219]. As such components
are quite different from one occurrence to another, they can be jointly modelled
in terms of their statistical distribution as a noise component e, yielding the
model
$$x(n) = \sum_{m=1}^{M} a^{\mathrm{s}}_m \sin(2\pi m k_1 n) + a^{\mathrm{c}}_m \cos(2\pi m k_1 n) + e(n). \tag{7.2}$$
Fig. 7.1. Flute acoustic waveform (top) together with its spectrogram (bottom).
Two noise statistical models have received some attention. The simplest
assumes e(n) to be a white noise with Gaussian distribution (see for exam-
ple [394], [616], [124]). This is the less informative assumption, as in this case,
^As pointed out in Chapter 1, the frequency of the fundamental partial, denoted
here by k_1, is different from the fundamental frequency FO, which is the inverse of
the acoustic waveform period.
e(n) is a purely random sequence with a flat power spectrum. Another popu-
lar model takes the form of an autoregressive process [576], [575], [292], [609],
[122] (see Chapter 2 for a presentation of autoregressive models), which also
corresponds to random sequences, but with non-flat power spectrum. A review
of possible choices for e may be found in [310], [308].
In addition to non-harmonic components, acoustic waveforms produced
by real musical instruments also have another important characteristic: they
are not strictly periodic. This is explained by two phenomena: inharmonicity
(or partial de-tuning) and partials amplitude nonstationarity. Inharmonicity
appears whenever the frequency of the partial with harmonic number m is not
exactly mki. For the example in Fig. 7.2, where several periods of the acoustic
waveform produced by a piano, a flute, and a clarinet are superimposed, the
non-periodicity appears clearly. This may be caused by amplitude decay (for
all three examples), but also by inharmonicity (piano example). Note that in
Fig. 7.2, two of the three periods plotted are contiguous, whereas one is taken
further apart. A more general model enabling inharmonicity is
$$x(n) = \sum_{m=1}^{M} a^{\mathrm{s}}_m \sin(2\pi k_m n) + a^{\mathrm{c}}_m \cos(2\pi k_m n) + e(n), \tag{7.3}$$
Piano (FO 262 Hz) Flute (FO 490 Hz) Clarinet (FO 135 Hz)
Fig. 7.2. Three superimposed periods of the acoustic waveforms played by a piano,
a flute, and a clarinet. The waveforms are not strictly periodic and the three periods
represented (in solid, dashed, and dashed-dotted lines) are not exactly superimposed.
The time scale is in milliseconds.
Aside from partial frequency models, it may be useful to model the partial-
to-partial amplitude profile (referred to as the spectrum envelope in the follow-
ing). The power spectrum of a note is formed by the instrument body response
which modulates the partial frequency peaks. Of course, this modulation de-
pends on the instrument and it may be characterized by a spectrum envelope
(see the saxophone family example [193, p. 497] ) which may be modelled.
Spectrum envelope models need to be quite flexible, though, because some
instruments have special behavior. For example, clarinets have almost zero
amplitude for every other low-frequency partial (partials with even harmonic numbers m); see Fig. 6.2, p. 166. Godsill and Davy [213] propose a statistical model where the amplitudes are assumed to be approximately constant below some cut-off frequency k_cutoff, and to decay exponentially for higher frequencies. The parameters defining the model (k_cutoff and the exponential decay rate) are to be estimated from the processed acoustic waveform. Cemgil [78] uses an exponentially decaying envelope, where the amplitude a^s_m of partial m decays geometrically with the harmonic number m (and similarly for a^c_m). Alternative models may be proposed, based on the spectral smoothness principle; see Klapuri [351].
The models presented above permit quite good modelling of very short signal
portions, insofar as the amplitudes and frequencies do not vary too much over
time. As explained in Chapter 4, however, the amplitudes do vary quickly
enough so that the above models cannot be used to process musical segments
longer than 20 to 50 ms. A more general non-stationary model is
$$x(n) = \sum_{m=1}^{M} a^{\mathrm{s}}_m(n) \sin\big(2\pi k_m(n)\, n\big) + a^{\mathrm{c}}_m(n) \cos\big(2\pi k_m(n)\, n\big) + e(n), \tag{7.5}$$
where the amplitudes a^s_m(n) and a^c_m(n), the frequencies k_m(n) for m = 1, . . . , M, and the noise statistics now depend on time. This model is
quite flexible, but it is no longer a sine-plus-noise model: to understand this, assume k_m(n) is an independent sequence of random frequencies with M = 1;
then the wave generated is not a sine wave at all! This shows that the model
in (7.5) should be constrained so as to be suited to tonal music. This can be
done in many ways, but the simplest is certainly to assume a smooth time evo-
lution of the amplitudes and frequencies, and assume a statistically (almost)
stationary noise.
Many amplitude and frequency evolution models may be found. The am-
plitude evolution model should mimic the way notes appear and disappear
in music (onset and decay), whereas the frequency models should adapt to
stationary cases (for instruments such as the piano where the performer has
quite limited influence on the note frequencies evolution) or vibrato (e.g., for
violins). Many such models may be found in the music synthesis literature,
and they are generally quite instrument specific; see e.g. [543], [628]. These
models may be used for musical signal analysis, in particular when the analy-
sis conditions are well controlled. Here, we present more general models which
can adapt to different instruments and different kinds of music.
As pointed out above, amplitude evolution models need to allow quick enough
variations in order to fit note onset and decay. However, amplitudes should not
vary too quickly: to make this point clearer, assume the acoustic waveform is
a sine wave and let M = 1 and k_1 = 0. Then, the only way for the model to fit
the data is for the amplitude itself to be a sine wave. This illustrates that, given
an acoustic waveform, there is not a unique frequency/amplitude parameter
set for the model in (7.5); rather, there are many. It is thus important to
prevent the time-varying amplitudes from fitting the sine waves. This can
be easily done by selecting amplitude evolution models that do not permit
oscillations with frequencies over some 10 Hz, for example.
A relevant model is that of damped amplitudes, where the amplitude evolves according to a decreasing exponential; that is (where we drop the partial index m for notation clarity),
$$a^{\mathrm{s}}(n) = \lambda^{n}\, a^{\mathrm{s}}_0, \qquad a^{\mathrm{c}}(n) = \lambda^{n}\, a^{\mathrm{c}}_0, \tag{7.6}$$
where the damping factor λ tunes the amplitude decay rate, and a^s_0, a^c_0 are fixed initial amplitudes. When substituted into (7.5), this yields a damped sinusoids model, as used by Hilands and Thomopoulos [292] for multiple si-
nusoids frequency estimation or by Cemgil et al. [78] for music transcription;
see Section 7.3 below. Note that such an amplitude time evolution may be
coupled with a spectrum envelope as described in Section 7.1.1.
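As a small illustration of the damped-amplitude model, the sketch below synthesizes a single partial whose amplitude decays as λ^n; the sampling rate and parameter values are arbitrary choices, not values from the text.

```python
import numpy as np

def damped_partial(k, lam, a_sin, a_cos, n_samples):
    """One partial with normalized frequency k and exponentially decaying amplitude."""
    n = np.arange(n_samples)
    env = lam ** n                               # damping factor, 0 < lam < 1
    return env * (a_sin * np.sin(2 * np.pi * k * n) + a_cos * np.cos(2 * np.pi * k * n))

# A 220-Hz partial at 44.1 kHz with a gentle decay
x = damped_partial(k=220 / 44100, lam=0.9998, a_sin=1.0, a_cos=0.0, n_samples=44100)
```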
Another amplitude evolution model consists of assuming a random walk
or an autoregressive process, as proposed in [106] for chirp signals. This re-
duces the amplitude parameters to the set of AR coefficients, which may be
small. These coefficients should be chosen, however, so as to ensure a smooth
amplitude.^
The last model presented here writes the time-varying amplitude as a sum of weighted smooth, time-localized functions with time-domain shape φ(n):
$$a^{\mathrm{s}}(n) = \sum_{i} a^{\mathrm{s}}_i\, \phi(n - i\Delta_n), \qquad a^{\mathrm{c}}(n) = \sum_{i} a^{\mathrm{c}}_i\, \phi(n - i\Delta_n), \tag{7.7}$$
where a^s_i, a^c_i are the amplitudes (also called weights) associated to each time-localized function φ(n - iΔ_n). The step Δ_n sets the spacing between two such successive functions. The shape φ is typically chosen so as to obtain a smooth amplitude profile; in general it is one of the standard 'sliding windows':
^This can be obtained by choosing the AR coefficients whose corresponding char-
acteristic polynomial has zeros with relatively small amplitude [339].
Gaussian, Hamming, Hanning, etc. Here also, the amplitude evolution from
one frame to another may follow a random walk [159] or may be unrelated
a priori [616]. An important remark is that the resulting time-varying sum-
of-sines model is closely related to a Gabor representation, where a signal is
decomposed into windowed sine/cosine waves whose time-frequency locations
are determined by a regular lattice [184]. Here, the lattice is regular along
the time axis and irregular along the frequency axis. Gabor-style amplitude
models were used by Davy and Godsill [122] (irregular lattice) and Wolfe
et al. [676] (regular lattice).
A simple frequency evolution model is, similarly, a Gaussian random walk,
$$k_m(n) = k_m(n-1) + v_m(n), \tag{7.8}$$
where v_m(n) is a Gaussian white random noise with some fixed variance. In order to be valid, though, the noises v_m(n), m = 1, . . . , M have to be correlated so that the frequencies of related partials follow similar evolutions.
In the case of abrupt frequency changes, the models presented here are
out of their validity domain. However, a standard assumption is that no note
changes occur within the acoustic waveform segments processed.
When J notes are present simultaneously, the single-note model generalizes to
$$x(n) = \sum_{j=1}^{J} \sum_{m=1}^{M_j} a^{\mathrm{s}}_{j,m}(n) \sin\big(2\pi k_{j,m}(n)\, n\big) + a^{\mathrm{c}}_{j,m}(n) \cos\big(2\pi k_{j,m}(n)\, n\big) + e(n), \tag{7.9}$$
where the amplitudes and frequencies are modelled as in the single FO mod-
els discussed above. In specific contexts, it may also be useful to model links
between the frequencies and amplitudes of different notes. For example, the
sound of an electromechanical organ may be modulated by rotating loud-
speakers; the strings of an electric guitar may be jointly tightened by a moving
bridge. Such links may be either deterministic or probabilistic.
7.2.1 Likelihood
Using these notations, the likelihood p(x | ψ) is written as in (7.15). The posterior distribution of the noise variance σ², given J, M, k, and x, is an inverse gamma distribution (7.17), and the prior over the note parameters factorizes as
$$p(J, M, \mathbf{k} \mid \psi) = p(\mathbf{k} \mid J, M, \psi)\, p(J, M \mid \psi). \tag{7.20}$$
Similar to the amplitude prior, the frequencies prior may have various shapes,
depending on the level of prior information available. A very general prior is
$$p(\mathbf{k} \mid J, M, \psi) = \prod_{j=1}^{J} \Big[ p(k_{j,1} \mid M_j, \psi) \prod_{m=2}^{M_j} p(k_{j,m} \mid k_{j,1}, \psi) \Big], \tag{7.21}$$
Fig. 7.3. Prior distribution of the fundamental frequencies p(k_{j,1} | M_j, ψ) in a 'key instruments' model. The spread of the Gaussian function increases with frequency, i.e., with the note label in {A1, A1#, B1, . . . }. This is aimed at limiting areas where the prior is zero, in order to make the numerical estimation easier.
where σ_k² is a small variance parameter (which can be chosen equal for all partials and all notes). In (7.22), the indicator function I (see p. 28) restricts k_{j,m} to a limited range to avoid overtone partial frequencies switching.
The last prior term to be defined, p(J, Mlt/?), is also the most critical.
This term should be strong enough to avoid models with too many notes and
too many partials. However, it should not overly penalize models with many
notes and partials, because otherwise some important partial/note components may be missed.
Overall, there are two standard choices for p(J, M | ψ) (others may be designed, though). The first consists of using the hierarchical structure p(J, M | ψ) = p(J | ψ) ∏_{j=1}^{J} p(M_j | ψ), where each term p(M_j | ψ) is a Poisson distribution with parameter Λ_j,
$$p(M_j \mid \psi) = \mathcal{P}(M_j; \Lambda_j) = e^{-\Lambda_j}\, \frac{\Lambda_j^{M_j}}{M_j!}. \tag{7.23}$$
The computation of estimates from the posterior p(J, M, a, k, σ² | x, ψ) re-
quires the 'exploration' of this multidimensional probability distribution. Sev-
eral works address similar problems. Andrieu and Doucet [21] propose a
Markov chain Monte Carlo (MCMC) algorithm for a simple noisy sum-of-
sines Bayesian model. Walmsley et al. [657], [658], [656], derive an MCMC
algorithm for single FO Bayesian models. Finally, Davy and Godsill [122],
[126], [124] propose various MCMC approaches for multiple FO models, with
a fast implementation [127], [124].
As explained in Chapter 2, the aim of MCMC algorithms is to produce
a chain of samples J^(i), M^(i), a^(i), k^(i), σ^2(i), ψ^(i) for i = 1, 2, 3, . . . . These
samples are used to estimate the various quantities described in Section 7.2.3
above. The derivation of such algorithms being quite lengthy, the interested
reader may refer to the publications cited for details. We present below (Al-
gorithm 7.1) the general structure of an MCMC algorithm dedicated to a
multiple FO noisy sum-of-sines model, with a two-level hierarchical structure,
from the individual partial level up to the multiple-note level.
Initialization.
Step 7.1.1 Initialize the parameters J, M, a, k, σ², and ψ.
- Sample ψ^(0) from its prior distribution.
- Sample J^(0) according to some initial distribution q_init(J).
- For j = 1, . . . , J^(0), sample M_j according to its Poisson prior distribution.
- Sample k^(0) according to q_init(k|x), where q_init(k|x) is the probability distribution proportional to the Fourier spectrum of x (see [21] for a similar implementation).
- Sample the noise variance parameter σ^2(0) according to its posterior distribution p(σ²|J^(0), M^(0), k^(0), x, ψ^(0)) given in (7.17).
- Sample the amplitudes a^(0) according to their posterior distribution p(a|J^(0), M^(0), k^(0), σ^2(0), x, ψ^(0)), given the other parameters.
For i = 1, 2, . . . , N, do
Step 7.1.2 Sample the note parameters J^(i), M^(i), k^(i).
- With probability μ_J, try to add a new note using a note birth move, which consists of generating a set of note parameters (number of partials, frequencies, amplitudes), and testing it using a Metropolis-Hastings reversible jump.
- Otherwise, with probability ν_J, try to remove a note using a note death move, which consists of selecting one of the existing notes at iteration i - 1 and testing its possible removal using a Metropolis-Hastings reversible jump.
- Otherwise, with probability 1 - μ_J - ν_J, try the note update move as follows:
  Set J^(i) ← J^(i-1).
  For j = 1, . . . , J^(i), update the parameters of note #j by possibly changing the number of partials or their frequencies and amplitudes, yielding M_j^(i), k_j^(i), and a_j^(i).
Step 7.1.3 Sample σ^2(i) from p(σ²|J^(i), M^(i), k^(i), x); see (7.17).
Step 7.1.4 Sample ψ^(i) from its posterior distribution (either directly, or using a Metropolis-Hastings test).
- Set i ← i + 1.
Algorithm 7.1 deserves several comments. First, the hierarchical levels ap-
pear clearly. The note death/birth/update moves correspond to the highest
level (that of notes). The middle level corresponds to adding/removing partials
inside a given note, or changing its fundamental partial frequency. The lowest
level is that of individual partials, whose frequencies and amplitudes may be
updated. Second, the Metropolis-Hastings moves as well as the initialization
require proposal distributions q(-)- In order for the distribution of the samples
j(^)^ M^^), a^*\ k^^\ 52(^)^ -0() to converge quickly to the posterior distribu-
tion, these proposals may be built on heuristics. For example, the spectrum of
x is used in order to build the frequencies proposal distribution. Other similar
heuristics may be included in the same way; see [124]. Overall, such MCMC
algorithms may be seen as a rigorous way to use various heuristics for multiple
FO estimation.
7.2.5 Performance
Since the above class of Bayesian models is based on a generative model, it can be used for many tasks, including multiple FO estimation and signal compression. The performance estimation thus depends on the task assigned to the algorithm. For the full inference problem reported in Davy et al. [124], the computation takes about 1.35 seconds per MCMC iteration for 0.5-second excerpts, where 800 iterations are necessary on average.
In terms of FO estimation, Davy et al. [124] report 100% accuracy for
single FO (that is, for J = 1), about 85% for J = 2, 75% accuracy for J = 3,
and 71% accuracy for J = 4. The test samples were random mixes of single
FO acoustic waveforms from Western classical music instruments. In terms
of residual energy and reconstruction accuracy, all experiments showed very
good reconstruction,^ in spite of some octave errors.
"^It is much more difficult to evaluate the reconstruction accuracy with a psychoa-
coustically relevant measure. The residual total energy may yield such a performance
estimator, though.
7.2.6 Extensions
The above models may be extended in many ways. For example, it is possible
to use binaural (stereophonic) records in order to make the estimation more
robust [120]. This is implemented by defining two models, one for the left
channel and one for the right channel. These models are connected through
their parameters prior. This can be extended to multi-track recordings, as in
standard source separation approaches; see Chapter 9.
Another possible extension comes from a sequential view of music process-
ing. The off-line approach presented here concerns music waveform segments
which come from a longer excerpt, though they could be applied framewise.
It is possible to use the parameter estimation results of previous segments to
design the prior for the current waveform segment, in terms of instruments
playing, A4 frequency tuning, etc.
Cemgil et al. [78] propose to write a sine/cosine wave with frequency k and
unity amplitude as follows:
where λ is the damping factor that tunes the amplitude decay rate, with 0 < λ(n) < 1. The vector [1 0] is used to project the vector θ(n) onto one axis; see Fig. 7.4. The initial phase is tuned by the initial vector θ(0) at time 0. In order to model the note partials, and their frequency/amplitude relations, the following block-diagonal matrix is used:
Fig. 7.4. The model based on a rotation matrix B(k) rotates the vector θ(n) by 2πk radians counter-clockwise, and applies a damping factor. When projected onto one axis (here, the sine axis), this produces a damped sine wave with frequency k (adapted from [78]).
$$A(\mathbf{k}) = \begin{bmatrix} \lambda_1(n) B(k_1) & 0 & \cdots & 0 \\ 0 & \lambda_2(n) B(k_2) & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_M(n) B(k_M) \end{bmatrix}. \tag{7.31}$$
Cemgil et al. assume harmonic frequency relationships of the partials and an exponentially decaying spectrum envelope from one partial to another. In other words, k_m = m k_1 and λ_m(n) = [λ(n)]^m for m = 1, . . . , M, where M is assumed to be known.
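The rotation-matrix formulation can be made concrete with a few lines of code: a state vector is rotated by 2πk radians and damped at every step, and its projection onto one axis yields a damped sinusoid, as in Fig. 7.4. This is only an illustrative sketch of a single component; the parameter values are arbitrary and the full model above combines M such components through A(k) and adds noise.

```python
import numpy as np

def rotation_matrix(k):
    """2-D rotation by 2*pi*k radians (k is a normalized frequency)."""
    c, s = np.cos(2 * np.pi * k), np.sin(2 * np.pi * k)
    return np.array([[c, -s], [s, c]])

def damped_oscillator(k, lam, theta0, n_samples):
    """Iterate theta(n) = lam * B(k) * theta(n-1) and project onto the first axis."""
    B = rotation_matrix(k)
    theta = np.asarray(theta0, dtype=float)
    out = np.empty(n_samples)
    for n in range(n_samples):
        theta = lam * B @ theta
        out[n] = theta[0]                    # projection [1 0] * theta(n)
    return out

x = damped_oscillator(k=440 / 44100, lam=0.999, theta0=[0.0, 1.0], n_samples=4410)
```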
From this sequential setting, Cemgil et al. derive a probabilistic model.
First, it is assumed that the note frequencies belong to a frequency grid with
K nodes. The note fundamental partial frequencies are assumed to be one of
the K predefined frequency positions among, for example, note frequencies of
the tempered scale over several octaves. Second, each of the grid frequencies is
assumed to be in one of the two states 'mute' and 'sound' at each time instant.
In the following, we denote by e_k(n) the state of the frequency grid point with index k at time n, and we have either e_k(n) = 'sound' or e_k(n) = 'mute'.
Coming back to the model in (7.29)-(7.31), the 'sound' case corresponds to a damping parameter, denoted λ^sound, and the 'mute' case corresponds to another damping factor λ^mute, with λ^sound > λ^mute. From these damping factors, Cemgil et al. define two matrices A(k, e_k(n)) for e_k(n) = 'sound' and e_k(n) = 'mute' by replacing λ(n) in (7.31) with either λ^mute or λ^sound.
probabilistic model is written for each of the K note frequencies in the grid.
For n = 1 , . . . , T and for a frequency k in the grid,
$$e_k(n) \sim p\big(e_k(n) \mid e_k(n-1)\big). \tag{7.32}$$
Moreover, if no onset occurs, that is, if e_k(n) = e_k(n-1), or if e_k(n) = 'mute' whereas e_k(n-1) = 'sound', then the damped sum-of-sines model with damping factor λ^sound or λ^mute is active:
$$\theta_k(n) \sim \mathcal{N}\big(\theta_k(n);\, A(k, e_k(n-1))\,\theta_k(n-1),\, \Sigma_{\text{no onset}}\big), \tag{7.33}$$
or, if an onset occurs, that is, if e_k(n) = 'sound' whereas e_k(n-1) = 'mute', then the amplitude of the sine wave at this frequency is re-initialized by sampling a new initial vector θ_k(n) as follows:
$$\theta_k(n) \sim \mathcal{N}\big(\theta_k(n);\, \mathbf{0}_{2M},\, \Sigma_{\text{onset}}\big). \tag{7.34}$$
The acoustic waveform is modelled overall as the sum over all the frequencies in the grid as
$$x(n) = \sum_{k} C\, \theta_k(n) + e(n), \tag{7.35}$$
where e(n) is a Gaussian white noise with variance σ². Equation (7.35) yields the model likelihood at each time n (also termed the observation probability density).
Cemgil et al. propose to implement MAP estimation, that is, to estimate the sequence e_k = [e_k(1), . . . , e_k(T)] for each frequency k in the grid, where e_K is called a piano roll, as explained in Chapter 1. Note that the piano roll contains all the necessary information for multiple FO estimation. Let us denote by e_K,MAP the estimated piano-roll sequence. It is computed by maximizing the posterior p(e_K | x), see (7.36), where the hyperparameter vector ψ includes σ², Σ_onset, Σ_no onset, λ^sound, and λ^mute. The posterior p(e_K | x) is obtained by integrating out the parameters θ_k(n) for all k and all n = 1, . . . , T, denoted by the shorthand θ_K(1 : T), as follows:
$$p(e_K, \psi \mid \mathbf{x}) = \int p\big(e_K, \theta_K(1\!:\!T), \psi \mid \mathbf{x}\big)\, d\theta_K(1\!:\!T), \tag{7.37}$$
where the full parameter posterior is given by Bayes's rule
$$p\big(e_K, \theta_K(1\!:\!T) \mid \psi\big) = \prod_{k} \Big[ p\big(e_k(1), \theta_k(1)\big) \prod_{n=2}^{T} p\big(e_k(n) \mid e_k(n-1), \psi\big)\, p\big(\theta_k(n) \mid \theta_k(n-1), \psi\big) \Big]. \tag{7.40}$$
In (7.40) above, the transition pdf p(θ_k(n) | θ_k(n-1), ψ) may correspond to either the 'onset' pdf given in (7.34) or the 'no onset' pdf, (7.33). The initial pdf of the states and state parameters is given by p(e_k(1), θ_k(1)) for all k in the frequency grid.
The maximization in (7.36) is made complicated by the 'nuisance' parameters θ_K(1 : T) and ψ, which need to be integrated out. Actually, the piano roll contains all the information for multiple FO estimation, though the hyperparameters ψ may contain information useful for, e.g., instrument classification.
where the statistics of the noise e_n(i) are defined framewise. The model in
(7.41) is local in the sense that its parameters have a local interpretation.
In order to fully define the on-line model from a Bayesian viewpoint, the
sequential prior has to be designed.
Dubois and Davy [158], [160] propose to use Gaussian random walks for both
the frequencies and the amplitude,
p(k(1 : n), a(1 : n), J(1 : n), ψ(1 : n) | x(1 : n)),
^The probabilities in (7.44) are indicative, and may be adjusted at will, in particular when J(n) reaches its minimum/maximum value.
^The reader interested in particle filtering may refer to Chapter 2 for a short
introduction, and to [152] for a full survey.
successive values, and thus they are duplicated in order to better explore the
parameter space around the most probable values.
Initialization
Step 7.2.1 Initialize the particles at time n = 1.
- For particles j = 1, . . . , N, sample the initial number of notes J^(j)(1), the initial frequency vector k^(j)(1), and the partial amplitudes a^(j)(1), using some probability density possibly derived from a deterministic estimation algorithm applied to the initial frame s_1. Sample the initial hyperparameter ψ^(j)(1).
Iterations, for n = 1, 2, . . .
Step 7.2.2 The particles are updated.
- For particles j = 1, . . . , N, sample the number of notes using the proposal distribution q_J as follows: J^(j)(n) ∼ q_J(J^(j)(n) | J^(j)(n-1), ψ^(j)(n-1), x(1 : n)).
- For j = 1, . . . , N, sample the frequencies using the proposal distribution q_k as follows: k^(j)(n) ∼ q_k(k^(j)(n) | k^(j)(n-1), J^(j)(n), ψ^(j)(n-1), x(1 : n)).
- For particles j = 1, . . . , N, update the amplitude vector a^(j)(n) using a Kalman filter applied to the amplitude evolution model in (7.43) and the likelihood defined in (7.41), where the other parameters are set to J^(j)(n), k^(j)(n), and ψ^(j)(n-1).
- For j = 1, . . . , N, update the hyperparameter vector using the proposal distribution q_ψ as follows: ψ^(j)(n) ∼ q_ψ(ψ^(j)(n) | ψ^(j)(n-1), a^(j)(n), k^(j)(n), x(1 : n)).
The approach proposed by Vincent and Plumbley [646] uses a model equiva-
lent to that of (7.41), but written in a different form: a cosine and an initial
phase component are used instead of a sine and a cosine, namely
$$s_n(i) = \sum_{j=1}^{J(n)} \sum_{m=1}^{M_j(n)} a_{j,m}(n)\, w(i)\, \cos\big(2\pi k_{j,m}(n)\, i + \phi_{j,m}(n)\big) + e(i), \tag{7.45}$$
where harmonicity is assumed in each frame, i.e., the partial frequencies are k_{j,m}(n) = m k_{j,1}(n). Similar to the approach by Cemgil et al., it is assumed
that the fundamental frequency belongs to a fixed grid; here, the MIDI semi-
tone scale. The noise probabilistic model is defined in the frequency domain:
roughly, it is Gaussian where the variance in a given auditory frequency band
is proportional to the loudness of s_n(i) in that frequency band; see [646].
The parameter estimation procedure is made of two steps, using the
Bayesian setting. First, parameter priors are defined in each frame, indepen-
dently of neighbouring frames. Unknown parameters are estimated framewise
using MAP. Then in the second step, the parameters in different frames are
linked together using sequential priors, and they are re-estimated.
The framewise local priors are defined as follows. Assume k̄_{j,1}(n) is the frequency in the grid corresponding to the fundamental partial of note j in the current frame. The true fundamental partial frequency k_{j,1}(n) is assumed to be close to k̄_{j,1}(n), namely,
grid frequencies in the 'sound' state. The other parameters (amplitudes and the γ_j factor) have been integrated out using the Laplace approximation method; see [646]. The actual estimation procedure is based on a greedy, local search for the states e_k(n), and a gradient-style optimization for the frequencies k_{j,1}.
The note parameters being estimated independently in each frame, they may now be connected across the frames. Vincent and Plumbley define a sequential prior p(e_k(n) | e_k(n-1)), and the Viterbi algorithm is used to estimate the sequence e_K,MAP(1 : T) from the individual framewise estimates e_K,MAP(n). In other words, a grid frequency can be in the state 'sound' in e_K,MAP(1 : T) at time n only if it is already in that state in the local estimate e_K,MAP(n). The final step consists of re-estimating the actual frequencies and amplitudes from the posterior p(k_{j,1}(1 : T), a_{j,m}(1 : T), γ_j(1 : T) | x(1 : T), e_K,MAP(1 : T)), that is, given the optimal sequence e_K,MAP(1 : T), where the priors in (7.46) and (7.47) and the prior over γ_j(n) have been redefined as log-Gaussian random walks.
We present four approaches proposed for general audio processing. All of them
follow harmonic trajectories in the acoustic waveform spectrogram.
The method of Yeh and Röbel [681] applies framewise a model similar to that
of (7.9). This model assumes parallel evolution of the partial amplitudes,
together with the spectral smoothness principle of Klapuri [351] and the
inharmonicity model of Davy and Godsill [122]. The parameter estimation
algorithm is computationally simpler than MCMC, though it is also based on
the generation of candidate notes. Each candidate is evaluated using a score
function, and the best candidates are kept in the final list. This method is
applied framewise.
Dubois and Davy [159], [158] extend their method presented in the previous
section to the case where the model is written in the spectrogram domain.
More precisely, the model in (7.41) is changed into
$$\big|\mathrm{DFT}_{s_n w}(l)\big|^2 = \sum_{j=1}^{J(n)} \sum_{m=1}^{M_j(n)} \cdots + e_n(l), \tag{7.49}$$
where the noise e_n(l) is assumed to be zero-mean white Gaussian. The co-
sine term, which was previously used to represent the initial phase, would
be redundant as we consider the power spectrum, and it has been omitted.
The sequential priors are defined as in (7.42)-(7.44). This model is no longer
linear in the amplitudes, and a particle filter close to that in Algorithm 7.2 is
devised; see [158].
An earlier work is that of Sterian et al. [609]. A Kalman filter was used to
extract sinusoidal partials from the signal. Then, these partials were grouped
into their sources by implementing the grouping principles of Bregman [49] in
terms of individual likelihood functions aimed at evaluating, e.g., the harmonic
concordance. The implementation was based on multiple hypothesis tracking.
Speech signals are not always voiced; that is, portions of a speech signal may not be well modelled
as a noisy sum of sines. Algorithms for multiple-speakers speech tracking take
this feature into account. Moreover, voiced speech is usually quite harmonic,
and thus speech partial frequency models do not incorporate inharmonicity.
Apart from Tabrikian et al. [616], which is for single pitch, Bach and Jordan [25]
and Wu et al. [677] (see Chapter 8) propose multipitch algorithms for speech
processing based on the acoustic waveform spectrogram, or correlogram. The
inference is based on variants of the Viterbi algorithm.
Though developed for a problem different from multiple FO estimation in
musical signals, methods developed for speech yield interesting frameworks
that may be adapted to music.
7.5 Conclusions
In this chapter, several methods for multiple FO estimation have been pre-
sented. Most of them rely on a generative model of the acoustic waveform,
and some model the signal spectrogram. Probabilistic models are defined in
order to make estimation of the generative model parameters feasible.
These methods are quite powerful in the sense that they capture a large
fraction of the information from the acoustic waveform, and this information
may be used for tasks other than multiple FO estimation. Moreover, Bayesian
approaches clearly distinguish model construction from inference (that is, pa-
rameter estimation algorithms). The drawback is that they can be compu-
tationally intensive. However, their computational cost may be reduced by
plugging as many heuristics as possible into the algorithms, for example to
build the proposal distributions in Monte Carlo algorithms, as this makes the
convergence faster. Designing a method of this kind can also be viewed as a principled way to build algorithms with known theoretical properties (convergence speed, estimation error, etc.). Heuristics can also be used to design
the unknown parameter priors. Finally, generative models may be specifically
designed for various transcription tasks. Models for melody transcription in
complex polyphonic music can surely be based on spectrogram modelling (see
Section 7.4) and are simpler than precise generative models aimed at estimat-
ing the subtle expressive controls of a guitar player.
Work in this vein is in its infancy. More research has to be done to use
as much prior information as possible, and to define more elaborate models,
which could also be used for, e.g., percussion transcription. This will probably
result in quite complex algorithms, but complexity is the main issue in multiple
FO estimation, and it may not be avoided.
8
Auditory Model-Based Methods for Multiple
Fundamental Frequency Estimation
Anssi Klapuri
8.1 Introduction
Fig. 8.1. A harmonic sound in the time and frequency domains. The example represents a violin sound with fundamental frequency 290 Hz and fundamental period 3.4 ms.
^More exactly, all instruments in the chordophone and aerophone families (see
Table 6.1 on p. 167).
Fig. 8.2. A vibraphone sound (F0 330 Hz) illustrated in the time and frequency domains. In the right panel, the frequencies of the most dominant spectral components are shown in relation to the F0.
Measuring the periodicity of the time-domain signal makes sense, since all the pitched musical sounds described above are periodic or almost periodic in the time domain. As reported in [134], quite accurate single F0 estimation can be achieved simply by an appropriate normalization of the short-time autocorrelation function (ACF), defined as

    r(τ) = Σ_{n=0}^{N-1} x(n) x(n + τ).    (8.2)

The F0 of the signal x(n) can be computed as the inverse of the lag τ that corresponds to the maximum of r(τ) within a predefined range. To avoid detecting an integer multiple of the period, short lags have to be favoured over longer ones.
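As a rough illustration of (8.2) and the peak-picking step just described, the following Python sketch estimates a single F0 from one analysis frame. The F0 search range and the mild short-lag weighting are illustrative assumptions rather than the settings of any particular published method.

    import numpy as np

    def acf_f0(x, fs, f_min=60.0, f_max=800.0):
        # r(tau) = sum_n x(n) x(n + tau), evaluated for all lags at once
        N = len(x)
        r = np.correlate(x, x, mode='full')[N - 1:]
        tau_min = int(fs / f_max)
        tau_max = min(int(fs / f_min), N - 1)
        lags = np.arange(tau_min, tau_max + 1)
        # favour short lags slightly to avoid picking a multiple of the period
        weights = 1.0 - 0.01 * (lags - tau_min) / max(len(lags) - 1, 1)
        tau_hat = lags[np.argmax(r[lags] * weights)]
        return fs / tau_hat

For example, acf_f0(frame, 44100.0) returns an F0 estimate in Hertz for a frame sampled at 44.1 kHz.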
An implicit way of measuring time-domain periodicity is to match a har-
monic pattern to the signal in the frequency domain. According to the Fourier
theorem, a periodic signal with period r can be represented with a series of
sinusoidal components at the frequencies j / r , where j is a positive integer.
This can be observed for the musical sounds in Figs. 8.1 and 8.2. Algorithms
that are based on frequency-domain harmonic pattern matching have been
proposed in [153], [54], [428], for example.
Another class of F0 estimators measures the periodicity of the Fourier spectrum of a sound [384], [380]. These methods are based on the observation that a harmonic sound has an approximately periodic magnitude spectrum, the period of which is the F0. In its simplest form, the autocorrelation function p(m) over an N-length magnitude spectrum is calculated as

    p(m) = (2/N) Σ_{k=0}^{N/2-m-1} |X(k)| |X(k + m)|.    (8.3)
In the above formula, any two frequency components with a certain spectral
interval m support the corresponding FO. The spectrum can be arbitrarily
shifted without affecting the output value. An advantage of this is that the
calculations are somewhat more robust against the imperfect harmonicity of
plucked and struck string instruments since the intervals between the overtone
partials do not vary as much as their absolute frequencies deviate from the
harmonic positions. However, in its pure form this approach has more draw-
backs than advantages. In particular, estimating low FOs is not reliable since
the FO resolution of the method is linear whereas the time-domain ACF leads
to 1/F resolution.
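For comparison, a minimal sketch of the spectral ACF of (8.3) is given below. The 2/N normalization follows the reconstruction above and does not affect the location of the maximum; the FFT length is an illustrative choice.

    import numpy as np

    def spectral_acf(x, n_fft=4096):
        # autocorrelation of the magnitude spectrum; peak spacing reflects the F0
        X = np.abs(np.fft.rfft(x, n_fft))
        p = np.zeros(n_fft // 2)
        for m in range(n_fft // 2):
            k = np.arange(0, n_fft // 2 - m)      # k = 0 ... N/2 - m - 1
            p[m] = (2.0 / n_fft) * np.sum(X[k] * X[k + m])
        return p   # F0 in Hz is approximately argmax(p) * fs / n_fft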
An interesting difference between the FO estimators in (8.2) and (8.3) is
that measuring the periodicity of the time-domain signal is prone to errors
in FO halving because the signal is periodic at twice the fundamental period
too, whereas measuring the periodicity of the magnitude spectrum is prone to
errors in FO doubling because the spectrum is periodic at twice the FO rate,
too. The two approaches can be combined using an auditory model, as will
be described in Section 8.3.2.
234 Anssi Klapuri
Fig. 8.3. An illustration of the cochlea (left) and its cross-section (middle). The
right panel shows a rough computational model of the cochlea.
Typically about 100 filters are used with centre frequencies uniformly dis-
tributed on a nearly logarithmic frequency scale (details in Section 8.3.1).
The outputs of individual filters simulate the mechanical movement of the
basilar membrane at different points along its length.
2. The signal at each band, or auditory channel, is processed to model the transform characteristics of the inner hair cells which produce neural impulses in the auditory nerve. In signal processing terms, this involves three main characteristics: compression and level adaptation, half-wave rectification, and lowpass filtering (details in Section 8.3.2).
In the following, the acoustic input signal is denoted by x(n) and the impulse response of an auditory filter by g_c(n), where c is the channel index. The output of the auditory filter at channel c is denoted by x_c(n) and functions as an input to the second step. The output of the inner hair cell model is denoted by z_c(n) and represents the probability of observing a neural impulse at channel c.
The processing mechanisms in the brain can be studied only indirectly and
are therefore not as accurately known. Typically the relative merits of differ-
ent models are judged according to their ability to predict the perception of
human listeners for various acoustic stimuli in psychoacoustic tests. Different
theories and models of the central auditory processing will be summarized in
Section 8.3.3, but in all of them, the following two processing steps can be
distinguished:
3. Periodicity analysis of some form takes place for the signals z_c(n) within the auditory channels. Phase differences between channels become meaningless.
4. Information is integrated across channels.
In the above processing chain, the auditory nerve signal z_c(n) represents a nice 'interface' between Steps 2 and 3 and thus between the peripheral and central processes. The signal in the auditory nerve has been directly measured in cats and in some other mammals, and this is why Stages 1
and 2 are quite well known. Computational models of the peripheral hearing
can approximate the auditory-nerve signal quite accurately, which is a great
advantage since an important part of the processing already takes place at
these stages. However, central processes and especially Step 3 are (arguably)
even more crucial in pitch perception. The above four steps are now described
in more detail.
Fig. 8.4. Frequency responses of a few auditory filters shown on the logarithmic (top) and on the linear magnitude scale (bottom). The dashed line in the upper panel shows the summary response of the filterbank when 70 auditory filters are distributed between 60 Hz and 7 kHz.
can be modelled with a bank of linear bandpass filters: Figure 8.4 shows an
example of such a filterbank.
The bandwidths and the shape of the power response of the auditory filters
have been studied using the masking phenomenon [192], [499]. Masking refers
to a situation where an audible sound becomes inaudible in the presence of
another, louder sound. In particular, if the distance between two spectral com-
ponents is less than a so-called critical bandwidth, one easily masks the other.
The situation can be thought of as if the components would go to the same
auditory filter, or to the same channel in the auditory nerve. If the frequency
separation is larger, the components are coded independently and are both
audible.
The bandwidths of the auditory filters can be conveniently expressed us-
ing the equivalent rectangular bandwidth (ERB) concept. The ERB of a fil-
ter is defined as the bandwidth of a perfectly rectangular filter which has a
unity magnitude response in its passband and an integral over the squared
magnitude response which is the same as for the specified filter. The ERB bandwidths b_c of the auditory filters have been found to obey

    b_c = 24.7 (4.37 f_c / 1000 + 1)    (8.4)

in Hertz, and the associated critical-band scale is

    ξ(f) = 21.4 log₁₀(4.37 f / 1000 + 1).    (8.5)
In the above expression, f denotes frequency in Hertz and ξ(f) gives the critical-band scale. When f varies between 0 Hz and 20 kHz, ξ(f) varies between 0 and 42. Intuitively, this means that approximately 42 critical bands (or auditory filters) would fit within the range of hearing if the passbands of the filters were non-overlapping and rectangular in shape. Conversion from the critical-band scale back to Hertz units is given by

    f = (10^{ξ/21.4} − 1) · 1000 / 4.37.    (8.6)
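The following helper functions implement these conversions, assuming the common Glasberg–Moore expressions, which match the 0–42 range quoted above; the exact constants used in the chapter may differ slightly.

    import numpy as np

    def erb_bandwidth(f):
        # ERB in Hz at centre frequency f (Hz)
        return 24.7 * (4.37 * f / 1000.0 + 1.0)

    def hz_to_cb(f):
        # critical-band (ERB-rate) scale value for frequency f in Hz
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    def cb_to_hz(xi):
        # inverse mapping from the critical-band scale back to Hz
        return (10.0 ** (xi / 21.4) - 1.0) * 1000.0 / 4.37

    # e.g. 70 centre frequencies uniformly spaced on the critical-band scale
    # between 60 Hz and 7 kHz, as in Fig. 8.4
    centres = cb_to_hz(np.linspace(hz_to_cb(60.0), hz_to_cb(7000.0), 70))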
Fig. 8.5. Impulse responses of two gammatone filters with centre frequencies 100 Hz (left) and 1.0 kHz (middle). The frequency response of the latter filter is shown on the right.
The impulse response of a gammatone filter is a sinusoid at the centre frequency of the filter, f_c, windowed with a function that is precisely the gamma distribution from statistics. Frequency responses of several gammatone filters are shown in Fig. 8.4.
The gammatone filters can be implemented efficiently using a cascade of
four second-order IIR filters. A detailed description of the design of the filter-
bank and the corresponding source code can be found in the technical report
by Slaney [591].
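As a sketch, the analytic impulse response of a gammatone filter can also be evaluated directly as below; the 1.019·ERB bandwidth setting is a common convention and an assumption here, whereas Slaney's implementation [591] realizes essentially the same response as a cascade of four second-order IIR sections.

    import numpy as np

    def gammatone_ir(fc, fs, duration=0.05, order=4):
        # sinusoid at the centre frequency, windowed by a gamma-shaped envelope
        t = np.arange(int(duration * fs)) / fs
        b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)   # bandwidth ~ 1.019 ERB
        env = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
        g = env * np.cos(2.0 * np.pi * fc * t)
        return g / np.max(np.abs(g))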
Inner hair cells (IHC) are the elements which convert the mechanical motion of
the basilar membrane into firing activity in the auditory nerve. Each IHC rests
at a certain point along the basilar membrane and thus follows its movement
at this position. Correspondingly, in the computational models the output of
each auditory filter is processed by an IHC model.
The IHCs produce neural impulses, or 'spikes', which are binary events.
However, since there is a large population of the cells, it is conventional to
model the firing probability as a function of the basilar membrane movement.
Thus the input to an IHC model comes from the output of an auditory filter, x_c(n), and the output of the IHC model represents the time-varying firing probability denoted by z_c(n).
Several computational models of the IHCs have been proposed. An exten-
sive comparison of eight different models was presented by Hewitt and Meddis
in [291]. In the evaluation, the model of Meddis [456] outperformed the others
by showing only minor discrepancies with the empirical data and by being
also one of the most efficient computationally. An implementation of this
model is available in the AIM [501] and HUTear [273] auditory toolboxes, for
example.
A problem with the realistic IHC models is that they depend critically on
the absolute level of their input signal. The dynamic range of the model of
Meddis [456], for example, is only 25 dB and the firing rate saturates at the
60 dB level. This limitation of individual IHCs is real, and it seems that the
auditory system uses a population of IHCs with different dynamic ranges to
Fig. 8.6. Upper panels show a signal consisting of the overtone partials 13-17 of a sound with F0 200 Hz (fundamental period 5 ms) in the time and frequency domains. Middle panels illustrate the signal after half-wave rectification. Lower panels show the result of lowpass filtering the rectified signal with a 1 kHz cut-off.
samples [457], [459].^ It should be noted that the data structure at this stage was three dimensional (c × τ × n). Across-channel information integration was then done simply by summing across channels, resulting in a summary ACF s(n, τ) = Σ_c r_c(n, τ). Pitch at time n was estimated by searching for the highest peak in s(n, τ) within a predefined lag range [457, p. 2884].
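In rough outline, Steps 2–4 of this processing chain can be sketched as follows. The cochlear filterbank producing the band signals is assumed to exist already, and the simple Butterworth lowpass stands in for the Meddis hair-cell model, so this is an illustration of the structure rather than a faithful reimplementation.

    import numpy as np
    from scipy.signal import butter, lfilter

    def summary_acf(band_signals, fs, lowpass_hz=1000.0):
        b, a = butter(2, lowpass_hz / (fs / 2.0), btype='low')
        s = None
        for x_c in band_signals:
            z_c = lfilter(b, a, np.maximum(x_c, 0.0))          # rectify and lowpass
            N = len(z_c)
            r_c = np.correlate(z_c, z_c, mode='full')[N - 1:]  # within-channel ACF
            s = r_c if s is None else s + r_c                  # sum across channels
        return s

    # pitch period: highest peak of s within a predefined lag range, e.g. 2-20 ms:
    # tau = np.argmax(s[int(0.002 * fs):int(0.020 * fs)]) + int(0.002 * fs)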
Meddis and Hewitt demonstrated that the model was able to predict the
perceived pitch for a large set of test stimuli used previously in psychoa-
coustic tests [457]. Moreover, Meddis and O'Mard later noted that the im-
plementation is a special case of a more general model consisting of four
stages: (i) cochlear bandpass filtering, (ii) half-wave rectification and lowpass
filtering, (iii) within-channel periodicity extraction, and (iv) across-channel
aggregation of periodicity estimates [459]. This became known as the unitary
model of pitch perception because the single model was capable of simulating
a wide range of pitch perception phenomena. Different variants of the unitary
model have been used since then in a number of signal analysis systems [171],
[133], [627], [677].
Cariani and Delgutte carried out a direct experiment to find out the char-
acteristics in the auditory nerve signals that correlate with the perceived pitch
[67]. Instead of using a simulated cochlea, the authors studied the signal in
the auditory nerve of a cat in response to complex acoustic waveforms. They
found that the time intervals between neural spikes are particularly important
in encoding pitch. The authors computed histograms of time intervals between
both successive and non-successive impulses in individual auditory nerve fi-
bres, and summed the histograms of 507 fibres to form a pooled histogram.
What the authors noticed was that, for a diverse set of audio signals, the
perceived pitch correlated strongly with the most frequent interspike interval
in the pooled histogram at any given time [67]. This suggests that the pitch of
these signals could result from central auditory processing mechanisms that
analyse interspike interval patterns. Computational models of the cochlea do
not produce discrete neural spikes but rather real-valued signals z_c(n), which represent the probability of a neural firing (in different nerve fibres). How-
ever, Cariani and Delgutte noted that the interspike interval codes are closely
related to autocorrelation operations [67, p. 1712]. For a real-valued signal,
ACF can replace the interval histogram.
Despite the above strong evidence, it seems that the ACF is not precisely the mechanism used for periodicity estimation in the central auditory system, since some experimental and neurophysiological findings contradict the ACF (see e.g. [322] and the brief summary in [131, p. 1262]). Meddis and Hewitt,
for example, used the ACF but wanted to 'remain neutral about the exact
^In practice, the windowing and summing can be implemented very efficiently
using a leaky integrator.
Fig. 8.7. The impulse response (left) and frequency response (middle) of a comb filter with a feedback delay of 10 ms and feedback gain 0.9. For comparison, the right panel shows the power response of the ACF for a 10 ms lag.
A comb filter with feedback implements the recursion

    y(n) = x(n) + α y(n − τ),    (8.12)

where τ is the feedback delay and 0 < α < 1 is the feedback gain.
Figure 8.7 shows the impulse response and the frequency response of a comb filter with a feedback delay τ = 10 ms. For comparison, the power response of the ACF for the corresponding lag τ is shown in the rightmost panel.^ As can be seen, the comb filter is more sharply tuned to the harmonic frequencies of the period candidate, and no negative weights are applied between these.
Periodicity analysis with comb filters can be accomplished by invoking a bank of such filters with different feedback delays τ and by computing locally time-averaged powers at the outputs of the filters. Figure 8.8 illustrates the output powers of a bank of comb filters for a couple of test signals. In the case of a periodic signal, all comb filters that are in rational-number relations to the period of the sound show a response to it, as seen in panel (b).
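A direct (and deliberately unoptimized) sketch of this periodicity analysis is given below: one feedback comb filter per candidate delay is applied to the signal, and the time-averaged output powers are returned. The 0.9 feedback gain follows the value quoted for Fig. 8.8, while the delay grid is an illustrative choice.

    import numpy as np

    def comb_bank_powers(x, fs, delays_ms, gain=0.9):
        powers = []
        for d_ms in delays_ms:
            tau = max(int(round(d_ms * fs / 1000.0)), 1)
            y = np.zeros(len(x))
            for n in range(len(x)):
                fb = gain * y[n - tau] if n >= tau else 0.0
                y[n] = x[n] + fb                  # y(n) = x(n) + a*y(n - tau)
            powers.append(np.mean(y ** 2))
        return np.array(powers)  # peaks at delays rationally related to the period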
A bank of comb filters has been proposed for auditory processing e.g. by Cariani [66, Eq. (1)], who used the filterbank to separate concurrent vowels with different F0s. Cariani also proposed a non-linear mechanism which consisted of an array of delay lines, each associated with its characteristic delay and a non-linear feedback mechanism instead of the linear one in (8.12). Periodic sounds were reported to be captured by the corresponding delay loop and thus became segregated from the mixture signal. The strobed temporal
^As a non-linear operation, the ACF does not have a frequency response. How-
ever, since the ACF of a time-domain signal is the inverse Fourier transform of its
power spectrum, the power response of the ACF can be depicted for a single period
value.
Fig. 8.8. Normalized output powers of a bank of comb filters for (a) a sinusoidal
with 24-sample period and (b) an impulse train with the same period. The feedback
delays of the filters are shown on the x-axis and all the feedback gains were 0.9. The
panels (c) and (d) show the ACFs of the same signals, respectively.
Fig. 8.9. Illustration of the log-lag correlogram of Ellis [171]. Input signal in this case
was a trumpet sound with FO 260 Hz (fundamental period 3.8 ms). The left panel
illustrates the three-dimensional correlogram volume. The middle panel shows the
zero-lag face of the correlogram which is closely related to the power spectrogram.
The right panel shows one time slice of the volume, from which the summary ACF
can be obtained by summing over frequency.
the analysis result in practice. The idea of using the same data representation
as the human auditory system is therefore very appealing. The aim of this
section is to investigate the advantages and disadvantages of doing this and
to introduce a few auditory-model implementations that have been employed.
For this purpose, three different music transcription systems are briefly intro-
duced. A discussion of other mid-level data representations in acoustic signal
analysis can be found in [173] and in Chapter 3.
Godsmark and Brown proposed a system for modelling the auditory scene
analysis (ASA) function in humans, that is, our ability to perceive and recog-
nize individual sound sources in mixture signals [215]. The authors used music
signals as their test material. ASA is usually viewed as a two-stage process
where a mixture signal is first decomposed into time-frequency components
of some kind, and these are then grouped to their respective sound sources.
In humans, the grouping stage has been found to depend on various acoustic
properties of the components, such as their harmonic frequency relationships,
common onset times, or synchronous frequency modulation [49].
Godsmark and Brown used the auditory model of Cooke [100] for the de-
composition stage. This auditory model also uses a bank of gammatone filters
at its first stage. Notable in Cooke's model is that rectification and lowpass
filtering are not applied at the filterbank outputs but only the compression
and level adaptation properties of the IHCs are modelled, amounting to an au-
ditorily motivated bandwise gain control. Thus the overall model can actually
be viewed as a sophisticated way of extracting sinusoidal components from an
input signal, instead of being a complete and realistic model of the auditory
periphery. The frequency of the most prominent sinusoidal component at the
output of each auditory filter is tracked through time using median-smoothed
instantaneous-frequency estimation [100, p. 36] and, in addition, the instan-
taneous amplitudes of the components are calculated. Since the passbands of
the gammatone filters overlap, usually several adjacent filters show response
to the same frequency component. This redundancy is removed by combining
the outputs of adjacent channels so as to form 'synchrony strands' which rep-
resent the time-frequency behaviour of dominant spectral components in the
input signal.
The main focus in the work of Godsmark and Brown was on developing a
computational architecture which would facilitate the integration of different
spectral organization (grouping) principles [215]. The synchrony strands were
used as the elementary units that were grouped to sound sources. The authors
reported that these were particularly suitable for modelling the ASA because
the temporal continuity of the strands is made explicit and they are sufficiently
few in number to perform the grouping for every strand.^ Godsmark and
Brown computed various acoustic features for each strand and then performed
grouping according to onset and offset synchrony, time-frequency proximity,
harmonicity, and common frequency movement.
Godsmark and Brown evaluated their model by investigating its ability
to segregate polyphonic music into its constituent melodic lines. This in-
cluded both multiple FO estimation and organization of the resulting notes
into melodic lines according to the applied musical instruments. The latter
task was carried out by computing pitch and timbre proximities between suc-
cessive sounds. Although transcription accuracy as such was not the main
goal, promising results were obtained for musical excerpts with polyphonies
ranging from one to about four simultaneous sounds.
Good transcription results were reported for a test set of three real and
three synthesized piano performances. Concerning the use of the auditory
model, Marolt reported that the compression and level adaptation properties
of Meddis's IHC model were important to the system as they reduced the
dynamic range of the signal and thus enabled the system to track small-
amplitude partials.
In some cases, the overtone partials of the component sounds of a musical chord match the overtones of a non-existing chord root, and the highest peak in the summary ACF indicates the chord root instead of one of the component sounds.^ Further, the pitch models do not address robust-
ness against additive noise: drum sounds often accompany the pitched sounds
in music. Finally, the computational complexity of the models is rather high
since they involve periodicity analysis at a large number of sub-bands.
On the other hand, there are several issues that are quite efficiently dealt
with using a pitch model. These were summarized in Section 8.4.4 above.
In the following, a number of different methods are described that aim at
overcoming the above-mentioned shortcomings. Some of these were designed
for two-speaker speech signals but are included here in order to cover the
substantial amount of work done in the analysis of multiple-speaker speech
signals. This is followed by a more detailed description of two multiple FO
estimation methods for music signals. It should be noted that the main interest
in this section is not to model hearing but to address the practical task of
multiple FO estimation.
Channel Selection
Meddis and Hewitt extended their pitch model (see p. 241) to simulate the
human ability to identify two concurrent vowels with different FOs [458]. The
proposed method included a template-matching process to recognize the vow-
els too, but here only the FO estimation part is summarized. It consists of the
following steps:
^Examples of such chords are the major triad and the interval of a perfect fifth.
1. The pitch model of Meddis and Hewitt is applied [457]. This involves
a bank of gammatone filters, Meddis's IHC simulation, within-channel
ACF computation, and across-channel summing. The highest peak in the
summary ACF within a predefined lag range is used to estimate the FO
of the more dominant sound.
2. Individual channel ACFs that show a peak at the period of the first de-
tected FO are removed. If more than 80% of the channels get removed,
only one FO is judged to be present and the algorithm terminates.
3. The ACFs of the remaining channels are combined into a new summary
ACF from which the FO of the other vowel is derived.
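A rough sketch of these three steps is given below, operating on a (channels × lags) array of within-channel ACFs. The peak test used for channel removal is a simplification of the original criterion, and the lag range and threshold are assumptions.

    import numpy as np

    def two_f0_channel_selection(channel_acfs, fs, lag_ms=(2.0, 20.0), thr=0.8):
        lo, hi = int(lag_ms[0] * fs / 1000.0), int(lag_ms[1] * fs / 1000.0)

        def dominant_lag(acfs):
            return lo + int(np.argmax(acfs.sum(axis=0)[lo:hi]))

        tau1 = dominant_lag(channel_acfs)                       # Step 1
        # a channel 'shows a peak' at tau1 if its ACF there is near its own maximum
        peaked = channel_acfs[:, tau1] > thr * channel_acfs[:, lo:hi].max(axis=1)
        if np.mean(peaked) > 0.8:                               # Step 2
            return [fs / tau1]
        tau2 = dominant_lag(channel_acfs[~peaked])              # Step 3
        return [fs / tau1, fs / tau2]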
The authors did not give statistics on the FO estimation accuracy, but
reported clear improvements in vowel recognition as the FO difference of the
two sounds was increased from zero to one semitone or beyond.
Time-Domain Cancellation

De Cheveigne proposed estimating periods by cancellation, using the squared-difference function (SDF)

    SDF(n, τ) = Σ_{j=n}^{n+N-1} [x(j) − x(j + τ)]²,

where N is the analysis frame size [131].^ By expanding the square, it can be seen that SDF(n, τ) = E(n) + E(n + τ) − 2r(n, τ), where E(n) denotes the signal power at time n and r(n, τ) is the ACF. Thus the SDF and the ACF are functionally equivalent, and period estimation can be carried out by
searching for minima in the SDF instead of maxima in the ACF. De Cheveigne also proposed a joint cancellation model, where two cancellation filters with periods τ_A and τ_B were applied in a cascade so as to cancel two periodic sounds. By computing the power of the resulting signal as a function of the two periods, the F0s were found by locating the minimum of the two-dimensional function [129], [133].
De Cheveigne evaluated both the iterative and the joint FO estimation
method for mixtures of two-voiced speech segments [129]. The iterative algo-
rithm was reported to produce estimates which were correct within 3% accu-
racy in 86% of the frames and the exhaustive joint estimator produced correct
estimates in 90% of the frames. Computational complexity is a drawback of
the joint estimator.
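The following sketch shows the SDF of a frame and a brute-force version of the joint two-period search; it is illustrative only and far from the optimized implementations discussed in [129], [133]. The F0 search range is an assumption.

    import numpy as np

    def sdf(x, tau_max):
        # SDF(tau) = sum_j (x(j) - x(j + tau))^2; search its minima for tau > 0
        N = len(x) - tau_max
        d = np.zeros(tau_max)
        for tau in range(1, tau_max):
            diff = x[:N] - x[tau:N + tau]
            d[tau] = np.sum(diff ** 2)
        return d

    def joint_two_f0(x, fs, f_min=80.0, f_max=400.0):
        taus = np.arange(int(fs / f_max), int(fs / f_min) + 1)
        best, best_pair = np.inf, None
        for ta in taus:
            ya = x[ta:] - x[:-ta]                 # cancel a sound with period ta
            for tb in taus:
                yb = ya[tb:] - ya[:-tb]           # then cancel a sound with period tb
                p = np.mean(yb ** 2)
                if p < best:
                    best, best_pair = p, (fs / ta, fs / tb)
        return best_pair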
Fig. 8.10. Block diagram of the pitch analysis method proposed by Karjalainen and Tolonen [627]: pre-whitening, a split into a low channel (lowpass at 1 kHz) and a high channel (highpass at 1 kHz, half-wave rectification, lowpass at 1 kHz), periodicity detection in both channels, a summary ACF, and an enhancer producing the enhanced summary ACF. © 2005 IEEE, reproduced here by permission.
The ACF of a signal x is the inverse Fourier transform of its power spectrum [276, p. 334]. The generalized ACF, then, is defined as

    r(τ) = IDFT( |DFT(x)|^α ),

where DFT and IDFT denote the discrete Fourier transform and its inverse, and α is a free parameter which determines the frequency-domain compression.^ The standard ACF is obtained by substituting α = 2. Definition of
the cepstrum of x is analogous to ACF and is obtained by replacing the sec-
ond power with the logarithm function. The difference between the ACF and
cepstrum-based FO estimators is quantitative: raising the magnitude spectrum
to the second power emphasizes spectral peaks in relation to noise but, on
the other hand, further aggravates spectral peculiarities of the target sound.
Applying the logarithm function causes the opposite for both. And indeed,
ACF-based FO estimators have been reported to be relatively noise immune
but sensitive to formant structures in speech, and vice versa for cepstrum-
based methods [535]. As a trade-off, Karjalainen and Tolonen suggested using the value α = 0.67.
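A minimal sketch of the generalized ACF is shown below; α = 2 gives the ordinary ACF, and α = 0.67 is the compromise value quoted above. Zero padding to twice the frame length is an implementation detail assumed here.

    import numpy as np

    def generalized_acf(x, alpha=0.67, n_fft=None):
        n_fft = n_fft or 2 * len(x)                  # zero padding against wrap-around
        X = np.fft.rfft(x, n_fft)
        r = np.fft.irfft(np.abs(X) ** alpha, n_fft)  # IDFT(|DFT(x)|^alpha)
        return r[:n_fft // 2]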
Extension to multiple FO estimation was achieved by cancelling subhar-
monics in the summary ACF (SACF) by clipping the SACF to positive values,
time-scaling it to twice its length, and by subtracting the result from the orig-
inal clipped SACF. This cancellation operation was repeated for time-scaling
factors up to about five. From the resulting enhanced SACF, all FOs were
picked without iterative estimation and cancellation. In more detail, the en-
hancing procedure was as follows:
1. The enhanced SACF ŝ(τ) is initialized to be equal to the SACF s(τ). The scaling factor m is initialized to the value 2.
2. The original SACF is time-scaled to m times its length and the result is denoted by s_m(τ). Using linear interpolation,
Fig. 8.11. Left: The ACFs at the low and the high channel for a violin sound (F0 523 Hz). Middle: SACF and enhanced SACF for the same sound. Right: SACF and enhanced SACF for a major triad chord played by the trumpet (F0s 220 Hz, 277 Hz, and 330 Hz). The circles indicate the correct fundamental periods.
The enhancing procedure also partly solves the 'chord root' problem mentioned in the beginning of Section 8.5, since it scales the true F0 peaks to the position of the chord root and, if a note does not truly appear at the root, the spurious peak becomes cancelled. The only place where care has to be taken is in setting the values of the original SACF to zero in the lag range [0, f_s/1000 Hz] before the enhancing (here f_s denotes the sampling rate). This ensures that the values on the τ = 0 hill do not spread and wipe away important information. Zeroing the mentioned lags causes no harm for the analysis since the algorithm cannot detect F0s above 1 kHz.
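A sketch of the enhancing procedure, following the prose description above (clip, stretch by factors m = 2…5, subtract, and zero the short lags), is given below. Details such as the exact interpolation and the order of the clipping operations are assumptions.

    import numpy as np

    def enhance_sacf(sacf, fs, max_factor=5):
        s = np.maximum(sacf, 0.0)
        s[: int(fs / 1000.0)] = 0.0                   # zero the lags below 1 ms
        enhanced = s.copy()
        lags = np.arange(len(s))
        for m in range(2, max_factor + 1):
            stretched = np.interp(lags / m, lags, s)  # time-scale to m times its length
            enhanced = np.maximum(enhanced - np.maximum(stretched, 0.0), 0.0)
        return enhanced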
Figure 8.11 illustrates the enhancing procedure for an isolated sound and
for a musical chord. As mentioned by Martin [439], the SACF indicates the
non-existing FO of the chord root in the latter case. After enhancing, however,
the true FOs are revealed.
Overall, the method of Karjalainen and Tolonen is quite accurate and it
has been described in sufficient detail to be exactly implementable based on
[627] and on the Matlab toolbox for frequency-warped signal processing by
Harma et al. [272]. A drawback of the method as stated by the authors is
that it is 'not capable of simulating the spectral pitch' [627, p. 713], i.e., the
pitch of a sound whose first few harmonics are above 1 kHz. In practice, the
method is most accurate for FOs below about 600 Hz. Later, Karjalainen and
Tolonen also proposed an iterative approach to multiple FO estimation using
the described simplified auditory model [328].
Fig. 8.12. Block diagram of the multiple F0 estimator of Klapuri [354]: the sub-band signals x_c(n) are compressed and rectified, the magnitude spectra |Z_c(k)| of the resulting signals z_c(n) are cumulated into a summary magnitude spectrum U(k), and periodicity detection, normalization, and peak picking yield the fundamental periods. © 2005 IEEE, reproduced here by permission.
In the peripheral hearing model, an input signal was first passed through a
bank of gammatone filters with centre frequencies uniformly distributed on
the critical-band scale (see (8.5)) between 60 Hz and 5.2 kHz. A total of 72
filters were employed using the implementation of Slaney [591].
Hair cell transduction was modelled by compressing, half-wave rectifying,
and lowpass filtering the sub-band signals. The compression was implemented
by simulating the full-wave νth-law compression (FWC), which is defined as

    FWC(x) = x^ν         for x ≥ 0,
    FWC(x) = −(−x)^ν     for x < 0.    (8.18)
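A compact sketch of this hair-cell stage (compression according to (8.18), half-wave rectification, and lowpass filtering) is shown below; the compression exponent ν = 0.33 and the 1 kHz Butterworth lowpass are illustrative assumptions, not values taken from the chapter.

    import numpy as np
    from scipy.signal import butter, lfilter

    def fwc(x, nu=0.33):
        # full-wave nu-th law compression of (8.18)
        return np.sign(x) * np.abs(x) ** nu

    def hair_cell_stage(x_c, fs, nu=0.33, lowpass_hz=1000.0):
        b, a = butter(2, lowpass_hz / (fs / 2.0), btype='low')
        return lfilter(b, a, np.maximum(fwc(x_c, nu), 0.0))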
For a narrow-band signal, such as the output of an auditory filter, the effect
of the FWC within the passband of the filter can be accurately modelled by
simply scaling the signal with a factor
Periodicity Analysis
The magnitude spectra |Z_c(k)| of the processed sub-band signals were summed across channels into a summary magnitude spectrum (SMS), U(k) = Σ_c |Z_c(k)|, where the time index n has been omitted to simplify the notation in the following. The SMS functioned as an intermediate data representation and all the subsequent processing took place using it only.
Figure 8.13 illustrates the bandwise magnitude spectra |Z_c(k)| for a saxophone sound. As can be seen, the within-channel rectification maps the contri-
bution of higher-order partials to the position of the FO and its few multiples
in the spectrum. Most importantly, the degree to which an individual overtone
partial j is mapped to the position of the fundamental increases as a function
of j . This is because the auditory filters become wider at higher frequencies
and the partials thus have larger-magnitude neighbours with which to gener-
ate the difference frequencies (beating) in the envelope spectrum. Klapuri's
method was largely based on this observation, as will be explained below.
The second modification concerned the function cos(·) in (8.20), which can be seen as a harmonic template that picks the overtone partials of the frequency K/τ in the spectrum (see the rightmost panel of Fig. 8.7 on p. 243). The function was replaced by a response that is more sharply tuned to the frequencies
Fig. 8.13. The spectra |Z_c(k)| at a few channels for a tenor saxophone sound (F0 131 Hz). The channel centre frequencies shown range from 65 Hz to 4982 Hz.
where f_s denotes the sampling rate and the factors f_s/τ and H_LP(k) are related to the third modification to be explained later. The set K_{j,τ} defines a narrow range of frequency bins in the vicinity of the jth overtone partial of the F0 candidate f_s/τ. More exactly, K_{j,τ} = [k_{j,τ}^{(0)}, k_{j,τ}^{(1)}], where

    k_{j,τ}^{(0)} = ⌊jK/(τ + Δτ/2)⌋ + 1,    (8.23)
    k_{j,τ}^{(1)} = max( ⌊jK/(τ − Δτ/2)⌋, k_{j,τ}^{(0)} ).    (8.24)
Fig. 8.14. The upper panels show the summary magnitude spectrum U(k) for a saxophone sound with F0 131 Hz (left) and a violin sound with F0 1050 Hz (right). The lower panels show the corresponding salience functions Λ(τ).

Figure 8.14 illustrates the calculation of Λ(τ) for the saxophone sound shown in Fig. 8.13, and for a violin sound with the F0 1050 Hz.
where d = 0.5 controls the amount of the subtraction and is a free parameter of
the algorithm.
6. Return to Step 2.
8.5.4 Results
[Fig. 8.15: multiple F0 estimation error rates (black bars) and predominant F0 estimation error rates (white bars) as a function of polyphony (1, 2, 4, 6), in 46 ms and 93 ms analysis frames; the F0 range extends up to 2.1 kHz in the middle and right panels.]
The left-hand panels of Fig. 8.15 show the error rates for the method of
Tolonen and Karjalainen in 46 ms and 93 ms analysis frames. The FO range in
these experiments was limited to the three octaves between 65 Hz and 520 Hz,
because the accuracy of the method was found to degrade rapidly above 600 Hz
(see Section 8.5.2). The black bars show the multiple FO estimation error
rates and the white bars show the predominant FO estimation error rates.
The global maximum of the enhanced SACF was used for the latter purpose.
The method performed robustly in polyphonic mixtures, and especially the
predominant FO estimation error rates remained reasonably low even in short
time frames and in rich polyphonies. Taking into account the computational
efficiency (faster than real-time) and conceptual simplicity of the method, the
results are very good.
The middle panels of Fig. 8.15 show the error rates for the method of Kla-
puri [354]. The first detected FO was used for the predominant FO estimation.
In these experiments, the pitch range was limited to five octaves between
65 Hz and 2.1 kHz. The method performs robustly in all cases and is very
accurate, especially in the 93 ms analysis frame. Computational complexity is
a drawback of this method. The calculations are clearly slower than real-time
on a 2-GHz desktop computer, the most intensive part being the cochlear
filterbank and the within-band DFT calculations.
The right-hand panels of Fig. 8.15 show the error rates for a state-of-the-
art reference method proposed by Klapuri in [351]. This method is based on
Fig. 8.16. Error rates as a function of the interval between the sound onset and
the beginning of a 46 ms analysis frame. The two panels on the left show results
for the method of Tolonen and Karjalainen [627] and the two panels on the right
for the method of Klapuri [354]. The black bars show multiple FO estimation error
rates and the white bars show predominant FO estimation error rates.
8.5.5 Summary of the Multiple F0 Estimation Methods

The beginning of this section listed several issues where the pitch perception models fall short of being practically applicable multiple F0 estimators. This section summarizes and discusses the various technical solutions that were proposed as improvements.
8.6 Conclusions

9
Unsupervised Learning Methods for Source Separation

Tuomas Virtanen

9.1 Introduction
Computational analysis of polyphonic musical audio is a challenging problem.
When several instruments are played simultaneously, their acoustic signals
mix, and estimation of an individual instrument is disturbed by the other co-
occurring sounds. The analysis task would become much easier if there was
a way to separate the signals of different instruments from each other. Tech-
niques that implement this are said to perform sound source separation. The
separation would not be needed if a multi-track studio recording was available
where the signal of each instrument is on its own channel. Also, recordings
done with microphone arrays would allow more efficient separation based on
the spatial location of each source. However, multi-channel recordings are usu-
ally not available; rather, music is distributed in stereo format. This chapter
discusses sound source separation in monaural music signals, a term which
refers to a one-channel signal obtained by recording with a single microphone
or by mixing down several channels.
There are many signal processing tasks where sound source separation
could be utilized, but the performance of the existing algorithms is still quite
limited compared to the human auditory system, for example. Human listeners
are able to perceive individual sources in complex mixtures with ease, and
several separation algorithms have been proposed that are based on modelling
the source segregation ability in humans (see Chapter 10 in this volume).
Recently, the separation problem has been addressed from a completely
different point of view. The term unsupervised learning is used here to char-
acterize algorithms which try to separate and learn the structure of sources
in mixed data based on information-theoretical principles, such as statisti-
cal independence between sources, instead of sophisticated modelling of the
source characteristics or human auditory perception. Algorithms discussed in
this chapter are independent component analysis (ICA), sparse coding, and
non-negative matrix factorization (NMF), which have been recently used in
source separation tasks in several application areas. When used for monau-
ral audio source separation, these algorithms usually factor the spectrogram
or other short-time representation of the input signal into elementary com-
ponents, which are then clustered into sound sources and further analysed to
obtain musically important information. Although the motivation of unsuper-
vised learning algorithms is not in the human auditory perception, there are
similarities between them. For example, all the unsupervised learning meth-
ods discussed here are based on reducing redundancy in data, and it has been
found that redundancy reduction takes place in the auditory pathway, too [85].
The focus of this chapter is on unsupervised learning algorithms which
have proven to produce applicable separation results in the case of music
signals. There are some other machine learning algorithms which aim at sep-
arating speech signals based on pattern recognition techniques, for example
[554].
All the algorithms mentioned above (ICA, sparse coding, and NMF) can
be formulated using a linear signal model which is explained in Section 9.2.
Different data representations are discussed in Section 9.2.2. The estimation
criteria and algorithms are discussed in Sections 9.3, 9.4, and 9.5. Methods
for obtaining and utilizing prior information are presented in Section 9.6.
Once the spectrogram is factored into components, these can be clustered
into sound sources or further analysed to obtain musical information. The
post-processing methods are discussed in Section 9.7. Systems extended from
the linear model are discussed in Section 9.8.
When several sound sources are present simultaneously, the acoustic wave-
forms of the individual sources add linearly. Sound source separation is defined
as the task of recovering each source signal from the acoustic mixture. A com-
plication is that there is no unique definition for a sound source. One possi-
bility is to consider each vibrating physical entity, for example each musical
instrument, as a sound source. Another option is to define this according to
what humans tend to perceive as a single source. For example, if a violin section plays in unison, the violins are perceived as a single source, and usually
there is no need to separate the signals played by each violin. In Chapter 10,
these two alternatives are referred to as physical source and perceptual source,
respectively (see p. 302). Here we do not specifically commit ourselves to either
of these. The type of the separated sources is determined by the properties
of the algorithm used, and this can be partly affected by the designer accord-
ing to the application at hand. In music transcription, for example, all the
equal-pitched notes of an instrument can be considered as a single source.
Many unsupervised learning algorithms, for example standard ICA, require
that the number of sensors be larger or equal to the number of sources. In
multi-channel sound separation, this means that there should be at least as
The methods discussed in this chapter model the observation vector x_t in frame t (for example its magnitude spectrum) as a weighted sum of basis functions b_n,

    x_t ≈ Σ_{n=1}^{N} g_{n,t} b_n,    t = 1, ..., T,    (9.1)

where N ≪ T is the number of basis functions, and g_{n,t} is the amount of contribution, or gain, of the nth basis function in the tth frame. Some methods estimate both the basis functions and the time-varying gains from a mixed input signal, whereas others use pre-trained basis functions or some prior information about the gains.
The term component refers to one basis function together with its time-
varying gain. Each sound source is modelled as a sum of one or more compo-
nents, so that the model for source m in frame t is written as

    y_{m,t} = Σ_{n ∈ S_m} g_{n,t} b_n,    (9.2)

where S_m is the set of components within source m. The sets are disjoint, i.e., each component belongs to only one source.
In (9.1) approximation is used, since the model is not necessarily noise-free.
The model can also be written with a residual term r_t as

    x_t = Σ_{n=1}^{N} g_{n,t} b_n + r_t,    t = 1, ..., T.    (9.3)

By assuming some probability distribution for the residual and a prior distribution for other parameters, a probabilistic framework for the estimation of b_n and g_{n,t} can be formulated (see e.g. Section 9.4). Here (9.1) without the
residual term is preferred for its simplicity. For T frames, the model (9.1) can be written in matrix form as

    X ≈ BG,    (9.4)

where X = [x_1, x_2, ..., x_T] is the observation matrix, B = [b_1, b_2, ..., b_N] is the mixing matrix, and [G]_{n,t} = g_{n,t} is the gain matrix. The notation [G]_{n,t} is used to denote the (n, t)th entry of matrix G. The term mixing matrix is typically used in ICA, and here we follow this convention.
The estimation algorithms can be used with several data representations.
Often the absolute values of the DFT are used; this is referred to as the mag-
nitude spectrum in the following. In this case, x_t is the magnitude spectrum within frame t, and each component n has a fixed magnitude spectrum b_n with a time-varying gain g_{n,t}. The observation matrix consisting of framewise
magnitude spectra is here called a magnitude spectrogram. Other representa-
tions are discussed in Section 9.2.2.
The model (9.1) is flexible in the sense that it is suitable for represent-
ing both harmonic and percussive sounds. It has been successfully used in
the transcription of drum patterns [188], [505] (see Chapter 5), in the pitch
estimation of speech signals [579], and in the analysis of polyphonic music
signals [73], [600], [403], [650], [634], [648], [43], [5].
Figure 9.1 shows an example signal which consists of a diatonic scale and
a C major chord played by an acoustic guitar. The signal was separated into
components using the NMF algorithm described in [600], and the resulting
components are depicted in Fig. 9.2. Each component corresponds roughly to
one fundamental frequency: the basis functions are approximately harmonic
and the time-varying gains follow the amplitude envelopes of the notes. The
separation is not perfect because of estimation inaccuracies. For example, in
some cases the gain of a decaying note drops to zero when a new note begins.
Factorization of the spectrogram into components with a fixed spectrum
and a time-varying gain has been adopted as a part of the MPEG-7 pattern
recognition framework [72], where the basis functions and the gains are used
as features for classification. Kim et al. [341] compared these to mel-frequency
cepstral coefficients which are commonly used features in the classification of
audio signals. In this study, mel-frequency cepstral coefficients performed bet-
ter in the recognition of sound effects and speech than features based on ICA
or NMF. However, final conclusions about the applicability of these methods
to sound source recognition have yet to be made. The spectral basis decompo-
sition specified in MPEG-7 models the summation of components on a decibel
scale, which makes it unlikely that the separated components correspond to
physical sound objects.
The model (9.1) presented in the previous section can be used with time-
domain or frequency-domain observations and basis functions. Time-domain
Fig. 9.1. Spectrogram of an example signal which consists of a diatonic scale from C5 to C6, followed by a C major chord (simultaneous notes C5, E4, and G5), played by an acoustic guitar. The notes are not damped, meaning that consecutive notes overlap.
Time-Domain Representation
Fig. 9.2. Components estimated from the example signal in Fig. 9.1. Basis functions
are plotted on the right and the corresponding time-varying gains on the left. Each
component except the bottom one corresponds to an individual pitch value and the
gains follow roughly the amplitude envelope of each note. The bottom component
models the attack transients of the notes. The components were estimated using the
NMF algorithm [400], [600] and the divergence objective (explained in Section 9.5).
Frequency-Domain Representation
When using a frequency transform such as the DFT, the phases of the
complex-valued transform can be discarded by considering only the mag-
nitude or power spectrum. Even though some information is lost, this also
eliminates the phase-related problems of time-domain representations. Unlike
time-domain basis functions, many real-world sounds can be rather well ap-
proximated with a fixed magnitude spectrum and a time-varying gain, as seen
in Figs. 9.1 and 9.2, for example. Sustained instruments in particular tend to
have a stationary spectrum after the attack transient.
In most systems aimed at the separation of sound sources, DFT and a
fixed window size is applied, but the estimation algorithms allow the use
of any time-frequency representation. For example, a logarithmic spacing of
frequency bins has been used [58], which is perceptually and musically more
plausible than a constant spectral resolution.
The linear summation of time-domain signals does not imply the linear
summation of their magnitude or power spectra, since phases of the source
signals affect the result. When two signals sum in the time domain, their
complex-valued DFTs sum linearly, X(k) = Y_1(k) + Y_2(k), but this equality does not apply for the magnitude or power spectra. However, provided that the phases of Y_1(k) and Y_2(k) are uniformly distributed and independent of each other, we can write

    E{|X(k)|²} = |Y_1(k)|² + |Y_2(k)|²,    (9.5)

where E{·} denotes expectation. This means that in the expectation sense,
we can approximate time-domain summation in the power spectral domain, a
result which holds for more than two sources as well. Even though magnitude
spectrogram representation has been widely used and it often produces good
results, it does not have similar theoretical justification. Since the summation
is not exact, use of phaseless basis functions causes an additional source of
error. Also, a phase generation method has to be implemented if the sources
are to be synthesized separately. These are discussed in Section 9.7.3.
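The expectation argument can be checked numerically, as in the sketch below: with independent random phases, summed power spectra track the power spectrum of the summed signal closely, whereas summed magnitude spectra overshoot it whenever the spectra overlap. The test signals and trial count are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials, t = 1024, 2000, np.arange(1024)
    pow_sum = pow_parts = mag_sum = mag_parts = 0.0
    for _ in range(trials):
        y1 = np.sin(2 * np.pi * 0.0311 * t + rng.uniform(0, 2 * np.pi))
        y2 = np.sin(2 * np.pi * 0.0323 * t + rng.uniform(0, 2 * np.pi))
        X, Y1, Y2 = np.fft.rfft(y1 + y2), np.fft.rfft(y1), np.fft.rfft(y2)
        pow_sum += np.sum(np.abs(X) ** 2)
        pow_parts += np.sum(np.abs(Y1) ** 2 + np.abs(Y2) ** 2)
        mag_sum += np.sum(np.abs(X))
        mag_parts += np.sum(np.abs(Y1) + np.abs(Y2))
    print(pow_sum / pow_parts)   # close to 1: powers add in expectation
    print(mag_sum / mag_parts)   # typically below 1: magnitudes do not simply add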
The human auditory system has a large dynamic range: the difference
between the threshold of hearing and the threshold of pain is approximately
100 dB [550]. Unsupervised learning algorithms tend to be more sensitive to
high-energy observations. If sources are estimated from the power spectrum,
some methods fail to separate low-energy sources even though they would be
perceptually and musically meaningful. This problem has been noticed, e.g.,
by FitzGerald in the case of percussive source separation [186, pp. 93-100].
To overcome the problem, he used an algorithm which processed separately
high-frequency bands which contain low-energy sources, such as hi-hats and
cymbals [187]. Vincent and Rodet [648] addressed the same problem. They
proposed a model in which the noise was additive in the log-spectral domain.
The numerical range of a logarithmic spectrum is compressed, which increases
the sensitivity to low-energy sources. Additive noise in the log-spectral domain
corresponds to multiplicative noise in power spectral domain, which was also
assumed in the system proposed by Abdallah and Plumbley [5]. Virtanen
proposed the use of perceptually motivated weights [651]. He used a weighted
cost function in which the observations were weighted so that the quantitative
significance of the signal within each critical band was equal to its contribution
to the total loudness.
In the standard ICA model, the observation vector x is expressed as a linear mixture of the source vector g,

    x = Bg.    (9.6)
The standard ICA requires that the number of observed variables K (the number of sensors) be equal to the number of sources N. In practice, the number of sensors can also be larger than the number of sources, because the variables are typically decorrelated using principal component analysis (PCA; see
Chapter 2), and if the desired number of sources is less than the number of
variables, only the principal components corresponding to the largest eigen-
values are selected.
As another pre-processing step, the observed variables are usually centred by subtracting the mean, and their variance is normalized to unity. The centred and whitened data observation vector x̃ is obtained from the original observation vector x by

    x̃ = V(x − μ),    (9.7)

where μ is the empirical mean of the observation vector, and V is a whitening matrix, which is often obtained from the eigenvalue decomposition of the empirical covariance matrix of the observations [304]. The empirical mean and covariance matrix are explained in Chapter 2.
To simplify the notation, it is assumed that the data x in (9.6) is already
centred and decorrelated, so that K = N. The core ICA algorithm carries
out the estimation of an unmixing matrix W ≈ B⁻¹, assuming that B is invertible. Independent components are obtained by multiplying the whitened observations by the estimate of the unmixing matrix, to result in the source vector estimate ĝ:

    ĝ = Wx.    (9.8)
The matrix W is estimated so that the output variables, i.e., the elements
of g, become maximally independent. There are several criteria and algo-
rithms for achieving this. The criteria, such as non-Gaussianity and mutual
information, are usually measured using high-order cumulants such as kurto-
sis, or expectations of other non-quadratic functions [304]. ICA can be also
viewed as an extension of PCA. The basic PCA decorrelates variables so that
they are independent up to second-order statistics. It can be shown that if
the variables are uncorrelated after taking a suitable non-linear function, the
higher-order statistics of the original variables are independent, too. Thus,
ICA can be viewed as a non-linear decorrelation method.
Compared with the previously presented linear model (9.3), the standard
ICA model (9.6) is exact, i.e., it does not contain the residual term. Some
special techniques can be used in the case of the noisy signal model (9.3),
but often noise is just considered as an additional source variable. Because of the dimension reduction with PCA, Bg gives an exact model for the PCA-transformed observations but not necessarily for the original ones.
There are several ICA algorithms, and some implementations are freely
available, such as FastICA [302], [182] and JADE [65]. Computationally quite
efficient separation algorithms can be implemented based on FastICA, for
example.
The above steps are explained in more detail below. Depending on the appli-
cation, not all of them may be necessary. For example, prior information can
be used to set the number of components in Step 2.
The basic ICA is not directly suitable for the separation of one-channel
signals, since the number of sensors has to be larger than or equal to the
number of sources. Short-time signal processing can be used in an attempt
to overcome this limitation. Taking a frequency transform such as DFT, each
frequency bin can be considered as a sensor which produces an observation in
each frame. With the standard linear ICA model (9.6), the signal is modelled
as a sum of components, each of which has a static spectrum (or some other
basis function) and a time-varying gain.
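The sketch below illustrates this idea using scikit-learn's FastICA: each frequency bin of a magnitude spectrogram is treated as an observed variable, each frame as an observation, and the estimated independent components correspond to time-varying gains with associated spectral basis functions (the columns of the mixing matrix). The frame length, hop size, and component count are illustrative choices, not values from the text.

    import numpy as np
    from sklearn.decomposition import FastICA

    def ica_spectrogram(x, n_components=8, frame=1024, hop=512):
        window = np.hanning(frame)
        spectra = [np.abs(np.fft.rfft(window * x[i:i + frame]))
                   for i in range(0, len(x) - frame, hop)]
        X = np.array(spectra)            # (frames x bins): one observation per frame
        ica = FastICA(n_components=n_components, max_iter=1000)
        gains = ica.fit_transform(X)     # (frames x components): time-varying gains
        basis = ica.mixing_              # (bins x components): spectral basis functions
        return basis, gains.T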
The spectrogram factorization has its motivation in invariant feature ex-
traction, which is a technique proposed by Kohonen [356]. The short-time
spectrum can be viewed as a set of features calculated from the input signal.
As discussed in Section 9.2.2, it is often desirable to have shift-invariant basis
^Singular value decomposition can also be used to estimate the number of com-
ponents [73].
When magnitude or power spectrograms are used, the basis functions are
magnitude or power spectra which are non-negative by definition. Therefore,
^ICA aims at maximizing the independence of the output variables, but it cannot
guarantee their complete independence, as this depends also on the input signal.
In sparse coding, each gain is assigned a prior that favours values near zero,

    p([G]_{n,t}) ∝ exp(−f([G]_{n,t})),    (9.11)

and the gains are modelled as mutually independent, so that

    p(G) = Π_{n,t} p([G]_{n,t}) ∝ Π_{n,t} exp(−f([G]_{n,t})).    (9.12)
It is obvious that in practice the gains are not independent of each other,
but this approximation is done to simplify the calculations. From the above
definitions we get
    max_{B,G} p(B, G | X) ∝ max_{B,G} Π_{k,t} (1/(σ√(2π))) exp( −([X]_{k,t} − [BG]_{k,t})² / (2σ²) ) · p(G).    (9.13)
Taking the negative logarithm of (9.13), the estimation can be carried out by minimizing the cost function

    c(B, G) = (1/(2σ²)) ||X − BG||²_F + Σ_{n,t} f([G]_{n,t}),    (9.15)

where the Frobenius norm is defined as

    ||Y||_F = ( Σ_{i,j} [Y]²_{i,j} )^{1/2}.    (9.16)
where the parameters μ and a control the relative mass of the central peak in the prior, and the term μ(1 − a) is used to make the function continuous at x = μ. All these functions give a smaller cost and a higher prior probability for gains near zero. The cost function f(x) = |x| and the corresponding Laplacian prior p(x) = ½ exp(−|x|) are illustrated in Fig. 9.3. Systematic large-scale
evaluations of different sparse priors in audio signals have not been carried
out. Naturally, the distributions depend on source signals, and also on the
data representation.
Fig. 9.3. The cost function f(x) = |x| (left) and the corresponding Laplacian prior distribution p(x) = ½ exp(−|x|) (right). Values of G near zero are given a smaller cost and a higher probability.
From (9.15) and the above definitions of f, it can be seen that a sparse representation is obtained by minimizing a cost function which is the weighted sum of the reconstruction error term ||X − BG||²_F and the term which incurs a penalty on non-zero elements of G. The variance σ² is used to balance between these two. The objective (9.15) can be viewed as a penalized likelihood,
discussed in the Tools section (see Sections 2.2.9 and 2.3.3).
Typically, / increases monotonically as a function of the absolute value of
its argument. The presented objective requires that the scale of either the basis
functions or the gains is somehow fixed. Otherwise, the second term in (9.15)
could be minimized without affecting the first term by setting B ← Bθ and G ← G/θ, where the scalar θ → ∞. The scale of the basis functions can be fixed for example with an additional constraint ||b_n|| = 1, as done by
Hoyer [299], or the variance of the gains can be fixed.
The minimization problem (9.15) is usually solved using iterative algo-
rithms. If both B and G are unknown, the cost function may have several local
minima, and in practice reaching the global optimum in a limited time cannot
be guaranteed. Standard optimization techniques based on steepest descent,
covariant gradient, quasi-Newton, and active-set methods can be used. Differ-
ent algorithms and objectives are discussed for example by Kreutz-Delgado
et al. [373].
If B is fixed, more efficient optimization algorithms can be used. This
can be the case for example when B is learned in advance from training
material where sounds are presented in isolation. These methods are discussed
in Section 9.6.
No methods have been proposed for estimating the number of sparse com-
ponents in a monaural audio signal. Therefore, N has to be set either manu-
ally, using some prior information, or to a value which is clearly larger than
the expected number of sources. It is also possible to try different numbers of
components and to determine a suitable value of N from the outcome of the
trials.
As discussed in the previous section, non-negativity restrictions can be
used for frequency-domain basis functions. With a sparse prior and non-
negativity restrictions, one has to use the projected steepest descent algo-
rithms which are discussed, e.g., by Bertsekas in [35, pp. 203-224]. Hoyer
[299], [300] proposed a non-negative sparse coding algorithm by combining
NMF and sparse coding. His algorithm used a multiplicative rule to update
B, and projected steepest descent to update G. Projected steepest descent
alone is computationally inefficient compared to multiplicative update rules,
for example.
In musical signal analysis, sparse coding has been used for example
by Abdallah and Plumbley [4], [5] to produce an approximate piano-roll
transcription of synthesized harpsichord music and by Virtanen [650] to tran-
scribe drums in polyphonic music signals synthesized from MIDI. Also, Blu-
mensath and Davies used a sparse prior for the gains, even though their system
was based on a different signal model [43]. The framework also enables the use
d_euc(B, G) = ||X − BG||²_F,   (9.18)

and

d_div(B, G) = Σ_{k,t} D([X]_{k,t}, [BG]_{k,t}),   (9.19)

where

D(p, q) = p log(p/q) − p + q.   (9.20)

The update rules for the Euclidean cost are given as

B ← B .× (XG^T) ./ (BGG^T)   (9.21)

and

G ← G .× (B^T X) ./ (B^T BG),   (9.22)

where .× and ./ denote the element-wise multiplication and division, respectively. The update rules for the divergence are given as

B ← B .× ((X ./ (BG)) G^T) ./ (1G^T)   (9.23)

and

G ← G .× (B^T (X ./ (BG))) ./ (B^T 1),   (9.24)

where 1 is an all-ones matrix of the same size as X.
1. Initialize each entry of B and G with the absolute values of Gaussian noise.
2. Update G using either (9.22) or (9.24) depending on the chosen cost function.
3. Update B using either (9.21) or (9.23) depending on the chosen cost function.
4. Repeat Steps (2)-(3) until the values converge.
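The iteration above translates almost directly into code. The following sketch (NumPy; the small constant eps used to avoid division by zero is an implementation detail not spelled out in the text) factorizes a non-negative matrix X using the divergence cost, i.e., the updates (9.24) and (9.23).

import numpy as np

def nmf_divergence(X, n_components, n_iter=200, eps=1e-12, seed=0):
    # NMF of a non-negative (frequency x time) matrix X so that X is approximated by BG.
    rng = np.random.default_rng(seed)
    K, T = X.shape
    # Step 1: initialize with the absolute values of Gaussian noise
    B = np.abs(rng.normal(size=(K, n_components)))
    G = np.abs(rng.normal(size=(n_components, T)))
    ones = np.ones_like(X)
    for _ in range(n_iter):
        G *= (B.T @ (X / (B @ G + eps))) / (B.T @ ones + eps)   # Step 2, cf. (9.24)
        B *= ((X / (B @ G + eps)) @ G.T) / (ones @ G.T + eps)   # Step 3, cf. (9.23)
    return B, G

For a magnitude spectrogram, B then contains the basis spectra and G the corresponding time-varying gains.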
Methods for the estimation of the number of components have not been
proposed, but all the methods suggested in Section 9.4 are applicable in NMF,
too. The multiplicative update rules have proven to be more efficient than for
example the projected steepest-descent algorithms [400], [299], [5].
NMF can be used only for a non-negative observation matrix and therefore
it is not suitable for the separation of time-domain signals. However, when
used with the magnitude or power spectrogram, the basic NMF can be used
to separate components without prior information other than the element-
wise non-negativity. In particular, factorization of the magnitude spectrogram
using the divergence often produces relatively good results. The divergence
cost of an individual observation [X]_{k,t} is linear as a function of the scale of the input, since D(ap, aq) = aD(p, q) for any positive scalar a, whereas for
the Euclidean cost the dependence is quadratic. Therefore, the divergence is
more sensitive to small-energy observations.
NMF does not explicitly aim at components which are statistically in-
dependent from each other. However, it has been proved that under certain
conditions, the non-negativity restrictions are theoretically sufficient for sep-
arating statistically independent sources [525]. It has not been investigated
whether musical signals fulfill these conditions, and whether NMF implement
Several methods have been proposed for training the basis functions in
advance. The most straightforward choice is to also separate the training
signal using some of the described methods. For example, Jang and Lee [314]
used ISA to train basis functions for two sources separately. Benaroya et al.
[32] suggested the use of non-negative sparse coding, but they also tested using
the spectra of random frames of the training signal as the basis functions or
grouping similar frames to obtain the basis functions. They reported that
non-negative sparse coding and the grouping algorithm produced the best
results [32]. Gautama and Van Hulle compared three different self-organizing
methods in the training of basis functions [204].
The training can be done in a more supervised manner by using a sepa-
rate set of training samples for each basis function. For example in the drum
transcription systems proposed by FitzGerald et al. [188] and Paulus and
Virtanen [505], the basis function for each drum instrument was calculated
from isolated samples of each drum. It is also possible to generate the basis
functions manually, for example so that each of them corresponds to a single
pitch. Lepain used frequency-domain harmonic combs as the basis functions,
and parameterized the rough shape of the spectrum using a slope parameter
[403]. Sha and Saul trained the basis function for each discrete fundamental
frequency using a speech database with annotated pitch [579].
In practice, it is difficult to train basis functions for all the possible sources
beforehand. An alternative is to use trained or generated basis functions which
are then adapted to the observed data. For example, Abdallah and Plumbley
initialized their non-negative sparse coding algorithm with basis functions that
consisted of harmonic spectra with a quarter-tone pitch spacing [5]. After the
initialization, the algorithm was allowed to adapt these.
Once the basis functions have been trained, the observed input signal is
represented using them. Sparse coding and non-negative matrix factorization
techniques are feasible also in this task. Usually the reconstruction error be-
tween the input signal and the model is minimized while using a small number
of active basis functions (sparseness constraint). For example, Benaroya et al.
proposed an algorithm which minimizes the energy of the reconstruction error
while restricting the gains to be non-negative and sparse [32].
If the sparseness criterion is not used, a matrix G reaching the global
minimum of the reconstruction error can be usually found rather easily. If the
gains are allowed to have negative values and the estimation criterion is the
energy of the residual, the standard least-squares solution

G = (B^T B)^{−1} B^T X   (9.26)

produces the optimal gains (assuming that the previously trained basis functions are linearly independent) [339, pp. 220-226]. If the gains are restricted to non-negative values, the least-squares solution is obtained using the non-negative least-squares algorithm [397, p. 161]. When the basis functions, observations, and gains are restricted to non-negative values, the global minimum of the divergence (9.19) between the observations and the model can be found, since the cost is then convex with respect to the gains.
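When the basis functions are fixed, estimating the gains thus reduces to a column-wise regression problem. The sketch below (using NumPy and scipy.optimize.nnls; the variable names are illustrative) computes the unconstrained least-squares gains of (9.26) and, alternatively, non-negative gains frame by frame.

import numpy as np
from scipy.optimize import nnls

def gains_least_squares(B, X):
    # Unconstrained least-squares gains, G = (B^T B)^{-1} B^T X, cf. (9.26).
    return np.linalg.lstsq(B, X, rcond=None)[0]

def gains_nnls(B, X):
    # Non-negative least-squares gains, one frame (column of X) at a time.
    G = np.zeros((B.shape[1], X.shape[1]))
    for t in range(X.shape[1]):
        G[:, t], _ = nnls(B, X[:, t])
    return G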
If the basis functions are estimated from a mixture signal, we do not know
which component is produced by which source. Since each source is modelled
as a sum of one or more components, we need to associate the components
to sources. There are roughly two ways to do this. In the unsupervised classi-
fication framework, component clusters are formed based on some similarity
measure, and these are interpreted as sources. Alternatively, if prior informa-
tion about the sources is available, the components can be classified to sources
based on their distance to source models. Naturally, if pre-trained basis func-
tions are used for each source, the source of each basis function is known and
classification is not needed.
Pairwise dependence between the components can be used as a similarity
measure for clustering. Even in the case of ICA, which aims at maximizing the
independence of the components, some dependencies may remain because it
is possible that the input signal contains fewer independent components than
are to be separated.
There are several different possibilities for the estimation of the funda-
mental frequency of a pitched component. For example, prominent peaks can
be located from the spectrum and the two-way mismatch procedure of Maher
and Beauchamp [428] can be used, or the fundamental period can be esti-
mated from the autocorrelation function which is obtained by inverse Fourier
transforming the power spectrum. In our experiments, the enhanced auto-
correlation function proposed by Tolonen and Karjalainen [627] was found to
produce good results (see p. 253 in this volume). In practice, a component may
represent more than one pitch. This happens especially when the pitches are
always present simultaneously, as is the case in a chord, for example. No meth-
ods have been proposed to detect this situation. Whether or not a component
is pitched can be estimated, e.g., from features based on the component [634],
[282].
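As one possible realization of the autocorrelation approach mentioned above (a rough sketch, not the enhanced autocorrelation of Tolonen and Karjalainen), the fundamental frequency of a pitched component can be estimated from its basis function by inverse Fourier transforming the power spectrum and picking the strongest peak within an allowed lag range.

import numpy as np

def f0_from_basis_spectrum(b, fs, fmin=50.0, fmax=1000.0):
    # b  : magnitude-spectrum basis function over the positive frequencies (n_fft//2 + 1 bins)
    # fs : sampling rate of the analysed signal
    autocorr = np.fft.irfft(b.astype(float) ** 2)     # autocorrelation = IDFT of the power spectrum
    lag_min = int(fs / fmax)                          # shortest allowed fundamental period
    lag_max = int(fs / fmin)                          # longest allowed fundamental period
    lag = lag_min + np.argmax(autocorr[lag_min:lag_max])
    return fs / lag                                   # fundamental frequency estimate in Hz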
Some systems use fixed basis functions which correspond to certain funda-
mental frequency values [403], [579]. In this case, the fundamental frequency
of each basis function is of course known.
9.7.3 Synthesis
minimized in the least-squares sense. The method can produce good synthesis
quality especially for slowly varying sources with deterministic phase behav-
iour. The least-squares criterion, however, gives less importance to low-energy
partials and often leads to a degraded high-frequency content. The phase gen-
eration problem has been recently addressed by Achan et al., who proposed
a phase generation method based on a pre-trained autoregressive model [9].
As mentioned above, the linear model (9.1) is efficient in the analysis of music
signals since many musically meaningful entities can be rather well approxi-
mated with a fixed spectrum and a time-varying gain. However, the representation of sources with strongly time-varying spectra requires several components,
and each fundamental frequency value produced by a pitched instrument has
to be represented with a different component. Instead of using multiple com-
ponents per source, more complex models can be constructed which allow
either a time-varying spectrum or a time-varying fundamental frequency for
each component. These are discussed in the following two subsections.
[Figure omitted: frequency (0-5000) versus time (seconds).]
g_{n,t} ← g_{n,t} .× [b_n ⋆ (x_t ./ v_t)] ./ [b_n ⋆ 1],   (9.30)

where 1 is a K-length vector of ones and ⋆ denotes the correlation of vectors, defined for real-valued vectors b_n and y as (b_n ⋆ y)_z = Σ_{k=1}^{K} b_{n,k} y_{k+z}, z = 0, ..., Z. The update rule for the basis functions is given as

b_{n,k} ← b_{n,k} [Σ_t Σ_{z=0}^{Z} g_{n,t}(z) (x_t ./ v_t)_{k+z}] / [Σ_t Σ_{z=0}^{Z} g_{n,t}(z)].   (9.31)
1. Initialize each g_{n,t} and b_n with the absolute values of Gaussian noise.
2. Calculate v_t = Σ_{n=1}^{N} b_n * g_{n,t} for each t = 1, ..., T.
3. Update each g_{n,t} using (9.30).
4. Calculate v_t as in Step 2.
5. Update each b_n using (9.31).
6. Repeat Steps (2)-(5) until the values converge.
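To make the signal model in Step 2 concrete, the following sketch (NumPy; the array shapes are assumptions) computes the model spectrum v_t as the sum over components of each basis function convolved with its gain vector over fundamental frequency shifts.

import numpy as np

def model_spectrum(B, G, t):
    # B: (N, K) basis functions on a log-frequency axis
    # G: (N, T, Z + 1) gains, one value per component, frame, and shift
    # Returns v_t of length K + Z, i.e., v_t = sum_n b_n * g_{n,t} with * a convolution.
    N, K = B.shape
    Z = G.shape[2] - 1
    v = np.zeros(K + Z)
    for n in range(N):
        v += np.convolve(G[n, t, :], B[n, :])   # full convolution, length K + Z
    return v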
Fig. 9.5. Illustration of the time-varying gains (left) and the basis function (right)
of a component that was estimated from the example signal in Fig. 9.1 containing
a diatonic scale and a C major chord. On the left, the intensity of the image rep-
resents the value of the gain at each fundamental frequency shift and frame index.
Here the fundamental frequencies of the notes can be seen more clearly than from
the spectrogram of Fig. 9.1. The parameters were estimated using the algorithm
proposed in this section.
results as those illustrated in Fig. 9.5. The model allows all the fundamental frequencies within the range z = 0, ..., Z to be active simultaneously; thus, it is not restrictive enough. For example, the algorithm may model a non-harmonic
drum spectrum by using a harmonic basis function shifted to multiple adjacent
fundamental frequencies. Ideally, this could be solved by restricting the gains
to be sparse, but the sparseness criterion complicates the optimization.
In principle, it is possible to combine time-varying spectra and time-
varying fundamental frequencies into the same model, but this further in-
creases the number of free parameters so that it can be difficult to obtain
good separation results.
When shifting the harmonic structure of the spectrum, the formant struc-
ture becomes shifted, too. Therefore, representing time-varying pitch by trans-
lating the basis function is appropriate only for nearby pitch values. It is
unlikely that the whole fundamental frequency range of an instrument could
be modelled by shifting a single basis function.
where s(t) is a reference signal of the source before mixing, and ŝ(t) is the separated signal. In the separation of music signals, Jang and Lee [314] reported
average SDR of 9.6 dB for an ISA-based algorithm which trains basis func-
tions separately for each source. Helen and Virtanen [282] reported average
SDR of 6.4 dB for NMF in the separation of drums and polyphonic harmonic
track, and a clearly lower performance (SDR below 0 dB) for ISA.
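A minimal sketch, assuming the standard definition SDR = 10 log10( Σ_t s(t)² / Σ_t (s(t) − ŝ(t))² ); the cited studies may use a slightly different normalization.

import numpy as np

def sdr(s, s_hat, eps=1e-12):
    # Signal-to-distortion ratio in dB between a reference s and an estimate s_hat.
    return 10.0 * np.log10(np.sum(s ** 2) / (np.sum((s - s_hat) ** 2) + eps))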
In practice, quantitative evaluation of the separation quality requires that
reference signals, i.e., the original signals s{t) before mixing, be available. In
the case of real-world music signals, it is difficult to obtain the tracks of each
individual source instrument and, therefore, synthesized material is often used.
Generating test signals for this purpose is not a trivial task. For example, ma-
terial generated using a software synthesizer may produce misleading results
for algorithms which learn structures from the data, since many synthesiz-
ers produces notes which are identical at each repetition. In the case that
source separation is a part of a music transcription system, quality evaluation
requires that audio signals with an accurate reference notation are available
(see Chapter 11, p. 355). Large-scale comparisons of different separation al-
gorithms for music transcription have not been made.
The algorithms presented in this chapter show that rather simple principles
can be used to learn and separate sources from music signals in an unsuper-
vised manner. Individual musical sounds can usually be modelled quite well
using a fixed spectrum with time-varying gain, which enables the use of ICA,
sparse coding, and NMF algorithms for their separation. Actually, all the al-
gorithms based on the linear model (9.4) can be viewed as performing matrix
factorization; the factorization criteria are just different.
The simplicity of the additive model makes it relatively easy to extend
and modify it, along with the presented algorithms. However, a challenge
with the presented methods is that it is difficult to incorporate some types of
restrictions for the sources. For example, it is difficult to restrict the sources
to be harmonic if they are learned from the mixture signal.
Compared to other approaches towards monaural sound source separa-
tion, the unsupervised methods discussed in this chapter enable a relatively
good separation quality, although it should be noted that the performance
in general is still very limited. A strength of the presented methods is their
scalability: the methods can be used for arbitrarily complex material. In the
case of simple monophonic signals, they can be used to separate individual
notes, and in complex polyphonic material, the algorithms can extract larger
repeating entities, such as chords. Some of the algorithms, for example NMF
using the magnitude spectrogram representation, are quite easy to imple-
ment. The computational complexity of the presented methods may restrict
their applicability if the number of components is large or the target signal is
long.
Large-scale evaluations of the described algorithms on real-world poly-
phonic music recordings have not been presented. Most published results use
a small set of test material and the results are not comparable with each
other. Although conclusive evaluation data are not available, a preliminary
experience from our simulations has been that NMF (or sparse coding with
non-negativity restrictions) often produces better results than ISA. It was
also noticed that prior information about sources can improve the separation
quality significantly. Incorporating higher-level models into the optimization
algorithms is a big challenge, but will presumably lead to better results. Con-
trary to the general view held by most researchers less than 10 years ago,
unsupervised learning has proven to be applicable for the analysis of real-
world music signals, and the area is still developing rapidly.
Part IV

10 Auditory Scene Analysis in Music Signals

Kunio Kashino
10.1 Introduction
This chapter discusses work done in the area of music scene analysis (MSA).
Generally, scene analysis is viewed as the transformation of information from
a sensory input (physical entity) into concepts (psychological or perceptual
entities). Therefore, MSA is defined as a process that converts an audio signal
into musical concepts such as notes, chords, beats, and rhythms. Related tasks
include music transcription, pitch tracking, and beat tracking; however, this
chapter focuses on the auditory scene analysis (ASA) related aspect of this
process and does not explore the issues of pitch and beat tracking. An impor-
tant idea related to this is the distinction between physical and perceptual
sounds, as explained in Section 10.1.3 below.
We are exposed to various physical stimuli in our daily life. Our ears and eyes
receive acoustic and optical stimuli, respectively. These stimuli originate in
specific events or states. For example, when a ball hits a wall, vibrations in
both the ball and the wall resulting from the impact travel through the air,
and the air vibration arrives at our ears. We then understand that something
like a ball has hit a hard surface such as a wall.
Understanding physical stimuli is an everyday experience. However, it
poses an important question. An event such as a ball hitting a wall and causing
physical phenomena such as air vibration is a natural process. However, how
can we determine the events from the received physical phenomena? This is
an inversion of the natural process, and therefore the solution to this problem
is non-trivial.
Generally, a task that consists of recognizing an event or status from phys-
ical stimuli is called scene analysis, and scene analysis problems were first
investigated in the visual domain.
Music is a good domain for considering the auditory scene analysis prob-
lem not only from a cognitive perspective [36] but also from an engineering
viewpoint. We use the term 'music scene analysis' to refer to auditory scene
analysis for music.
^The cocktail party problem refers to the task of following the discussion of one's
neighbours in a situation where lots of other sound sources are present, too.
The nature of music provides the first reason for considering it as an audi-
tory scene. The overlapping of tones is a fundamental element of music. With
the exception of a solo performance by a single instrument or voice, music
usually comprises multiple simultaneous sounds played by single or multiple
musical instruments. As humans we can appreciate such sound mixtures, im-
plying that we recognize what is happening to a certain extent. Only a trained
person is capable of transcribing music, but ordinary people can recognize a
vocal line or the principal accompaniment when they listen to pop music.
When we hear music played by a flute and a piano, it is easy to distinguish
the two instruments.
The second reason is the usefulness of the prospective applications. Cur-
rently, computers are not as good as humans at recognizing multiple simul-
taneous sounds. However, if computers are developed with this capability,
various useful systems will be realized including automatic music transcrip-
tion systems and automatic music indexing systems for unlabelled music
archives.
A research topic closely related to music scene analysis is automatic music
transcription [610] and pitch tracking. As already introduced in Chapter 1,
an automatic music transcription system was reported as long ago as the mid
1970s [521]. From then until the mid 1980s, several systems were built that
mainly targeted the transcription of monophonic melodies such as singing, or
simple polyphonic music such as guitar duets [483], [542]. The main method-
ology employed in such work involved signal processing techniques such as the
fast Fourier transform (FFT). This period can be considered as the pioneering
era of music transcription. Although the systems targeted rather simple com-
positions, various problems were identified such as frequency and temporal
fluctuations.
From the mid 1980s to the mid 1990s, the main target moved from mono-
phonic music to rather complicated polyphonic music, such as piano composi-
tions. With such signals, even determining the number of simultaneous notes
is a hard task let alone extracting fundamental frequencies for each note. To
overcome this problem, researchers pointed out that knowledge is required
for such transcription [470], [81], [80]. The main methodology consisted of in-
tegrating symbolic knowledge and signal processing. For example, Katayose
et al. built a rule-based automatic music transcription system for multiple
simultaneous-note performances by a single instrument [335]. The system com-
prises a control module, a processing module, and a music analysis module.
The control module is an inference engine performing rule-based reasoning
and invoking the processing module that extracts fundamental frequencies
and beat times. The music analysis module analyses musical characteristics
such as melody, rhythm, chord transitions, and keys, and then its results are
fed back to the control and processing modules. This type of approach to some
degree parallels the methodology of visual scene analysis in use at that time
as mentioned above. This period can be viewed as the system-oriented era of
music transcription.
At that time, in the artificial intelligence field, it was recognized that there was a bottleneck as regards knowledge acquisition for rule-based inference engines. This means that it was difficult to prepare all the knowledge required
for these systems, and the systems tended to fail to work properly when they
faced a situation not dealt with by the installed knowledge.
Since the mid 1990s, a lot of research has targeted polyphonic music played
by multiple musical instruments in, for example, orchestras or commercial pop
music [611], [350], [353]. Music played on multiple instruments is even harder
to transcribe than music played on a single instrument, mainly because there
are fewer constraints on the instruments or because we have less prior knowl-
edge. That is, we must estimate the most plausible transcription under highly
ambiguous and uncertain situations. Researchers noticed that it is hard to
deal with such problems solely by the rule-based approach, and this require-
ment has naturally inspired researchers to apply probabilistic modelling. We
call this period the model-oriented era of music transcription.
As is widely recognized, music usually has simultaneous and temporally hi-
erarchical structures. For example, multiple simultaneous notes form a chord,
and a sequence of notes across multiple bars form a phrase. Such structures
are naturally incorporated in probabilistic models.
This is the point that distinguishes music transcription from music scene
analysis. The object of music transcription is to create a score from a musical
audio signal. On the other hand, the object of music scene analysis is to
recover hierarchical structures and describe the auditory entities encoded in
the structures from a musical audio signal. The recognition of structures is
not a prerequisite for creating a score. However, a score can be produced once
a complete music structure has been obtained. From this viewpoint, music
transcription can be a specific instance of music scene analysis.
The present author used the term music scene analysis in this sense, and
proposed a music scene analysis system based on a probabilistic model in the
mid 1990s [332]. We defined the problem as the estimation of the posterior
probability distribution given an input audio signal and a set of prior knowl-
edge encoded in internal models. Recently, other probabilistic approaches to-
ward music analysis tasks have been emerging. For example, Goto highlighted a sub-symbolic aspect in music scene analysis and specifically termed it music
scene description [228]. As discussed in Chapter 11, his descriptions of, for ex-
ample, predominant fundamental frequencies and beat times correspond to a
primal level of a music scene representation structure, which is very important
to address.
For a music scene analysis task, we should use multiple constraints or pieces
of information stored in advance. For example, in Western tonal music, a
sequence of chords does not appear at random but exhibits certain statistical
characteristics, and multiple frequency components whose frequencies have a
harmonic relationship tend to arise from 'one sound'.
Thus, the problem can be formulated as an a posteriori estimation. Let H be a set of random variables corresponding to internal states or perceptual sounds to be modelled, and let the observation be x. Then, the task is generally written as

Ĥ = argmax_H P(H|x) = argmax_H P(x|H) P(H).   (10.1)
This is a Bayesian estimation of posteriors. Since it is very hard to calculate
this in a general form, we must impose a structure on H. That is, when
some elements of H can be considered to be independent, the calculations can
take advantage of that fact. This point will be discussed later in this chapter.
As mentioned above, researchers have tried to incorporate the cues used in human auditory scene analysis into sound source separation. It is also important to
consider such psychological findings in music scene analysis tasks, because a
perceptual sound is a psychological entity rather than a physical entity.
The bottom-up process of music scene analysis can be considered to be a
clustering of frequency components. The idea of clustering based on such cues
as harmonicity and synchrony has been employed by many authors for the
bottom-up processing of auditory scene analysis methods [53], [460], [171].
To formulate the clustering, it is important to explore quantitative dis-
tance measures. There are at least two approaches to this problem: one is
Fig. 10.2. Overlapping components in a composition: x indicates the notes whose
fundamental frequencies are an integral fraction of those of the notes in the other
part. The number shows the ratio of these fundamental frequencies.
Fig. 10.3. The evaluation-integration model of sound formation.
Two simultaneous frequency components are heard as one sound when they
are harmonically related, but they tend to be heard as separate sounds when
they are mistuned [476], [277]. In the first experiment, three subjects were
presented with stimuli with various degree of mistuning in a random order
and asked to choose whether each stimulus was one sound or two sounds.
Then, a linear model that measures segregation probability was determined
by the least squares fitting of the experimental results. The probability of
segregation C_h was given by

C_h(u) = { u/p_+,   0 ≤ u < p_+,
         { u/p_−,   p_− < u < 0,
         { 1,       otherwise,   (10.3)
where f1 and f2 (f1 < f2) are the frequencies of the components.
An experimental summary is shown in Fig. 10.4. The horizontal axis u in Fig. 10.4 is given by

u = (log f2 − log(2 f1)) / log 1.005.   (10.4)
No significant difference was found for p_+ and p_−, and they are therefore not distinguished in Fig. 10.4.
Two harmonic frequency components are heard as one sound when they
start simultaneously, but they tend to be heard as separate sounds when they
are asynchronous. In the second experiment, the segregation probability was
measured as a function of onset time/gradient difference, and a linear model
was obtained by least squares fitting. The probability of segregation CQ was
given by the following equation.
C_o(S) = { S/S_p,   0 ≤ S < S_p,
         { 1,        otherwise.   (10.6)
Here, S is the area of the region surrounded by the amplitude envelopes of two
frequency components projected onto the time-amplitude plane, as shown in
Fig. 10.5. In this experiment, the amplitudes of frequency components after
the onset part were chosen to be the same. The parameter Sp is given by
S_p = a/f1 + b/g1 + c,   (10.7)
[Fig. 10.4: Measured segregation probability ('Separation', %) as a function of the amount of mistuning, together with the fitted linear approximation; the fitted parameter was p = 2.60 ± 0.53 %.]
Fig. 10.5. An onset asynchrony model.
where f1 and g1 are the frequency and the onset gradient of the earlier frequency component, respectively. The parameters a, b, and c were obtained by regression analysis, leading to the values a = 250, b = 1.11, and c = 0.317. An
example of the relation between the model and experimental results is shown
in Fig. 10.6.
The two kinds of feature evaluations are then integrated by
[Fig. 10.6: Segregation probability (%) as a function of the relative onset time (−100 to 100 ms) of the 400 Hz component for a 200 Hz (5 dB/ms) + 400 Hz (1 dB/ms) stimulus, together with the model approximation.]
A Bayesian network is a directed acyclic graph (DAG) where the nodes cor-
respond to random variables and the links between the nodes encode proba-
bilistic dependences between corresponding random variables [509], [205].
A random variable corresponds to an event to be modelled. A directed link,
represented by an arrow, shows the direction of the probabilistic dependency.
The origin of the arrow is called the parent and the end point is called the child.
Each link can encode conditional probabilities, which are the probabilities of
child events given the parent events. The word acyclic means that there is no
route from any node that returns to the original node as long as the links are
followed in the designated direction. Such a graph can be singly or multiply
connected. Singly connected means there is only one path between any two
nodes in the graph. Otherwise the graph is referred to as multiply connected.
When a graph is directed and singly connected, it is called a tree if none of
the nodes has more than one parent.
The objective of considering such a network is to calculate posteriors effi-
ciently after some of the random variables have been fixed or observed. The
absence of a link between two nodes means that there is no direct relationship
between the corresponding random variables. Even if these nodes are linked
indirectly (i.e., via other nodes), they are independent when at least one node
existing between the two is fixed. Posterior calculation on the Bayesian net-
work takes advantage of this property.
Figure 10.7 shows an example of a Bayesian network. If the network is
singly connected as in the figure, the posterior calculation is straightforward.
As an example, assume we wish to find the posterior probabilities induced at node B. Letting D_B^− represent the data contained in the tree rooted at B and D_B^+ the data contained in the rest of the network, we have

λ(B) = β ∏_k λ_k(B),   (10.15)

where β is a normalizing constant and λ_k(B) is the message obtained from the k-th child of B. For a child E with states e_i,

P(D_E^−|B) = Σ_i P(D_E^−|B, e_i) P(e_i|B)   (10.16)
           = Σ_i P(D_E^−|e_i) P(e_i|B)      (10.17)
           = Σ_i λ(e_i) P(e_i|B),           (10.18)

and therefore, using conditional probabilities of children given the parent, such as P(e_i|b_j), and (10.15), we obtain the λ's node by node. Here, P(e_i|b_j) is to be obtained from a statistical model or data that is learned by or provided to the system.

Now we consider π(B). We have

π(B) = P(B|D_B^+)                          (10.19)
     = Σ_i P(B|a_i, D_B^+) P(a_i|D_B^+),   (10.20)

where A is the parent of B with states a_i. Combining the information from above A with the messages from the other children of A gives

P(A|D_B^+) ∝ π(A) ∏_{k: E_k ≠ B} λ_k(A).   (10.23)
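As a toy illustration of this kind of posterior computation (hypothetical numbers; a single parent-child fragment rather than a full network), the π information of a parent node can be combined with the λ message from an observed child by Bayes' rule.

import numpy as np

# Hypothetical prior (pi) over two chord hypotheses at a parent node
prior = np.array([0.6, 0.4])                 # P(chord = C), P(chord = G)

# Hypothetical conditional probabilities P(note | chord) on the link to a child node
p_note_given_chord = np.array([
    [0.7, 0.2],   # P(note = C4 | C), P(note = C4 | G)
    [0.3, 0.8],   # P(note = D4 | C), P(note = D4 | G)
])

# Evidence: the child node is observed to be the note D4 (its lambda message)
lam = p_note_given_chord[1, :]

posterior = prior * lam                      # combine pi and lambda
posterior /= posterior.sum()                 # normalize
print(posterior)                             # P(chord | observed note)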
Here, we review how the Bayesian networks were applied to a music scene
analysis task. The first example is a processing model called Organized
Processing Toward Integrated Music Scene Analysis (OPTIMA) [332]. The
input of the model is assumed to be monaural music signals. The output is a
music scene description, that is, a hierarchical representation of musical events
such as frequency components, notes and chords. As shown in Fig. 10.8, the
model consists of three blocks: (A) a pre-processing block, (B) a main process-
ing block, and (C) internal models.
In the pre-processing block, frequency analysis is performed and a sound
spectrogram is obtained. Then, frequency components are extracted. An ex-
ample of the power transition of a frequency component is shown in Fig. 10.9.
With complicated spectrum patterns, it is difficult to recognize the onset and
offset times solely based on bottom-up information. Thus the system creates
several terminal point candidates for each extracted component.
Rosenthal's rhythm recognition method [547] and Desain's time quanti-
zation method [140] are used to obtain rhythm information for the precise
extraction of frequency components and recognition of the onset/offset time.
[Fig. 10.8: Overview of the OPTIMA model. Monaural music signals are pre-processed (time-frequency analysis, rhythm/beat analysis, processing window creation, frequency component extraction); the main processing block performs sound formation, clustering for source identification, and note and chord analysis (note prediction, chord group creation, chord transition prediction), supported by internal models such as tone memories; output data generation produces MIDI data, a display, and resynthesized sound.]
[Fig. 10.9: Power transition of an extracted frequency component over time, with termination point candidates marked.]
Fig. 10.10. A spectrogram of a polyphonic music excerpt and the processing win-
dows.
onset times. The processing window is utilized as a time base for the subse-
quent main processes.
When each processing window is created in the pre-processing block, it
is passed to the main processing block. The main block involves a Bayesian
network. Exactly as discussed in the previous section, the Bayesian network
has three layers: component, note, and chord levels. The chord level nodes
are connected in time as the time proceeds. Each node in the network is a
random variable that encodes multiple hypotheses. That is, the model holds
hypotheses of the external acoustic events as a probability distribution in a
hierarchical space.
The Bayesian network is actually built by multiple processing modules.
The modules are classified into two types: those for creating the nodes and
providing initial probabilities to the nodes, and those for providing condi-
tional probabilities to the links. The former are called creators, and the latter
predictors.
There are two creators in OPTIMA: a note hypothesis creator and a chord
hypothesis creator. As described above, first, frequency component hypothe-
ses and processing windows are created. Then the note hypothesis creator
generates the hypotheses by referring to perceptual rules such as harmonic
mistuning and onset asynchrony as described in the previous section. The
creator also consults timbre models for a timbre discrimination analysis to
identify the sound source of each note. A chord hypothesis creator generates
the chord hypotheses when note hypotheses are given. This creator refers to
chord naming rules.
Table 10.3. Examples of the chord-note relation knowledge. The conditional prob-
abilities P(note|chord) were obtained by statistical analysis of printed music.
[Figure: the three-layer network of chord, note, and component nodes; circles denote nodes and lines denote links.]
Consider two musical notes n_k and n_{k−1} (k denotes the order of the onset times of these notes, n_{k−1} preceding n_k). An 'impedance' measure Z(n_k, n_{k−1}) is defined in (10.24), which makes use of the time weighting

W(δt) = exp(−τ/δt),   (10.25)

where δt is the difference between the onset times of these two notes, and τ is a time constant. Unlike ordinary time windows, W becomes greater as δt
increases. This loosely corresponds to the proximity rule of auditory stream
organization as described in [49].
In this example, the following three Z factors are considered: (P1) the transition probabilities of musical intervals, (P2) the transition probabilities of timbres, and (P3) the transition probabilities of musical roles.
The first factor is the musical interval probability. In tonal music, the
musical intervals of note transitions do not appear equally often; some inter-
vals are more frequent than others. Thus the pitch transition probability in a
melody can be utilized as P1 in (10.24). The probabilities P1 were obtained
from 397 melodies extracted from 196 pop scores and 201 jazz scores, where
the total number of note transitions was 62,689. Figure 10.12 shows the es-
timated probabilities. The analysis was made only for the principal melodies
and may not be precisely valid for the other melodies such as bass lines or
parts arranged for polyphonic instruments such as the piano. For simplicity,
however, the probabilities shown in Fig. 10.12 for P1 were used for all cases.
The second factor is the timbre transition probability. It is reasonable to
suppose that a sequence of notes tends to be composed of notes that have
similar timbres. To incorporate this tendency, a distance measure was defined
between the timbres of two notes, so as to estimate the probability that two
notes a certain distance apart would appear sequentially in a musical stream.
These probabilities form P2 in (10.24).
The distance between timbres is defined as the Euclidean distance between
the timbre vectors in a timbre space. A timbre space can be spanned in several
ways. In the experiment described in [331], each axis of the space corresponds
to a musical instrument name, and a timbre vector is composed of correlation
[Fig. 10.12: Estimated note transition probability (%) as a function of the musical interval in semitones.]
[Fig. 10.13: Normalized histogram used to translate the timbre distance between successive notes into a transition probability.]
values between the input signal and each of the template signals of musical
instruments stored in advance. Then, the distances between the timbre vectors
of successive notes in a sequence are translated into probabilities using a
normalized histogram as explained in Fig. 10.13. This histogram models the
distribution of the timbre vectors for successive notes.
The third factor is musical role consistency. In ensemble music, a sequence
of notes can be regarded as carrying a musical role such as a principal melody
or a bass line. To introduce such musical semantics, the probability P3 is
introduced:
P3 = ar + b,   (10.26)
where a and b are constants, and r is the rate of the highest (or lowest) notes in
the music stream under consideration. Equation (10.26) represents a musical
heuristic that the music stream formed by the highest (lowest) notes tends to
continue to flow to the highest (lowest) note.
Fig. 10.14. A procedure for creating music stream networks. See text for details.
2. The system then evaluates the Z values of the link candidates from the selected node n_ks, and chooses the link with the minimum Z value, g1.
3. If g1 and l1 are identical, the link composes a music stream. If a music stream from n_ks has already been formed in a direction other than g1, the stream is cut; the direction of the music stream is changed to g1 (= l1).
Thus the networks are built by connecting nodes that give the locally minimum Z value.
Once the network has been built, then it can be considered as a Bayesian
network and the posterior probabilities of sound sources are calculated. An
example of the system in operation is shown in Figure 10.15. The input here
is a monaural recording of a real ensemble performance of 'Auld Lang Syne',
a Scottish folk song, arranged in three parts and performed by a violin, a flute,
and a piano. Figure 10.15 displays the recognized music streams as well as
the status of nodes for the beginning part of the song. The bars in each node
indicate the probabilities at the node (not normalized). The links between the
nodes are the extracted music streams. It is shown that each part is correctly
recognized as the music stream. The thickness of the link line corresponds to
its Z value given by (10.24); a thick line represents a link with a low Z value.
[Fig. 10.15: Recognized music streams and node probabilities for the beginning of 'Auld Lang Syne', showing the violin, flute, and piano parts.]
11 Music Scene Description

Masataka Goto

11.1 Introduction
psychology, it has been pointed out that human listeners do not perform
sound source separation: perceptual sound source segregation is differ-
ent from signal-level separation. For example, Bregman noted that 'there
is evidence that the human brain does not completely separate sounds'
[50]. The approach of developing methods for monaural or binaural sound
source separation might deal with a hard problem which is not solved by
any mechanism in this world (not solved even by the human brain).
It is possible to understand music without complete music transcription.
Music transcription, identifying the names (symbols) of musical notes, is a
difficult skill mastered only by trained musicians. As pointed out by Goto
[239], [240], [232] and Scheirer [565], untrained listeners understand music
to some extent without mentally representing audio signals as musical
scores. For example, as known from the observation that a listener who
cannot identify the name and constituent notes of a chord can nevertheless
feel the harmony and chord changes, a chord is perceived as combined
whole sounds (tone colour) without reducing it to its constituent notes
(like reductionism). Furthermore, even if it is possible to derive separated
signals and musical notes, it would still be difficult to obtain high-level
music descriptions like melody lines and chorus sections.
The music scene description approach therefore emphasizes methods that
can obtain a certain description of a music scene from sound mixtures of
various musical instruments in a musical piece. Here, it is important to discuss
what constitutes an appropriate description of music signals. Since various
levels of abstraction for the description are possible, it is necessary to consider
which level is an appropriate first step towards the ultimate description in
human brains. Goto [232], [228] proposed the following three viewpoints:
An intuitive description that can be easily obtained by untrained listeners.
A basic description that trained musicians can use as a basis for higher-
level music understanding.
A useful description facilitating the development of various practical ap-
plications.
The estimation of melody and bass lines is important because the melody
forms the core of Western music and is very influential in the identity of a
musical piece, while the bass is closely related to the tonality (see Chapter 1).
These lines are fundamental to the perception of music by both musically
trained and untrained listeners. They are also useful in various applications
such as automatic music indexing for information retrieval (e.g., search-
ing for a song by singing a melody), computer participation in live human
performances, musical performance analysis of outstanding recorded perfor-
mances, and automatic production of accompaniment tracks for karaoke using
CDs.
It is difficult to estimate the fundamental frequency (FO) of melody and
bass lines in monaural sound mixtures from CD recordings. Most previ-
ous FO estimation methods cannot be applied to this estimation because
they require that the input audio signal contain just a single-pitch sound
with aperiodic noise or that the number of simultaneous sounds be known
beforehand. The main reason FO estimation in sound mixtures is difficult
is that, in the time-frequency domain, the frequency components of one
sound often overlap the frequency components of simultaneous sounds. In
popular music, for example, part of the voice's harmonic structure is often
overlapped by harmonics (overtone partials) of the keyboard instrument
or guitar, by higher harmonics of the bass guitar, and by noisy inhar-
monic frequency components of the snare drum. A simple method for lo-
cally tracing a frequency component is therefore neither reliable nor stable.
Moreover, F0 estimation methods relying on the existence of the F0's frequency component (the frequency component corresponding to the F0) not only cannot handle the missing fundamental, but are also unreliable when the F0's frequency component is smeared by the harmonics of simultaneous
sounds.
F0 estimation of melody and bass lines in CD recordings was first achieved in 1999 by Goto [232], [222], [228]. Goto proposed a real-time method called PreFEst (Predominant-F0 Estimation method) which estimates the melody and bass lines in monaural sound mixtures. Unlike previous F0 estimation methods, PreFEst does not assume the number of sound sources, locally trace frequency components, or even rely on the existence of the F0's frequency component. PreFEst basically estimates the F0 of the most predominant harmonic structure, that is, the most predominant F0 corresponding to the melody or bass line, within an intentionally limited frequency range of the input mixture. It simultaneously takes into consideration all possibilities for the F0 and
treats the input mixture as if it contained all possible harmonic structures
with different weights (amplitudes). To enable the application of statistical
methods, the input frequency components are represented as a probability
density function (pdf), called an observed pdf. The point is that the method
regards the observed pdf as a weighted mixture of harmonic-structure tone
[Figure: overview of the PreFEst front end: the audio signal is analysed into frequency components, and bandpass filters limit the frequency regions used for the melody line and the bass line; the log-frequency axis is measured in cents.]
Fig. 11.3. Model parameters of multiple adaptive tone models p(ν|ν0, i, μ^(t)(ν0, i)).
Weighted-Mixture Model of Adaptive Tone Models
To deal with diversity of the harmonic structure, the PreFEst core can use several types of harmonic-structure tone models. The pdf of the i-th tone model for each F0 ν0 is denoted by p(ν|ν0, i, μ^(t)(ν0, i)) (see Fig. 11.3), where the model parameter μ^(t)(ν0, i) represents the shape of the tone model. The number of tone models is I_u (that is, i = 1, ..., I_u), where u denotes the melody line (u = 'melody') or the bass line (u = 'bass line'). Each tone model is defined such that its harmonic weights satisfy

Σ_{m=1}^{M_u} c^(t)(m|ν0, i) = 1.   (11.7)
In short, this tone model places a weighted Gaussian distribution at the po-
sition of each harmonic component.
The PreFEst core then considers the observed pdf p_x^(t)(ν) to have been generated from the following model p(ν|θ^(t)), which is a weighted mixture of all possible tone models p(ν|ν0, i, μ^(t)(ν0, i)):

p(ν|θ^(t)) = ∫_{Fl}^{Fh} Σ_{i=1}^{I_u} w^(t)(ν0, i) p(ν|ν0, i, μ^(t)(ν0, i)) dν0,

where Fl and Fh denote the lower and upper limits of the possible (allowable) F0 range and w^(t)(ν0, i) is the weight of a tone model p(ν|ν0, i, μ^(t)(ν0, i)) that satisfies

∫_{Fl}^{Fh} Σ_{i=1}^{I_u} w^(t)(ν0, i) dν0 = 1.   (11.12)
To use prior knowledge about F0 estimates and the tone model shapes, a prior distribution p_0u(θ^(t)) of θ^(t) is defined; it makes use of the following measure of how far the current tone model shape c^(t) is from a prior shape c_0:

D_c(μ_0(ν0, i); μ^(t)(ν0, i)) = Σ_{m=1}^{M_u} c_0(m|ν0, i) log [ c_0(m|ν0, i) / c^(t)(m|ν0, i) ].   (11.18)

[Figure: example F0 pdfs plotted against log-scale frequency (cent), roughly 2000-6000 cent.]
These prior distributions were originally introduced for the sake of analytical tractability of the expectation maximization (EM) algorithm, in order to obtain the intuitive update equations (11.25) and (11.26).
where Q(θ^(t)|θ'^(t)) is the conditional expectation of the mean log-likelihood for the maximum likelihood estimation. E_{ν0,i,m}[a|b] denotes the conditional expectation of a with respect to the hidden variables ν0, i, and m, with the probability distribution determined by condition b.
2. M-step:
Maximize Q_MAP(θ^(t)|θ'^(t)) as a function of θ^(t) to obtain an updated (improved) estimate.
After the above iterative computation of (11.25) and (11.26), the F0 pdf p_F0^(t)(ν0) can be obtained from w^(t)(ν0, i) according to (11.13). The tone model shape c^(t)(m|ν0, i), which is the relative amplitude of each harmonic component of all types of tone models p(ν|ν0, i, μ^(t)(ν0, i)), can also be obtained.
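The following much-simplified sketch (a single tone model type on a discrete F0 grid, fixed harmonic weights, and no prior distributions or model-shape adaptation; all parameter values are assumptions) illustrates how EM-style updates re-estimate the weights w(ν0), i.e., an F0 pdf, from an observed pdf defined on a log-frequency (cent) axis.

import numpy as np

def harmonic_tone_model(freq_axis_cent, f0_cent, n_harm=8, sigma=20.0, decay=0.7):
    # pdf of one harmonic-structure tone model: weighted Gaussians at the harmonic positions.
    weights = decay ** np.arange(n_harm)
    weights /= weights.sum()                           # weights sum to one, cf. (11.7)
    pdf = np.zeros_like(freq_axis_cent, dtype=float)
    for m, c in enumerate(weights, start=1):
        center = f0_cent + 1200.0 * np.log2(m)         # m-th harmonic on the cent axis
        pdf += c * np.exp(-0.5 * ((freq_axis_cent - center) / sigma) ** 2)
    return pdf / (pdf.sum() + 1e-12)

def estimate_f0_pdf(observed_pdf, freq_axis_cent, f0_grid_cent, n_iter=30):
    # EM-style re-estimation of the mixture weights w(nu0), interpreted as the F0 pdf.
    models = np.array([harmonic_tone_model(freq_axis_cent, f0) for f0 in f0_grid_cent])
    w = np.full(len(f0_grid_cent), 1.0 / len(f0_grid_cent))
    for _ in range(n_iter):
        mixture = w @ models + 1e-12                   # weighted mixture of tone models
        responsibilities = (w[:, None] * models) / mixture
        w = responsibilities @ observed_pdf            # new weights from the observed pdf
        w /= w.sum()
    return w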
Fig. 11.5. Audio-synchronized real-time graphics output for a popular music ex-
cerpt with drum sounds: (a) frequency components, (b) the corresponding melody
and bass lines estimated (final output), (c) the corresponding F0 pdf obtained when estimating the melody line, and (d) the corresponding F0 pdf obtained when es-
timating the bass line. These interlocking windows have the same vertical axis of
log-scale frequency.
11.2.4 Other Methods
While the PreFEst method resulted from pioneering research regarding melody
and bass estimation and weighted-mixture modelling for FO estimation, many
issues still need to be resolved. For example, if an application requires MIDI-
level note sequences of the melody line, the FO trajectory should be segmented
and organized into notes. Note that the PreFEst method does not deal with
the problem of detecting the absence of melody and bass lines: it simply out-
puts the predominant FO for every frame. In addition, since the melody and
bass lines are generated from a process that is statistically biased rather than random (i.e., their transitions are musically appropriate), this bias can also be incorporated into their estimation. This section introduces other recent ap-
proaches [494], [493], [435], [436], [169] that deal with these issues in describing
polyphonic audio signals.
Paiva, Mendes, and Cardoso [494], [493] proposed a method of obtaining
the melody note sequence by using a model of the human auditory system
[595] as a frequency-analysis front end and applying MIDI-level note track-
ing, segmentation, and elimination techniques. Although the techniques used
differ from the PreFEst method, the basic idea that 'the melody generally
clearly stands out of the background' is the same as the basic PreFEst con-
cept that the FO of the most predominant harmonic structure is considered the
melody. The advantage of this method is that MIDI-level note sequences of the
melody line are generated, while the output of PreFEst is a simple temporal
trajectory of the FO. The method first estimates predominant FO candidates
by using correlograms (see Chapter 8) that represent the periodicities in a
cochleagram (auditory nerve responses of an ear model). It then forms the
temporal trajectories of FO candidates: it quantizes their frequencies to the
closest MIDI note numbers and then tracks them according to their frequency
proximity, where only one-semitone transition is considered continuous. After
this tracking, FO trajectories are segmented into MIDI-level note candidates
by finding a sufficiently long trajectory having the same note number and by
dividing it at clear local minima of its amplitude envelope. Because there still
remain many inappropriate notes, it eliminates notes whose amplitude is too
low, whose duration is too short, or which have harmonically related F0s and almost the same onset and offset times. Finally, the melody note sequence is ob-
tained by selecting the most predominant notes according to heuristic rules.
Since simultaneous notes are not allowed, the method eliminates simultaneous
notes that are less dominant and not in a middle frequency range.
Marolt [435], [436] proposed a method of estimating the melody line by
representing it as a set of short vocal fragments of FO trajectories. This method
is based on the PreFEst method with some modifications: it uses the PreFEst
core to estimate predominant FO candidates, but uses a spectral modelling
synthesis (SMS) front end that performs the sinusoidal modelling and analysis
(see Chapters 1 and 3) instead of the PreFEst front end. The advantage of
this method is that the FO candidates are tracked and grouped into melodic
fragments (reasonably segmented signal regions that exhibit strong and stable
FO) and these fragments are then clustered into the melody line. The method
first tracks temporal trajectories of the FO candidates (salient peaks) to form
the melodic fragments by using a salient peak tracking approach similar to
the PreFEst back end (though it does not use multiple agents). Because the
fragments belong to not only the melody (lead vocal), but also to different
parts of the accompaniment, they are clustered to find the melody cluster by
using Gaussian mixture models (GMMs) according to their five properties:
Dominance (average weight of a tone model estimated by the EM algo-
rithm),
Pitch (centroid of the FOs within the fragment),
Loudness (average loudness of harmonics belonging to the fragment),
Pitch stability (average change of FOs during the fragment), and
Onset steepness (steepness of overall loudness change during the first 50
ms of the fragment).
Eggink and Brown [169] proposed a method of estimating the melody
line with the emphasis on using various knowledge sources, such as knowl-
edge about instrument pitch ranges and interval transitions, to choose the
most likely succession of FOs as the melody line. Unlike other methods, this
method is specialized for a classical sonata or concerto, where a solo melody
instrument can span the whole pitch range, ranging from the low tones of a
cello to a high-pitched flute, so the frequency range limitation used in the
PreFEst method is not feasible. In addition, because the solo instrument does
not always have the most predominant FO, additional knowledge sources are
necessary to extract the melody line. The main advantage of this method is the
leverage provided by knowledge sources, including local knowledge about an
instrument recognition module and temporal knowledge about tone durations
and interval transitions, which are integrated in a probabilistic search. Those
sources can both help to choose the correct FO among multiple concurrent FO
candidates and to determine sections where the solo instrument is actually
present. The knowledge sources consist of two categories, local knowledge and
temporal knowledge. The local knowledge concerning FO candidates obtained
by picking peaks in the spectrum includes
FO strength (the stronger the spectral peak, the higher its likelihood of
being the melody),
Instrument-dependent F0 likelihood (the likelihood values of an F0 candidate in terms of its frequency and the pitch range of each solo instrument,
which are evaluated by counting the frequency of its FO occurrence in
different standard MIDI files), and
Instrument likelihood (the likelihood values of an FO candidate being pro-
duced by each solo instrument, which are evaluated by the instrument
recognition module).
Muraoka [235], [220], [221]. The estimation of the hierarchical beat structure,
especially the measure (bar line) level, requires the use of musical knowledge
about drum patterns and chord changes; on the other hand, drum patterns and
chord changes are difficult to estimate without referring to the beat structure
of the beat level (quarter note level). The system addresses this issue by lever-
aging the integration of top-down and bottom-up processes (Fig. 11.6) under
the assumption that the time signature of an input song is 4/4. The system
first obtains multiple possible hypotheses of provisional beat times (quarter-
note-level beat structure) on the basis of onset times without using musical
knowledge about drum patterns and chord changes. Because the onset times
of the sounds of bass drum and snare drum can be detected by a bottom-up
frequency analysis described in Section 5.2.3, p. 137, the system makes use
of the provisional beat times as top-down information to form the detected
onset times into drum patterns whose grid is aligned with the beat times. The
system also makes use of the provisional beat times to detect chord changes
in a frequency spectrum without identifying musical notes or chords by name.
The frequency spectrum is sliced into strips at the beat times and the domi-
nant frequencies of each strip are estimated by using a histogram of frequency
components in the strip [240]. Chords are considered to be changed when the
dominant frequencies change between adjacent strips. After the drum patterns
and chord changes are obtained, the higher-level beat structure, such as the
measure level, can be estimated by using musical knowledge regarding them.
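A rough sketch of this chord-change detection idea, assuming a magnitude spectrogram and provisional beat times (in frames) are available; the fixed number of dominant bins and the 0.5 overlap threshold are illustrative placeholders:

```python
import numpy as np

def detect_chord_changes(spectrogram, beat_frames, n_dominant=8, min_overlap=0.5):
    """Detect likely chord-change points between adjacent beat strips.

    spectrogram: magnitude spectrogram, shape (n_bins, n_frames).
    beat_frames: increasing frame indices of the provisional beat times.
    Returns a list of strip indices at which a chord change is suspected.
    """
    dominant_sets = []
    for start, end in zip(beat_frames[:-1], beat_frames[1:]):
        # Histogram of the frequency content within the strip: sum over time.
        strip_profile = spectrogram[:, start:end].sum(axis=1)
        dominant = set(np.argsort(strip_profile)[-n_dominant:])
        dominant_sets.append(dominant)

    changes = []
    for i in range(1, len(dominant_sets)):
        overlap = len(dominant_sets[i] & dominant_sets[i - 1]) / n_dominant
        if overlap < min_overlap:   # dominant frequencies changed -> chord change
            changes.append(i)
    return changes
```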
Fig. 11.6. Synergy between the estimation of the hierarchical beat structure, drum patterns, and chord changes. Drum patterns and chord changes are obtained, at 'Higher Analysis' in the figure, by using provisional beat times as top-down information. The hierarchical beat structure is then estimated, at 'Musical Decision' in the figure, by using the drum patterns and chord changes. A drum pattern is represented by the temporal pattern of a bass drum (BD) and a snare drum (SD).
of an image thumbnail) to find a desired song. It can also provide novel music
listening interfaces for end users as described in Section 11.7.3.
To detect chorus sections, typical approaches do not rely on prior informa-
tion regarding acoustic features unique to choruses but focus on the fact that
chorus sections are usually the most repeated sections of a song. They thus
adopt the following basic strategy: detect similar sections that repeat within a
musical piece (such as a repeating phrase) and output those that appear most
often. On entering the 2000s, this strategy has led to methods for extracting a
single segment from several chorus sections by detecting a repeated section of
a designated length as the most representative part of a musical piece [417],
[27], [103]; methods for segmenting music, discovering repeated structures,
or summarizing a musical piece through bottom-up analyses without assum-
ing the output segment length [110], [111], [512], [516], [23], [195], [104], [82],
[664], [420]; and a method for exhaustively detecting all chorus sections by
determining the start and end points of every chorus section [224].
Although this basic strategy of finding sections that repeat most often is simple and effective, it is difficult for a computer to judge repetition because it
is rare for repeated sections to be exactly the same. The following summarizes
the main problems that must be addressed in finding music repetition and
determining chorus sections.
Masataka Goto's survey of Japan's popular music hit chart (top 20 singles ranked weekly from 2000 to 2003) showed that modulation occurred in chorus repetitions in 152 songs (10.3%) out of 1481.
The following acoustic features, which capture pitch and timbral features of
audio signals in different ways, were used in various methods: chroma vectors
[224], [27], [110], [111], mel-frequency cepstral coefficients (MFCC) [417], [103],
[23], [195], [104], (dimension-reduced) spectral coefficients [103], [195], [104],
[82], [664], pitch representations using FO estimation or constant-Q filterbanks
[110], [111], [82], [420], and dynamic features obtained by supervised learning
[512], [516].
Fig. 11.7. Overview of RefraiD (Refrain Detection method) for detecting all chorus sections with their start and end points while considering modulations (key changes). The processing ranges from extracting 12-dimensional chroma vectors from the audio signal to integrating repeated sections with modulation.
Fig. 11.8. An example of chorus sections and repeated sections detected by the
RefraiD method. The horizontal axis is the time axis (in seconds) covering the
entire song. The upper window shows the power. On each row in the lower window,
coloured sections indicate similar (repeated) sections. The top row shows the list
of the detected chorus sections, which were correct for this song (RWC-MDB-P-
2001 No. 18 of the RWC Music Database [229], [227]) and the last of which was
modulated. The bottom five rows show the list of various repeated sections (only
the five longest repeated sections are shown). For example, the second row from the
top indicates the structural repetition of 'verse A => verse B => chorus'; the bottom
row with two short coloured sections indicates the similarity between the 'intro' and
'ending'.
that corresponds to a cycle of the helix; i.e., it refers to the position on the circumference of the helix seen from directly above. On the other hand, height refers to the vertical position of the helix seen from the side (the position of an octave).
Figure 11.9 shows an overview of calculating the chroma vector used in the RefraiD method [224]. This represents the magnitude distribution on the chroma that is discretized into twelve pitch classes within an octave. The 12-dimensional chroma vector v(t) is extracted from the magnitude spectrum \Psi_p(\nu, t) at the log-scale frequency \nu at time t, calculated by using the short-time Fourier transform (STFT). Each element of v(t) corresponds to a pitch class c (c = 1, 2, ..., 12) in the equal temperament and is represented as v_c(t):

v_c(t) = \sum_{h=\mathrm{Oct_L}}^{\mathrm{Oct_H}} \int_{-\infty}^{\infty} \mathrm{BPF}_{c,h}(\nu) \, \Psi_p(\nu, t) \, d\nu .    (11.29)

The \mathrm{BPF}_{c,h}(\nu) is a bandpass filter that passes the signal at the log-scale centre frequency F_{c,h} (in cents) of pitch class c (chroma) in octave position h (height), where

F_{c,h} = 1200 h + 100 (c - 1) .    (11.30)

The \mathrm{BPF}_{c,h}(\nu) is defined using a Hanning window.
Fig. 11.9. Overview of calculating a 12-dimensional chroma vector. The magnitude at six different octaves is summed into just one octave which is divided into 12 log-spaced divisions corresponding to pitch classes. Shepard's helix representation of musical pitch perception [584] is shown at the right.
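A simplified sketch of (11.29)-(11.30), assuming a linear-frequency STFT magnitude frame as input; each bin is assigned to the nearest pitch class over six octaves, so the Hanning-shaped band-pass weighting of the original method is replaced here by a hard assignment, and the reference frequency f_low is an arbitrary choice:

```python
import numpy as np

def chroma_vector(mag_spectrum, sample_rate, n_fft, f_low=63.5, n_octaves=6):
    """Compute a 12-dimensional chroma vector from one STFT magnitude frame.

    mag_spectrum: magnitude spectrum of one frame, length n_fft // 2 + 1.
    f_low: lower edge of the analysed range in Hz; about six octaves are summed.
    """
    chroma = np.zeros(12)
    freqs = np.arange(len(mag_spectrum)) * sample_rate / n_fft
    f_high = f_low * 2 ** n_octaves

    for k, f in enumerate(freqs):
        if f < f_low or f >= f_high:
            continue
        # Frequency in cents above the reference; 100 cents per semitone.
        cents = 1200.0 * np.log2(f / f_low)
        pitch_class = int(round(cents / 100.0)) % 12   # corresponds to c - 1 in (11.30)
        chroma[pitch_class] += mag_spectrum[k]
    return chroma
```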
recognition capture spectral content and general pitch range, and are useful
for finding timbral or 'texture' repetitions. Dynamic features [512], [516] are
more adaptive spectral features that are designed for music structure discov-
ery through a supervised learning method. Those features are selected from
the spectral coefficients of a filterbank output by maximizing the mutual in-
formation between the selected features and hand-labelled music structures.
The dynamic features are beneficial in that they reduce the size of the results
when calculating similarity (i.e., the size of the similarity matrix described in
Section 11.5.1) because the frame shift can be longer (e.g., 1 s) than for other
features.
Calculating Similarity
Given a feature vector such as the chroma vector or MFCC at every frame,
the next step is to calculate the similarity between feature vectors. Various
distance or similarity measures, such as the Euclidean distance and the cosine
angle (inner product), can be used for this. Before calculating the similarity,
feature vectors are usually normalized, for example, to a mean of zero and a
standard deviation of one or to a maximum element of one.
In the RefraiD method [224], the similarity r(t, l) between the feature vectors (chroma vectors) v(t) and v(t - l) is defined as

r(t, l) = 1 - \frac{1}{\sqrt{12}} \left\| \frac{\mathbf{v}(t)}{\max_c v_c(t)} - \frac{\mathbf{v}(t-l)}{\max_c v_c(t-l)} \right\| ,    (11.32)

where l is the lag and v_c(t) is an element of v(t) (11.29). Since the denominator \sqrt{12} is the length of the diagonal line of a 12-dimensional hypercube with edge length 1, r(t, l) satisfies 0 \le r(t, l) \le 1.
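The sketch below evaluates (11.32) for every frame t and every lag l (0 <= l <= t), producing the time-lag triangle r(t, l) discussed next; chroma_frames is assumed to hold the 12-dimensional chroma vectors of a piece:

```python
import numpy as np

def time_lag_similarity(chroma_frames):
    """Compute r(t, l) of (11.32) for all t and all lags 0 <= l <= t.

    chroma_frames: array of shape (n_frames, 12).
    Returns an (n_frames, n_frames) array where entry [t, l] holds r(t, l);
    entries with l > t are left at zero.
    """
    V = np.asarray(chroma_frames, dtype=float)
    # Normalize each frame by its maximum element, as in (11.32).
    V = V / (V.max(axis=1, keepdims=True) + 1e-9)

    n_frames = len(V)
    r = np.zeros((n_frames, n_frames))
    for t in range(n_frames):
        lags = np.arange(t + 1)
        diff = V[t] - V[t - lags]          # v(t) compared with v(t - l)
        r[t, :t + 1] = 1.0 - np.linalg.norm(diff, axis=1) / np.sqrt(12)
    return r
```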
Fig. 11.10. An idealized example of a similarity matrix and time-lag triangle drawn from the same feature vectors of a musical piece consisting of four 'A' sections and two 'B' sections. The diagonal line segments in the similarity matrix or horizontal line segments in the time-lag triangle, which represent similar sections, appear when short-time pitch features like chroma vectors are used.
is drawn within a square in the two-dimensional (t-u) space. For the time-lag triangle, the similarity r(t, l) between feature vectors v(t) and v(t - l) is drawn within a right-angled isosceles triangle in the two-dimensional time-lag (t-l) space. If a nearly constant tempo can be assumed, each pair of similar sections is represented by two non-central diagonal line segments in the
As described in Section 4.6, p. 112, the similarity matrix can also be used to examine rhythmic structure.
where \omega_1 and \omega_2 are the probabilities of class occurrence (number of peaks in each class divided by the total number of peaks), and \mu_1 and \mu_2 are the means of the peak heights in each class.
R_all(t, l) is evaluated along with the real-time audio input for a real-time system based on RefraiD. On the other hand, it is evaluated at the end of a song for a non-real-time off-line analysis.
This can be considered the Hough transform where only horizontal lines are detected: the parameter (voting) space R_all(t, l) is therefore simply one dimensional along l.
Because the similarity r(\tau, l) is noisy, its sum R_all(t, l) tends to be biased: the longer the summation period for R_all(t, l), the higher the summation result by (11.34).
Fig. 11.11. A sketch of line segments, the similarity r(t, l) in the time-lag triangle, and the possibility R_all(t, l) of containing line segments at lag l (A: verse A, B: verse B, C: chorus).
Fig. 11.12. Examples of the similarity r(\tau, l_1) at high-peak lags l_1. The bottom horizontal bars indicate the regions above an automatically adjusted threshold, which means they correspond to line segments.
For each picked-up high peak with lag l_1, the line segments are finally searched on the one-dimensional function r(\tau, l_1) (l_1 \le \tau \le t). After smoothing r(\tau, l_1) using a moving average filter, the method obtains line segments on
which the smoothed r(\tau, l_1) is above a threshold (Fig. 11.12). This threshold is also adjusted through the automatic threshold selection method.
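A sketch of this segment-extraction step on a single lag column r(tau, l_1); the moving-average length, the fixed threshold, and the minimum segment length stand in for the automatically adjusted values of the original method:

```python
import numpy as np

def find_line_segments(r_column, smooth_len=9, threshold=0.75, min_len=10):
    """Find line segments on one column r(tau, l1) of the time-lag triangle.

    r_column: similarity values r(tau, l1) for tau = l1, ..., t.
    Returns (start, end) index pairs of regions where the smoothed similarity
    stays above the threshold for at least min_len frames.
    """
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(r_column, kernel, mode="same")   # moving average
    above = smoothed > threshold

    segments, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(above) - start >= min_len:
        segments.append((start, len(above)))
    return segments
```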
Instead of using the similarity matrix and time-lag triangle, there are other
approaches that do not explicitly find repeated sections. To segment music,
represent music as a succession of states (labels), and obtain a music thumbnail
or summary, these approaches segment and label (i.e., categorize) contiguous
frames (feature vectors) by using clustering techniques [417] or ergodic hidden
Markov models (HMMs) [417], [512], [516] (HMMs are introduced on p. 63 of
this volume).
Since each line segment in the time-lag triangle indicates just a pair of repeated sections, it is necessary to organize into a group the line segments that have common sections, i.e., that overlap in time. When a section is repeated N times (N \ge 3), the number of line segments to be grouped together should theoretically be N(N-1)/2 if all of them are found in the time-lag triangle. Aiming to exhaustively detect all the repeated (chorus) sections appearing in a song, the RefraiD method groups line segments having almost the same section while redetecting some missing (hidden) line segments not found in the bottom-up detection process (described in Section 11.5.2) through top-down processing using information on other detected line segments. In Fig. 11.11, for example, two line segments corresponding to the repetition of the first and third C and the repetition of the second and fourth C, which overlap with the long line segment corresponding to the repetition of ABCC, can be found even if they were hard to find in the bottom-up process. The method also appropriately adjusts the start and end times of line segments in each group because they are sometimes inconsistent in the bottom-up line segment detection.
The processes described above do not deal with modulation (key change), but they can easily be extended to it. A modulation can be represented by the pitch difference of its key change, \zeta (\zeta = 0, 1, ..., 11), which denotes the number of tempered semitones. For example, \zeta = 9 means the modulation of nine semitones upward or the modulation of three semitones downward. One of the advantages of the 12-dimensional chroma vector v(t) is that a transposition amount \zeta of the modulation can naturally correspond to the amount by which its 12 elements are shifted (rotated). When v(t) is the chroma vector of a certain performance and v(t)' is the chroma vector of the performance that is modulated by \zeta semitones upward from the original performance, they tend to satisfy

v(t) \approx S^{\zeta} v(t)' ,    (11.36)
where the 12 \times 12 shift matrix S is given by

S = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \end{pmatrix} .    (11.37)
To detect modulated repetition by using this feature of chroma vectors and considering 12 destination keys, the RefraiD method [224] calculates 12 kinds of extended similarity as follows:

r_{\zeta}(t, l) = 1 - \frac{1}{\sqrt{12}} \left\| \frac{S^{\zeta} \mathbf{v}(t)}{\max_c v_c(t)} - \frac{\mathbf{v}(t-l)}{\max_c v_c(t-l)} \right\| .    (11.38)
Starting from each r_{\zeta}(t, l), the processes of finding and grouping the repeated sections are performed again. Non-modulated and modulated repeated sections are then grouped if they share the same section.
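Since S^\zeta merely rotates the 12 chroma elements, the twelve extended similarities of (11.38) can be computed by reusing the similarity calculation on circularly shifted chroma vectors, as in the sketch below (a direct, unoptimized illustration rather than the original implementation):

```python
import numpy as np

def modulated_similarities(chroma_frames):
    """Compute the 12 extended similarities r_zeta(t, l) of (11.38).

    Rotating the 12 chroma elements by zeta positions corresponds to
    multiplying the chroma vector by S^zeta in (11.36)-(11.37).
    Returns a list of 12 time-lag triangles, one per zeta = 0, ..., 11.
    """
    V = np.asarray(chroma_frames, dtype=float)
    V_norm = V / (V.max(axis=1, keepdims=True) + 1e-9)

    triangles = []
    for zeta in range(12):
        # S^zeta v(t): rotate the pitch-class axis of the normalized vectors.
        V_shift = np.roll(V_norm, shift=zeta, axis=1)
        n = len(V_norm)
        r = np.zeros((n, n))
        for t in range(n):
            lags = np.arange(t + 1)
            diff = V_shift[t] - V_norm[t - lags]   # S^zeta v(t) vs. v(t - l)
            r[t, :t + 1] = 1.0 - np.linalg.norm(diff, axis=1) / np.sqrt(12)
        triangles.append(r)
    return triangles
```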
Since the above sections mainly describe the RefraiD method [224] with the
focus on detecting all chorus sections, this section briefly introduces other
methods [417], [27], [103], [110], [111], [512], [516], [23], [195], [104], [82], [664],
Music scene description methods that can deal with real-world audio signals of
musical pieces sampled from CD recordings have various practical applications
such as music information retrieval, music-synchronized computer graphics,
and music listening stations. The following sections introduce these applica-
tions.
Because the beat tracking described in Section 11.3 and Chapter 4 can be
used to automate the time-consuming tasks that must be done to synchronize
events with music, there are various applications. In fact, Goto and Muraoka
[235], [220], [221] developed a real-time system that displays virtual dancers
and several graphic objects whose motions and positions change in time to
beats (Fig. 11.13). This system has several dance sequences, each for a different
mood of dance motions. While a user selects a dance sequence manually, the
timing of each motion in the selected sequence is determined automatically
The automatic chorus section detection described in Section 11.5 enables new
music-playback interfaces that facilitate content-based manual browsing of
entire songs. As an application of the RefraiD method, Goto [226] developed a music listening station for trial listening, called SmartMusicKIOSK. Customers in music stores often search out the chorus or 'hook' of a song by
repeatedly pressing the fast-forward button, rather than passively listening to
the music. This activity is not well supported by current technology. SmartMu-
sicKIOSK provides the following two functions to facilitate an active listening
experience by eliminating the hassle of manually searching for the chorus and
making it easier for a listener to find desired parts of a song:
1. 'Jump to chorus' function: automatic jumping to the beginning of sections
relevant to a song's structure
Functions are provided enabling automatic jumping to sections that will
be of interest to listeners. These functions are 'jump to chorus (NEXT
CHORUS button)', 'jump to previous section in song (PREV SECTION
button)', and 'jump to next section in song (NEXT SECTION button)',
and they can be invoked by pushing the buttons shown above in paren-
theses (in the lower window of Fig. 11.14). With these functions, a listener
can directly jump to and listen to chorus sections, or jump to the previous
or next repeated section of the song.
2. 'Music map' function: visualization of song contents
A function is provided to enable the contents of a song to be visualized
to help the listener decide where to jump next. Specifically, this function
provides a visual representation of the song's structure consisting of chorus
sections and repeated sections, as shown in the upper window of Fig. 11.14.
While examining this display, the listener can use the automatic jump
buttons, the usual fast-forward/rewind buttons, or a playback slider to
move to any point of interest in the song.
This interface, which enables a listener to look for a section of interest
by interactively changing the playback position, is useful not only for trial
listening but also for more general purposes in selecting and using music.
While entire songs of no interest to a listener can be skipped on conventional
music-playback interfaces, SmartMusicKIOSK is the first interface that allows
the listener to easily skip sections of no interest even within a song.
11.8 Conclusion
This chapter has described the music scene description research approach towards developing a system that understands real-world musical audio signals without deriving musical scores or separating signals. This approach is important from an academic viewpoint because it explores what is essential for understanding audio signals in a human-like fashion. The ideas and techniques are expected to be extended to not only music signals but also general audio signals including music, speech, environmental sounds, and mixtures of them. Traditional speech recognition frameworks have been developed for dealing with only monophonic speech signals or a single-pitch sound with background noise, which should be removed or suppressed without considering their relationship. Research on understanding musical audio signals is a good starting point for creating a new framework for understanding general audio signals, because music is polyphonic, temporally structured, and complex, yet still well organized. In particular, relationships between various simultaneous or successive sounds are important and unique to music. This chapter, as well as other chapters in this book, will contribute to such a general framework.
The music scene description approach is also important from industrial or application viewpoints since end users can now easily 'rip' audio signals from CDs, compress and store them on a personal computer, load a huge number of
songs onto a portable music player, and listen to them anywhere and anytime.
These users want to retrieve and listen to their favourite music or a portion
of a musical piece in a convenient and flexible way. Reflecting these demands,
the target of processing has expanded from the internal content of individual
musical pieces to entire musical pieces and even sets of musical pieces [233].
While the primary target of music scene description is the internal content of
a piece, the obtained descriptions are useful for dealing with sets of musical
pieces as described in Section 11.7.1. The more accurate and detailed we can
make the obtained music scene descriptions, the more advanced and intelligent
music applications and interfaces will become.
Although various methods for detecting melody and bass lines, tracking
beats, detecting drums, and finding chorus sections have been developed and
successful results have been achieved to some extent, there is much room for
improving these methods and developing new ones. For example, in general
each method has been researched independently and implemented separately.
An integrated method exploiting the relationships between these descriptions
will be a promising next step. Other music scene descriptions apart from
those described in this chapter should also be investigated in the future. Ten
years ago it was considered too difficult for a computer to obtain most of the
music scene descriptions described here, but today we can obtain them with a
certain accuracy. I look forward to experiencing further advances in the next
ten years.
12
Singing Transcription
Matti Ryynanen
12.1 Introduction
Singing refers to the act of producing musical sounds with the human voice,
and singing transcription refers to the automatic conversion of a recorded
singing signal into a parametric representation (e.g., a MIDI file) by apply-
ing signal-processing methods. Singing transcription is an important topic
in computational music-content analysis since it is the most natural way of
human-computer interaction in the musical sense: even a musically untrained
subject is usually able to hum the melody of a piece. This chapter introduces
the singing transcription problem and presents an overview of the main ap-
proaches to solve it, including the current state-of-the-art singing transcription
systems.
During the last ten years, the rapid growth of digital music databases has
challenged researchers to develop natural user interfaces for accessing them
by using the singing voice. Consequently, most of the research on singing
transcription has been conducted in the context of query-by-humming systems
where singing transcription acts as a front end. After converting a singing
signal into a notated query, music pieces corresponding to the query can be
retrieved from the database. However, singing transcription enables a wide
range of other applications as well, including singing-input functionalities in
applications such as computer games or singing tutors, automatic tools for
annotating large corpora of singing, audio editor applications for professional
music production, and naturally, applications that convert singing signals into
musical scores.
Fig. 12.1. Block diagram of a typical singing transcription system: the singing signal is pre-processed, features are extracted, notes are segmented and labelled using acoustic and musicological models, and post-processing produces the final note sequence.
Singing sounds are produced by the human vocal organ, which consists of
three basic units: (i) the respiratory system, (ii) the vocal folds, and (iii) the
vocal tract [614]. The sound production process is as follows. First, the respi-
ratory system creates an overpressure of air in the lungs, called the subglottic
pressure, which results in an air flow through the vocal folds. The vocal folds
start to vibrate and chop the air flow into a sequence of quasi-periodic air
pulses, thus producing a sound with a measurable fundamental frequency.
The sequence of air pulses is called the voice source and the process of sound
generation via the vocal fold vibration is referred to as phonation. At the fi-
nal stage, the voice source passes through the vocal tract, which modifies the
spectral shape and determines the timbre of the voice sound. This stage is
referred to as articulation and it controls the production of different speech
sounds and lyrics in singing.
Figure 12.2 shows a block diagram of the singing-sound production process.
The vocal-organ units control various acoustic properties of singing sounds,
including their fundamental frequency, timbre, and loudness. These are dis-
cussed in more detail in the following.
Fig. 12.2. The stages of singing-sound production (breathing, phonation, and articulation). The dashed arrows indicate the acoustic properties controlled by the three vocal organ units. (Modified from Sundberg [613, p. 10], used by permission.)
Fundamental frequency
Timbre
The vocal tract acts as the most important controller of the singing sound
timbre at the articulation stage. It functions as a resonating filter which em-
phasizes certain frequencies called the formant frequencies. These depend on
the configuration of the articulators, including the jaw, the tongue, and the
lips. In voiced sounds, the two lowest formants contribute to the identification
of the vowel and the higher formants to the personal voice timbre.
In addition, singing sound timbre is affected by the amount of subglot-
tic pressure and the tension of the vocal folds, resulting in different types
of phonation. The phonation types include pressed, normal, flow, breathy,
and whisper phonation, given in decreasing order of subglottic pressure. In
the normal phonation, the vocal folds are not completely closed. When the
subglottic pressure is high, the vocal folds close more rapidly and produce a
'pressed' sound. In the breathy and whisper phonations, the amount of sub-
glottic pressure is insufficient to properly vibrate the vocal folds and a part of
the air flow remains unchopped, thus producing a breathy-sounding phona-
tion. In the flow phonation, the produced sound is neither pressed nor leaky,
which is ideal for singing.
Loudness
The term vibrato refers to the modulation of the phonation frequency during
a performed note. Vibrato can be characterized with the rate and the depth of
the modulation. The rate typically varies between 4-7 Hz [425], and the depth
between 0.3-1 semitones. In [530], the depth of vibrato was measured in ten
recordings performed by professional singers and was reported to range from
Fig. 12.3. A) The fundamental frequency curve and B) the loudness curve measured from a note A4 (440 Hz) performed by a female singer (time axis in seconds). The F0 curve clearly shows glissando and vibrato, whereas the loudness curve shows tremolo. Notice the correlation between these two.
at unstressed metrical positions, during tragic song mood, or when the phona-
tion frequency is sharp (too high) rather than flat. It was also observed that
the direction of the deviation was related to the musical context [614].
Other types of tuning problems arise when singing is performed without
an accompaniment. In general, singers are not able to perform in absolute
tuning and, moreover, the tuning may drift, meaning that a singer gradually
changes the baseline tuning over time.
12.3.1 Pre-Processing
Time-Domain Methods
Autocorrelation-based F0 estimators are widely used in singing transcription.
The idea of these is to measure the amount of correlation between a time-
domain signal and its time-shifted version. For periodic signals, the autocor-
relation function has local maxima at time shifts that equal the fundamental
period and its multiples.
Given a sampled time-domain signal s(k) and a frame length W, the short-time autocorrelation function r_t(\tau) at time t is defined as

r_t(\tau) = \sum_{k=t}^{t+W-1} s(k)\, s(k+\tau) ,    (12.2)

where \tau is called the lag. Figure 12.4 shows r_t(\tau) as calculated for a frame of a singing signal. The function r_t(\tau) peaks at lags which correspond to the multiples of the fundamental period. A fundamental frequency estimate is obtained by dividing the sampling rate of the signal by the smallest non-zero lag value for which r_t(\tau) reaches a value above a chosen threshold.
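A direct implementation of (12.2) together with the threshold-based lag selection described above might look as follows; the search range of 80-800 Hz and the relative threshold of 0.5 are illustrative choices:

```python
import numpy as np

def acf_f0_estimate(signal, t, frame_len, sample_rate,
                    f_min=80.0, f_max=800.0, rel_threshold=0.5):
    """Estimate F0 at frame position t using the short-time ACF of (12.2).

    Assumes the signal extends at least frame_len + max_lag samples past t.
    """
    max_lag = int(sample_rate / f_min)          # longest candidate period
    min_lag = int(sample_rate / f_max)          # shortest candidate period
    frame = np.asarray(signal[t:t + frame_len + max_lag], dtype=float)

    # r_t(tau) = sum_k s(k) s(k + tau) over the analysis frame, as in (12.2).
    acf = np.array([np.dot(frame[:frame_len], frame[lag:lag + frame_len])
                    for lag in range(max_lag + 1)])

    # Smallest non-zero lag whose ACF exceeds a threshold relative to r_t(0).
    threshold = rel_threshold * acf[0]
    for lag in range(min_lag, max_lag + 1):
        if acf[lag] > threshold:
            return sample_rate / lag
    return None   # no reliable estimate; frame treated as unvoiced
```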
The autocorrelation method is straightforward to implement and easy to use. A drawback is, however, that the method is sensitive to formants in the signal spectrum and therefore tends to make octave errors. Spectral whitening makes the autocorrelation more robust in this respect. Autocorrelation has been employed in the singing transcription systems of Ghias et al. [207], Shih et al. [587], [586], [588], and Wang et al. [659], for example.
YIN algorithm
The YIN algorithm for F0 estimation was proposed by de Cheveigne and
Kawahara [135]. It resembles the idea of autocorrelation but introduces cer-
tain improvements that make it more convenient to use. The YIN algorithm
was successfully used for singing transcription by Viitaniemi et al. [643] and
Ryynanen and Klapuri [558]. Given that s(k) is a discrete time-domain signal with sampling rate f_s, the YIN algorithm produces an F0 estimate as follows.
1. Calculate the squared difference function d_t(\tau), where \tau is the lag:

d_t(\tau) = \sum_{k=t}^{t+W-1} \left( s(k) - s(k+\tau) \right)^2 .    (12.3)
Fig. 12.4. A) An excerpt of a singing signal with sampling rate 44.1 kHz. B) The ACF r_t(\tau) calculated using (12.2). C) The cumulative-mean-normalized difference function d'_t(\tau) calculated using (12.4) (YIN algorithm). See text for details.
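A sketch of the YIN difference functions: yin_difference implements the squared difference d_t(tau) of (12.3), and yin_cmnd forms the cumulative-mean-normalized function d'_t(tau) shown in Fig. 12.4C; since the remaining steps of the algorithm are not reproduced in the text above, the normalization and the absolute threshold follow the published algorithm of de Cheveigne and Kawahara and are given here only as an assumed illustration:

```python
import numpy as np

def yin_difference(frame):
    """Squared difference function d_t(tau) of (12.3) for one analysis frame."""
    W = len(frame) // 2
    d = np.zeros(W)
    for tau in range(W):
        diff = frame[:W] - frame[tau:tau + W]
        d[tau] = np.sum(diff ** 2)
    return d

def yin_cmnd(d):
    """Cumulative-mean-normalized difference d'_t(tau) (cf. (12.4)).

    d'_t(0) = 1 by definition; small values indicate strong periodicity.
    """
    d_prime = np.ones_like(d)
    cumulative = np.cumsum(d[1:])
    d_prime[1:] = d[1:] * np.arange(1, len(d)) / (cumulative + 1e-12)
    return d_prime

def yin_f0(frame, sample_rate, threshold=0.15):
    """Pick the smallest lag where d'_t(tau) drops below the threshold."""
    d_prime = yin_cmnd(yin_difference(np.asarray(frame, dtype=float)))
    below = np.where(d_prime[1:] < threshold)[0]
    return sample_rate / (below[0] + 1) if len(below) else None
```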
Fig. 12.5. The two-way mismatch error calculation is based on combining the predicted-to-measured and measured-to-predicted errors. The two-way matching makes the method robust for both 'too low' and 'too high' octave errors. (After Maher and Beauchamp [428].)
Frequency-Domain Methods
In the systems of Pollastri [528] and Haus and Pollastri [278], an F0 estimate was determined by searching for the most prominent peak in the spectrum between 80 Hz and 800 Hz for which at least two clear harmonics were found. Another peak-detection scheme was applied by Zhu and Kankanhalli in [689], where a set of the highest peaks in the spectrum were detected and a rule-based peak-grouping algorithm was employed to obtain an F0 estimate.
Energy
Features related to the signal energy are widely used for note segmentation,
with the assumption that the signal level reflects the loudness of the target
singing voice. Segments where the signal energy exceeds a given threshold
value are often considered to be notes and the other segments are treated as
silence or background noise. Energy-related measures are straightforward to use and work well if the notes are separated by quieter regions. However, this is not the case in most signals; usually there are also legato-type note transitions. Therefore, robust note segmentation cannot be based on energy
measures alone.
One of the most often used energy measures is the root-mean-square
(RMS) energy (see Chapter 6 for the RMS definition). This feature has been
applied by Haus and Pollastri [278], Pollastri [528], McNab et al. [454], [455],
and Shih et al. [587], [586], [588], for example. Different variants of energy
calculations have been applied in note segmentation, including the systems
by Clarisse et al. [92], Liu et al. [411], and Orio and Sette [489].
Voicing
A more reliable feature for note segmentation is the degree of voicing of a sig-
nal frame. Voiced frames possess clear periodicity, whereas unvoiced frames
can represent transient noise with a great amount of signal energy, or just
silence. Commonly, the voicing determination is embedded within the F0 estimation algorithms. In autocorrelation-based F0 estimators, for example, the degree of voicing is straightforwardly given by the ACF value at the lag corresponding to the estimated fundamental period, divided by the value at lag zero. If the ratio of these two does not exceed a given threshold, the frame is considered to be unvoiced.
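For example, with the ACF of (12.2) the degree of voicing can be computed as the ratio just described; the 0.5 decision threshold below is only an illustrative value:

```python
import numpy as np

def acf_voicing(frame, period_lag, threshold=0.5):
    """Degree of voicing as r_t(period_lag) / r_t(0) and a voiced/unvoiced flag.

    frame: time-domain samples of one analysis frame (longer than period_lag).
    period_lag: lag (in samples) of the estimated fundamental period.
    """
    frame = np.asarray(frame, dtype=float)
    W = len(frame) - period_lag
    r0 = np.dot(frame[:W], frame[:W])                          # r_t(0)
    r_period = np.dot(frame[:W], frame[period_lag:period_lag + W])
    voicing = r_period / (r0 + 1e-12)
    return voicing, voicing > threshold
```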
When the YIN algorithm is used for F0 estimation, the degree of voicing can be directly derived from the value d'_t(\tau) of the cumulative-mean-normalized difference function defined in (12.4). The d'_t(\tau) value itself describes the amount of non-periodicity in the F0 measurement. To obtain a voicing value, the d'_t(\tau) value has to be mapped to a voicing feature, for example as in (12.5).
The voiced/unvoiced decision can also be based on the zero crossing rate
(ZCR) together with an energy measure, as done in [528], [421]. ZCR loosely
describes the brightness of the sound. High ZCR values imply transient or
noisy segments since these tend to have lots of energy at high frequencies.
ZCR is defined in Chapter 6.
Accents
Table 12.1 summarizes the features discussed in this section. Figure 12.7 shows a selection of features extracted from a short singing excerpt containing five notes performed by a professional female singer. From top to bottom, the panels in the figure show the recorded singing waveform, fundamental frequency estimates, the degree of voicing, RMS energy, the accent signal, and the zero crossing rate. The panel with F0 estimates also shows a manual transcription of the notes in the performance (B4, A4, A4, F#4, and A4). The F0 estimates and the degree of voicing were obtained using the YIN algorithm and (12.5).
At the beginning of the excerpt there is silence, which can be observed
from the RMS and the degree of voicing. When the first note begins, the
accent signal has a clear peak. Interestingly, the second note begins with an
Fig. 12.7. Features extracted from the singing excerpt described above; the RMS panel additionally shows the onset and offset thresholds and the resulting note segments (time axis in seconds).
Note Segmentation
Note Labelling
Note labelling follows the note segmentation. At this stage, each note segment
is assigned a pitch label such as an integer MIDI note number or a note name.
The most important question here is how to determine a single label for a note segment where the F0 estimates vary widely and are possibly out of absolute tuning. The different note-labelling schemes differ from each other mainly
in terms of how they handle the singer's tuning. This has been addressed
using three different assumptions:
1. The singer performs in absolute tuning (note A4 corresponds to 440 Hz);
2. The singer has a relative tuning in mind and makes consistent deviations
from the absolute tuning;
3. The singer allows the baseline tuning to drift during a performance.
Corresponding to the three assumptions, the transcription systems either per-
form no tuning, estimate a constant tuning, or perform time-adaptive tuning,
respectively. Three systems applying these three approaches are briefly intro-
duced in the following.
No tuning
Constant tuning
Haus and Pollastri assumed that the performed notes differ by a constant amount in semitones from the absolute tuning [278]. Figure 12.9 shows a block diagram of their pitch-labelling process. Given a note segment, the F0 estimates within the segment were first 3-point median filtered to remove F0 outliers. Then a group of four contiguous frames with similar F0 values constituted a block. At the block level, a legato with a note pitch change was detected when the pitch between adjacent blocks had changed by more than 0.8 semitones. In the case of a detected legato, the note segment was divided into two new segments. Otherwise, the adjacent blocks constituted a note event for which the F0 was calculated as the arithmetic mean of the F0 in the blocks and represented as an unrounded MIDI note number. This process was repeated for each note segment, resulting in note segments with unrounded MIDI note labels.
To determine the deviation from the absolute tuning, the authors calcu-
lated a histogram of distances from the unrounded MIDI note numbers to their
nearest integer numbers. The highest peak in the histogram then indicated the
Fig. 12.9. Block diagram of the pitch-labelling process of Haus and Pollastri: note segments are split if a legato is detected, the average pitch of each event is expressed as an unrounded MIDI note, and a constant tuning offset is applied to obtain the final note sequence.
most-often occurring offset from the absolute tuning and, subsequently, every
note segment label was shifted by this offset, thus minimizing the rounding
error. Finally, the shifted MIDI note labels were rounded to integers, thus
obtaining the note pitch labels.
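A sketch of this constant-tuning step, assuming the unrounded MIDI note values of the segments are already available; the 0.1-semitone histogram bin width is an illustrative choice:

```python
import numpy as np

def constant_tuning_labels(unrounded_midi_notes, bin_width=0.1):
    """Assign integer MIDI labels after estimating a constant tuning offset.

    unrounded_midi_notes: per-segment pitches as unrounded MIDI note numbers.
    Returns (integer_labels, estimated_offset_in_semitones).
    """
    notes = np.asarray(unrounded_midi_notes, dtype=float)
    # Deviation of each note from its nearest integer MIDI number.
    deviations = notes - np.round(notes)

    # Histogram of the deviations; the highest bin gives the most common offset.
    bins = np.arange(-0.5, 0.5 + bin_width, bin_width)
    counts, edges = np.histogram(deviations, bins=bins)
    peak = np.argmax(counts)
    offset = 0.5 * (edges[peak] + edges[peak + 1])

    # Shift every note by the offset so that the rounding error is minimized.
    labels = np.round(notes - offset).astype(int)
    return labels, offset
```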
Time-adaptive tuning
analysis. The tokens are further weighted with predefined transition probabil-
ities between different HMMs. Eventually, the most probable path is defined
by the token with the maximum posterior probability and the corresponding
boundaries between different HMMs in the network.
Note-event model
Note events are described with a three-state left-to-right hidden Markov model where P(\theta_t = e_j | \theta_{t-1} = e_i) \ne 0 only when j = i or j = i + 1. The state e_i in the model represents the typical acoustic characteristics of the ith temporal segment of a performed singing note. The model uses three features: fundamental frequency estimates (represented as unrounded MIDI note numbers), the degree of voicing, and the accent signal. The features are extracted as explained in Section 12.3. Different notes are represented with a separate HMM for each MIDI note n = 36, ..., 79. For note n, the features in frame t form the observation vector x_t, where the difference between the fundamental frequency estimate and note n is used instead of the F0 estimate directly. This is referred to as the pitch difference \Delta F_0 (i.e., \Delta F_0 = F_0 - n, with F_0 expressed as an unrounded MIDI note number).
The use of the pitch difference facilitates the training of the model. Usually, there is only a limited amount of training data available, at least for each possible singing note. Due to the pitch-difference feature, it is possible to train only one set of note HMM parameters with a greater amount of data, and the same parameters can be used to represent all the different MIDI notes.
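The observation vectors evaluated by the HMM of a particular MIDI note n can thus be assembled as sketched below, assuming the frame-wise features of Section 12.3 are available as arrays:

```python
import numpy as np

def note_observations(f0_midi, voicing, accent, note_n):
    """Build observation vectors x_t for the HMM of MIDI note n.

    f0_midi, voicing, accent: per-frame features (arrays of equal length);
    f0_midi holds F0 estimates expressed as unrounded MIDI note numbers.
    The first component is the pitch difference Delta F0 = F0 - n, which lets
    a single set of trained HMM parameters serve all MIDI notes.
    """
    pitch_difference = np.asarray(f0_midi, dtype=float) - note_n
    return np.column_stack([pitch_difference, voicing, accent])
```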
The state-transition probabilities and the observation likelihood distributions were estimated from an acoustic database containing audio material
performed by eleven non-professional singers. The singers were accompanied
by MIDI representations of the melodies which the singers heard through
headphones while performing. Only the performed melodies were recorded,
and later, the reference accompaniments were synchronized with the perfor-
mances. The reference notes were used to determine note boundaries in the
training material, and the Baum-Welch algorithm was then used to learn the
note HMM parameters.
Figure 12.10 illustrates the trained note HMM. The HMM states are shown on top of the figure, where the three states are referred to as the attack,
Fig. 12.10. A note HMM with three states: an attack, a sustain, and a silence/noise state. The arrows show the possible state transitions. The observation likelihood distributions are shown below each state for the features used (pitch difference, the degree of voicing, and accent).
the sustain, and the silence/noise state, each of which corresponds to a time
segment within note events. The arrows show the possible transitions between
the states. The observation likelihood distributions shown below each state
express the typical behaviour of these features during the different segments
within note events. It is a little surprising that unsupervised learning leads to
such an intuitive interpretation of the three states of the note model:
Attack: The singing pitch may vary about a few semitones from the actual
note pitch. Since the mean of the pitch-difference distribution is at 0.5
semitones, this implies that notes usually begin slightly flat. Accent value
distribution is widely spread, which indicates the presence of large accent
values during note attacks.
Sustain: The variance of the pitch difference is much smaller than in the attack state. Most of the F0 estimates stay within a 0.2 semitone distance from the nominal pitch of the note. In addition, the frames during the sustain state are mostly voiced (i.e., the F0 estimation can be reliably performed) and the accent values are small.
Fig. 12.11. Combination of the note models and the musicological model.
Silence/noise: The F0 estimates are almost random-valued, since most of the frames are unvoiced (notice the high likelihood of voicing values near 0.3).
The authors did not consider modelling the rests, since the note model
itself includes a state for silent or noisy segments. Thus the time instants
where the silent/noise state is visited can be considered as rests.
The HMMs of different MIDI notes are joined into a note model network illus-
trated in Figure 12.11. The probability of a transition from one note to another
is determined by a musicological model (explained in Section 12.4.3). Notice
that the figure shows the note models at successive time instants although
there actually exists only one note model per MIDI note. Singing melodies
can be transcribed by finding the most probable path through the network
according to the probabilities given by the note HMMs and the musicological
model. The authors used the token-passing algorithm for this purpose.
The transcription system of Ryynanen and Klapuri was evaluated using 57
singing melodies from the same database that was used for training the note
models (the training signals were not included in the evaluation set). An error
rate below 10% was achieved when the transcriptions were compared to the
reference MIDI notes. Simply rounding FO estimates to MIDI note numbers
produced an error rate of about 20%. The use of note models reduced the error rate to 13%, and including the musicological model as well decreased the error rate below 10%. A similar note model network was later applied by
the authors to the transcription of polyphonic music from various genres [559].
Orio and Sette used an HMM-based note modelling technique for transcribing singing queries [489]. Their approach was similar to the system of Ryynanen and Klapuri: each MIDI note was represented by its own HMM, which had three states for modelling the attack, sustain, and rest segments in singing notes. The features used included the logarithm of the signal energy, the spectral energy on the first harmonics of each modelled note for F0 detection, and
the first derivatives of these. The observation likelihood distributions were
derived by making statistical analysis of a set of labelled audio examples.
The HMMs of different notes were joined into a note network, and the Viterbi
algorithm was used to decide both the note segments and the pitch note labels
simultaneously, thus producing transcriptions of singing queries.
Their preliminary results indicated that most of the labelling errors were
due to glissandi between notes. Another common error was to insert additional
note boundaries during note attacks. The authors discussed the interesting
possibility of using several attack states for modelling different types of note
beginnings and an enhanced sustain state with two additional states modelling
slight detunings upwards and downwards from the note pitch. However, these
ideas were not implemented in the reported system.
Viitaniemi et al. proposed an HMM for singing transcription, where the
HMM states corresponded to different notes [643]. The observation likelihood
distributions consisted of the F0 value distributions for each note, and the state transitions were controlled with a musicological model. The optimal state sequence was found with the Viterbi algorithm in order to produce note labels at each time instant. In the optimal state sequence, transitions between
different states were interpreted as note boundaries. The system achieved error
rates around 13% for the database that was used by Ryynanen and Klapuri
in [558].
Shih et al. used a three-state HMM to model note events [587]. The features
included mel-frequency cepstral coefficients (MFCC), one energy measure, and
a pitch ratio. The pitch ratio was calculated as
detected note, leading to the constant and the time-adaptive tuning, respectively (see Section 12.4.1). The time-adaptive tuning was reported to work better, with approximately 80% of the notes correctly labelled.
Shih et al. soon modified their system to transcribe singing in two consec-
utive stages of note segmentation and labelling [588]. The note segmentation
was performed with two three-state HMMs, one of which modelled note seg-
ments and the other rests. The note and rest segments were then found using
the Viterbi algorithm. After a note segment was detected, it was labelled with
respect to its previous note by using pitch interval in semitones and the stan-
dard deviation of F0 estimates within the note segment. These attribute values
were modelled with a Gaussian mixture model (GMM) and the interval label
was decided according to the most likely GMM. Some minor improvements
in transcription quality were achieved compared to their previous system. In
addition, the improved system operated in real time.
Musical Key
Fig. 12.12. Pitch class occurrence frequencies in major and minor keys as a function of the distance from the tonic pitch class in semitones (after Krumhansl [377, p. 67]). As an example, the pitch class names are listed below the figure axes for the relative keys C major and A minor.
pitch classes such as C, E, and A often occur in both of these keys, and the
pitch classes which belong to the scale of the key are much more frequent
than the others. To conclude, given the key of a singing performance, we may
determine the probabilities of different pitch classes from a musical point of
view and use this knowledge to solve ambiguous note labellings.
Viitaniemi et al. proposed a key estimation method which produces the probabilities of different keys given a sequence of F0 estimates [643]. The required prior knowledge consists of the occurrence probabilities of different notes n given a key k, P(n | k). As only the pitch class of a note is assumed to affect the mentioned probability, distributions such as those shown in Fig. 12.12 can be used. Then the probability of a key given a note is obtained using Bayes's formula:

P(k | n) = \frac{P(n | k) P(k)}{P(n)} .    (12.8)
Further, the authors used singing performances as training data to estimate the probabilities of different F0 values given a note n, i.e., P(F_0 | n), which was modelled with a GMM. This was to represent the singing pitch deviation from the nominal note pitch, and the F0 estimation errors. Then the probability of a key given an F0 estimate was calculated as

P(k | F_0) = \frac{P(F_0 | k) P(k)}{P(F_0)} , \quad \text{where} \quad P(F_0 | k) = \sum_{\text{all notes } n} P(F_0 | n) P(n | k) .
The method was reported to find the correct relative-key pair in 86% of singing performances [643]. The estimated key probabilities were applied as a musicological model to favour certain notes in singing transcription. At time t, the probability of note n_t given the sequence of F0 estimates up to time t was defined as

P(n_t | F_0(1{:}t)) = \sum_{k=1}^{24} P(n_t | k) P(k | F_0(1{:}t)) ,

where the index k sums over all the 24 major and minor keys. It should be noted that the above equation gives the probability of note n_t at time t from the viewpoint of the musicological model alone, without any special emphasis on the most recent observation F_0(t).
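A sketch of this key estimation under simplifying assumptions: P(F0 | n) is approximated here by a single Gaussian around each nominal note pitch instead of the trained GMM, a uniform key prior is used, and the frames are treated as independent; major_profile and minor_profile stand for the pitch-class distributions of Fig. 12.12:

```python
import numpy as np

def key_posteriors(f0_midi_estimates, major_profile, minor_profile,
                   note_range=(36, 96), f0_std=0.3):
    """Estimate P(k | F0 sequence) for the 24 major and minor keys.

    f0_midi_estimates: voiced-frame F0 estimates as unrounded MIDI numbers.
    major_profile, minor_profile: length-12 pitch-class occurrence frequencies
        relative to the tonic (cf. Fig. 12.12), each normalized to sum to 1.
    """
    notes = np.arange(note_range[0], note_range[1])
    log_post = np.zeros(24)                       # keys 0-11 major, 12-23 minor

    for k in range(24):
        profile = major_profile if k < 12 else minor_profile
        tonic = k % 12
        # P(n | k): pitch-class profile rotated to the key's tonic.
        p_n_given_k = np.array([profile[(n - tonic) % 12] for n in notes])
        p_n_given_k = p_n_given_k / p_n_given_k.sum()

        for f0 in f0_midi_estimates:
            # P(F0 | k) = sum_n P(F0 | n) P(n | k), cf. the text above.
            p_f0_given_n = np.exp(-0.5 * ((f0 - notes) / f0_std) ** 2)
            log_post[k] += np.log(np.dot(p_f0_given_n, p_n_given_k) + 1e-12)

    post = np.exp(log_post - log_post.max())      # uniform prior P(k) assumed
    return post / post.sum()
```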
Note N-Grams
Note N-grams formulate the idea that the probability of a note depends on the previous N - 1 notes. The probability of a note n_t given the previous notes n_{1:t-1} is here denoted by P(n_t | n_{1:t-1}). N-grams are based on the (N-1)-th order Markov assumption, which can be written as

P(n_t | n_{1:t-1}) \approx P(n_t | n_{t-N+1:t-1}) .

The note N-gram probabilities are estimated from databases containing note sequences. In practice, the databases do not include all the possible note sequences, and therefore a method is usually employed to estimate the probabilities of the non-occurring note sequences as well. In general, this process is called smoothing. For different smoothing methods, see [321].
Ryynanen and Klapuri applied note N-gram probabilities for N \in \{2, 3\} under a given key to control the transitions between different note HMMs [558] (see Fig. 12.11). The probability of note n_t at time t was defined as P(n_t | n_{\text{prev}}, k), where k denotes the key and n_{\text{prev}} = n_{t-N+1:t-1} is used to denote the N - 1 previous notes for convenience. The probabilities were estimated by counting the occurrences of different note sequences in a large database of monophonic MIDI files and by smoothing them with the Witten-Bell discounting algorithm [673].
The estimated note N-gram probabilities were applied in singing transcription as follows. First, the major and the minor keys k_maj, k_min of the most probable relative key-pair were determined from the singing performance using the key estimation method of Viitaniemi et al. [643]. Then the probability of moving to note n_t was obtained from the N-gram probabilities conditioned on k_maj and k_min (12.12). If the key information was not available, the probability was given by

P(n_t | n_{\text{prev}}) = \frac{1}{24} \sum_{k=1}^{24} P(n_t | n_{\text{prev}}, k) ,    (12.13)

that is, assuming all the major and minor keys to be equally probable. If the musicological model was completely disabled, equal transition probabilities were used for all note transitions.
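A minimal sketch of estimating note bigram probabilities (N = 2) from monophonic note sequences; add-one smoothing is used here purely for illustration in place of the Witten-Bell discounting of the original system, and the key conditioning is omitted:

```python
import numpy as np

def note_bigram_probabilities(note_sequences, note_range=(36, 80)):
    """Estimate P(n_t | n_{t-1}) from monophonic MIDI note sequences.

    note_sequences: iterable of lists of integer MIDI note numbers.
    Returns a matrix P where P[prev - lo, cur - lo] = P(cur | prev).
    """
    lo, hi = note_range
    size = hi - lo
    counts = np.ones((size, size))          # add-one smoothing (illustrative)

    for seq in note_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            if lo <= prev < hi and lo <= cur < hi:
                counts[prev - lo, cur - lo] += 1

    return counts / counts.sum(axis=1, keepdims=True)
```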
Simulation experiments with the described musicological model showed that the use of note N-grams without the key information did not improve the transcription results compared to the situation where the musicological model was completely disabled. When using the key information with the note N-grams, however, the error rate was reduced from 13% to under 10%, where the note bigrams (N = 2) worked slightly better than the note trigrams (N = 3)
[558], [557].
Metrical Context
Summary
context modelling are still under development. If the musical key or the tempo
of a performance is known in advance, the quality of the transcription is very
likely to be improved. Use of the tonal context for pitch labelling appears to
be more important than the use of the metrical context for note segmenta-
tion. However, the metrical context becomes essential if in addition the note
durations are quantized, as described in Chapter 4.
a velocity parameter which indicates the loudness of the note. To decide the
velocity value of an entire note, the RMS energies of voiced frames can be
averaged and mapped to a velocity value, for example.
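For instance, the mapping could be done as sketched below; the logarithmic scaling, the reference level, and the clipping range are arbitrary illustrative choices:

```python
import numpy as np

def note_velocity(rms_values, voiced_flags, rms_ref=0.1):
    """Map the average RMS energy of the voiced frames of a note to a MIDI velocity.

    rms_values, voiced_flags: per-frame RMS energies and voiced/unvoiced flags
    for the frames belonging to one transcribed note.
    """
    voiced_rms = np.asarray(rms_values)[np.asarray(voiced_flags, dtype=bool)]
    if voiced_rms.size == 0:
        return 0
    # Average level in dB relative to a reference, mapped linearly onto 1-127.
    level_db = 20.0 * np.log10(voiced_rms.mean() / rms_ref + 1e-12)
    return int(round(np.clip(64 + 2 * level_db, 1, 127)))
```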
The note event models which consider different types of time segments
during the singing notes allow different measurements to be made within
the different segments of transcribed notes. The amount of glissando can
be measured during the attack segment, vibrato and tremolo during the
sustain segment, and so forth. The estimation and encoding of expres-
sion parameters improves the quality of resynthesized singing transcrip-
tions considerably. Some examples on expression encoding can be found at
http://www.cs.tut.fi/sgn/axg/matti/demos/monomel.
29. S.D. Bella and I. Peretz. Music agnosias: Selective impairments of music recog-
nition after brain damage. Journal of New Music Research, 28(3):209-216,
1999.
30. J.P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M.B. Sandler.
A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 2005 (in press).
31. J.P. Bello and M. Sandler. Phase-based note onset detection for music signals.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, 2003.
32. L. Benaroya, F. Bimbot, L. McDonagh, and R. Gribonval. Non-negative sparse
representation for Wiener-based source separation with a single sensor. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, 2003.
33. A.L. Berenzweig and D.P.W. Ellis. Locating singing voice segments within
music signals. In IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics, pp. 119-122, New Paltz, USA, October 2001.
34. J. Berger, R. Coifman, and M. Goldberg. Removing noise from music using lo-
cal trigonometric bases and wavelet packets. Journal of the Audio Engineering
Society, 42(10):808-818, 1994.
35. D.P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Massa-
chusetts, 1995.
36. E. Bigand. Contributions of music to research on human auditory cognition.
In S. McAdams and E. Bigand, editors. Thinking in Sound: The Cognitive
Psychology of Human Audition, pp. 231-277. Oxford University Press, 1993.
37. J.A. Bilmes. Timing is of the Essence: Perceptual and Computational Tech-
niques for Representing, Learning, and Reproducing Expressive Timing in Per-
cussive Rhythm. Master's thesis, Massachusetts Institute of Technology, Sep-
tember 1993.
38. J.A. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical report, International Computer Science Institute, Berkeley, USA, 1998.
39. E. Bingham and H. Mannila. Random projection in dimensionality reduction: Applications to image and text data. In The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245-250, San Francisco, California, 2001.
40. C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, England, 1995.
41. S.S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, 1999.
42. A. Blum and P. Langley. Selection of relevant features and examples in machine
learning. Artificial Intelligence, 97(l-2):245-271, 1997.
43. T. Blumensath and M. Davies. Unsupervised learning of sparse and shift-
invariant decompositions of polyphonic music. In IEEE International Con-
ference on Acoustics, Speech, and Signal Processing, Montreal, Canada, 2004.
44. J. Bonada and A. Loscos. Sample-based singing voice synthesizer by spectral
concatenation. In Stockholm Music Acoustics Conference, Stockholm, Sweden,
2003.
45. I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
46. G.C. Bowker and S.L. Star. Sorting Things Out: Classification and Its Conse-
quences. MIT Press, Cambridge, USA, 1999.
47. G.E.P. Box and D.R. Cox. An analysis of transformations. Journal of the Royal
Statistical Society: Series B, 26:296-311, 1964.
48. R. Bracewell. Fourier Transform and Its Applications. McGraw-Hill, 1999.
49. A.S. Bregman. Auditory Scene Analysis. MIT Press, Cambridge, USA, 1990.
50. A.S. Bregman. Constraints on computational models of auditory scene analysis,
as derived from human perception. Journal of the Acoustical Society of Japan
(E), 16(3): 133-136, May 1995.
51. L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
52. G.J. Brown. Computational Auditory Scene Analysis: A Representational Ap-
proach. PhD thesis, University of Sheffield, Sheffield, UK, 1992.
53. G.J. Brown and M. Cooke. Perceptual grouping of musical sounds: A compu-
tational model. Journal of New Music Research, 23(1):107-132, 1994.
54. J.C. Brown. Musical fundamental frequency tracking using a pattern recogni-
tion method. Journal of the Acoustical Society of America, 92(3): 1394-1402,
1992.
55. J.C. Brown. Determination of the meter of musical scores by autocorrelation.
Journal of the Acoustical Society of America, 94(4): 1953-1957, October 1993.
56. J.C. Brown. Computer identification of musical instruments using pattern
recognition with cepstral coefficients as features. Journal of the Acoustical So-
ciety of America, 105:1933-1941, 1999.
57. J.C. Brown, O. Houix, and S. McAdams. Feature dependence in the automatic
identification of musical woodwind instruments. Journal of the Acoustical So-
ciety of America, 109:1064-1072, 2001.
58. J.C. Brown and P. Smaragdis. Independent component analysis for automatic
note extraction from musical trills. Journal of the Acoustical Society of Amer-
ica, 115, May 2004.
59. C.J.C. Burges. A tutorial on support vector machines for pattern recognition.
Data Mining and Knowledge Discovery, 2(2): 121-167, 1998.
60. E.M. Burns. Intervals, scales, and tuning. In Deutsch [144], pp. 215-264.
61. C.S. Burrus. Multiband least square FIR filter design. IEEE Transactions on
Signal Processing, 43(2):412-421, February 1995.
62. P. Cano, M. Koppenberger, S. Le Groux, J. Ricard, P. Herrera, and N. Wack.
Nearest-neighbor automatic sound annotation with a wordnet taxonomy. Jour-
nal of Intelligent Information Systems, 24:99-111, 2005.
63. J.-F. Cardoso. Source separation using higher order moments. In IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Processing, pp. 2109-2112,
Glasgow, UK, 1989.
64. J.-F. Cardoso. Multidimensional independent component analysis. In IEEE
International Conference on Acoustics, Speech, and Signal Processing, Seattle,
USA, 1998.
65. J.-F. Cardoso. High-order contrasts for independent component analysis.
Neural Computation, 11(1), 1999.
66. P. Cariani. Recurrent timing nets for auditory scene analysis. In International
Joint Conference on Neural Networks, Portland, Oregon, July 2003.
67. P.A. Cariani and B. Delgutte. Neural correlates of the pitch of complex tones.
I. Pitch and pitch salience. II. Pitch shift, pitch ambiguity, phase invariance,
pitch circularity, rate pitch, and the dominance region for pitch. Journal of
Neurophysiology, 76(3): 1698-1734, 1996.
68. R.P. Carlyon. Temporal pitch mechanisms in acoustic and electric hearing.
Journal of the Acoustical Society of America, 112(2):621-633, 2002.
69. F. Carreras, M. Leman, and M. Lesaffre. Automatic description of musical sig-
nals using schema-based chord decomposition. Journal of New Music Research,
28(4):310-331, 1999.
70. M.A. Casey. Auditory Group Theory with Applications to Statistical Basis
Methods for Structured Audio. PhD thesis, Massachusetts Institute of Tech-
nology, 1998.
71. M.A. Casey. General sound classification and similarity in MPEG-7. Organized
Sound, 6:153-164, 2001.
72. M.A. Casey. MPEG-7 sound-recognition tools. IEEE Transactions on Circuits
and Systems for Video Technology, 11(6), June 2001.
73. M.A. Casey and A. Westner. Separation of mixed audio sources by indepen-
dent subspace analysis. In International Computer Music Conference, Berlin,
Germany, 2000.
74. A. Cemgil. Bayesian Music Transcription. PhD thesis, Nijmegen University,
2004.
75. A.T. Cemgil, P. Desain, and B. Kappen. Rhythm quantization for transcrip-
tion. Computer Music Journal, 24(2):60-76, 2000.
76. A.T. Cemgil and B. Kappen. Tempo tracking and rhythm quantization by
sequential Monte Carlo. In Neural Information Processing Systems, Vancouver,
British Columbia, Canada, 2001.
77. A.T. Cemgil and B. Kappen. Monte Carlo methods for tempo tracking and
rhythm quantization. Journal of Artificial Intelligence Research, 18:45-81,
2003.
78. A.T. Cemgil, B. Kappen, and D. Barber. A generative model for music tran-
scription. IEEE Transactions on Speech and Audio Processing, 13(6), 2005.
79. A.T. Cemgil, B. Kappen, P. Desain, and H. Honing. On tempo tracking: Tem-
pogram representation and Kalman filtering. Journal of New Music Research,
28(4):259-273, 2001.
80. C. Chafe and D. Jaffe. Source separation and note identification in polyphonic
music. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, pp. 1289-1292, Atlanta, USA, 1986.
81. C. Chafe, J. Kashima, B. Mont-Reynaud, and J. Smith. Techniques for note
identification in polyphonic music. In International Computer Music Confer-
ence, pp. 399-405, Vancouver, Canada, 1985.
82. W. Chai and B. Vercoe. Structural analysis of musical signals for indexing and
thumbnailing. In ACM/IEEE Joint Conference on Digital Libraries, pp. 27-34,
Texas, USA, 2003.
83. G. Charbonneau. Timbre and the perceptual effects of three types of data
reduction. Organized Sound, 5:10-19, 1981.
84. F.J. Charpentier. Pitch detection using the short-term phase spectrum. In
IEEE International Conference on Acoustics, Speech, and Signal Processing,
pp. 113-116, Tokyo, Japan, 1986.
85. G. Chechik, A. Globerson, M.J. Anderson, E.D. Young, I. Nelken, and
N. Tishby. Group redundancy measures reveal redundancy reduction in the au-
ditory pathway. In Neural Information Processing Systems, Vancouver, British
Columbia, Canada, 2001.
86. C.E. Cherry. Some experiments on the recognition of speech, with one and with
two ears. Journal of the Acoustical Society of America, 25(5):975-979, 1953.
106. H. Cottereau, J.-M. Fiasco, C. Doncarli, and M. Davy. Two approaches for the
estimation of time-varying amplitude multichirp signals. In IEEE International
Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China,
April 2003.
107. M.S. Crouse, R.D. Nowak, and R.G. Baraniuk. Wavelet-based signal processing
using hidden Markov models. IEEE Transactions on Signal Processing, 46:886-
902, April 1998. Special issue on filter banks.
108. R. Cusack and R.P. Carlyon. Auditory perceptual organization inside and out-
side the laboratory. In J.G. Neuhoff, editor, Ecological Psychoacoustics, pp. 15-
48. Elsevier Academic Press, 2004.
109. R.B. Dannenberg, W.P. Birmingham, G. Tzanetakis, G. Meek, N. Hu, and
B. Pardo. The MUSART testbed for query-by-humming evaluation. In Inter-
national Conference on Music Information Retrieval, pp. 41-47, Baltimore,
USA, October 2003.
110. R.B. Dannenberg and N. Hu. Discovering musical structure in audio recordings.
In International Conference on Music And Artificial Intelligence, pp. 43-57,
Edinburgh, Scotland, UK, September 2002.
111. R.B. Dannenberg and N. Hu. Pattern discovery techniques for music audio.
Journal of New Music Research, 32(2):153-163, 2003.
112. T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John
Wiley & Sons, second edition, 2003.
113. I. Daubechies. Ten Lectures on Wavelets. SIAM, Philadelphia, PA, 1992.
114. L. Daudet, S. Molla, and B. Torresani. Towards a hybrid audio coder. In Jian-
ping Li, editor, Wavelets and Its Applications, Proceedings of a Conference
Held in Chongqing (China). World Scientific Publishing Company, 2004.
115. L. Daudet and B. Torresani. Hybrid representations for audiophonic signal
encoding. Signal Processing, 82(11):1595-1617, 2002. Special issue on image
and video coding beyond standards.
116. L. Daudet. Sparse and structured decompositions of signals with the
molecular matching pursuit. IEEE Transactions on Acoustics, Speech, and Sig-
nal Processing, 2004. (in press).
117. W.B. Davenport and W.L. Root. An Introduction to the Theory of Random
Signals and Noise. IEEE Press, New York, 1987.
118. M.E. Davies and M.D. Plumbley. Causal tempo tracking of audio. In Interna-
tional Conference on Music Information Retrieval, pp. 164-169, 2004.
119. G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximations. Con-
structive Approximation, 13:57-98, 1997.
120. M. Davy. Bayesian separation of harmonic sources. In Joint Statistical Meeting
of the American Statistical Association, Minneapolis, USA, August 2005.
121. M. Davy, C. Doncarli, and J.Y. Tourneret. Classification of chirp signals using
hierarchical Bayesian learning and MCMC methods. IEEE Transactions on
Signal Processing, 50(2):377-388, 2002.
122. M. Davy and S. Godsill. Bayesian harmonic models for musical signal analy-
sis. In Seventh Valencia International meeting Bayesian statistics 7, Tenerife,
Spain, June 2002.
123. M. Davy and S. Godsill. Detection of abrupt spectral changes using support
vector machines: An application to audio signal segmentation. In IEEE Inter-
national Conference on Acoustics, Speech, and Signal Processing, Volume 2,
pp. 1313-1316, Orlando, USA, May 2002.
124. M. Davy, S. Godsill, and J. Idier. Bayesian analysis of polyphonic western tonal
music. Journal of the Acoustical Society of America, 2005. (in press).
125. M. Davy and S.J. Godsill. Audio information retrieval: A bibliographical
study. Technical Report CUED/F-INFENG/TR.429, Department of Engineer-
ing, University of Cambridge, February 2002.
126. M. Davy and S.J. Godsill. Bayesian harmonic models for musical pitch estima-
tion and analysis. Technical Report CUED/F-INFENG/TR.431, Department
of Engineering, University of Cambridge, Cambridge, UK, April 2002.
127. M. Davy and J. Idier. Fast MCMC computations for the estimation of
sparse processes from noisy observations. In IEEE International Conference
on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004.
128. E. de Boer and H.R. de Jongh. On cochlear encoding: Potentials and limita-
tions of the reverse-correlation technique. Journal of the Acoustical Society of
America, 63(1):115-135, 1978.
129. A. de Cheveigne. Separation of concurrent harmonic sounds: Fundamental fre-
quency estimation and a time-domain cancellation model for auditory process-
ing. Journal of the Acoustical Society of America, 93(6):3271-3290, 1993.
130. A. de Cheveigne. Concurrent vowel identification. III. A neural model of har-
monic interference cancellation. Journal of the Acoustical Society of America,
101(5):2857-2865, 1997.
131. A. de Cheveigne. Cancellation model of pitch perception. Journal of the
Acoustical Society of America, 103(3):1261-1271, 1998.
132. A. de Cheveigne. Pitch perception models. In Plack and Oxenham [522].
133. A. de Cheveigne and H. Kawahara. Multiple period estimation and pitch per-
ception model. Speech Communication, 27:175-185, 1999.
134. A. de Cheveigne and H. Kawahara. Comparative evaluation of F0 estimation
algorithms. In 7th European Conf. Speech Communication and Technology, Aal-
borg, Denmark, 2001.
135. A. de Cheveigne and H. Kawahara. YIN, a fundamental frequency estimator for
speech and music. Journal of the Acoustical Society of America, 111(4):1917-
1930, 2002.
136. T. De Mulder, J.P. Martens, M. Lesaffre, M. Leman, B. De Baets, and
H. De Meyer. An auditory model based transcriber of vocal queries. In Inter-
national Conference on Music Information Retrieval, Baltimore, USA, 2003.
137. T. De Mulder, J.P. Martens, M. Lesaffre, M. Leman, B. De Baets, and
H. De Meyer. Recent improvements of an auditory model based front-end
for the transcription of vocal queries. In IEEE International Conference on
Acoustics, Speech, and Signal Processing, Volume 4, pp. 257-260, Montreal,
Canada, 2004.
138. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from in-
complete data via the EM algorithm. Journal of the Royal Statistical Society:
Series B, 39(1):1-38, 1977.
139. P. Depalle, G. Garcia, and X. Rodet. Tracking of partials for additive sound
synthesis using hidden Markov model. In IEEE International Conference on
Acoustics, Speech, and Signal Processing, Volume 1, pp. 225-228, Minneapolis,
USA, 1993.
140. P. Desain and H. Honing. The quantization of musical time: A connectionist
approach. Computer Music Journal, 13(3):56-66, 1989.
141. P. Desain and H. Honing. Computational models of beat induction: The rule-
based approach. Journal of New Music Research, 28(1):29-42, 1999.
142. P. Desain and H. Honing. The formation of rhythmic categories and metric
priming. Perception, 32(3):241-365, 2003.
143. F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection al-
gorithm. IEEE Transactions on Signal Processing, 2005. (in press).
144. D. Deutsch, editor. The Psychology of Music. Academic Press, San Diego,
California, 2nd edition, 1999.
145. T.G. Dietterich. Approximate statistical tests for comparing supervised classi-
fication learning algorithms. Neural Computation, 10:1895-1923, 1998.
146. T.G. Dietterich. Machine learning for sequential data: A review. In T. Caelli,
A. Amin, R.P.W. Duin, M. Kamel, and D. de Ridder, editors, Structural, Syn-
tactic, and Statistical Pattern Recognition, Volume 2396 of Lecture Notes in
Computer Science, pp. 15-30. Springer-Verlag, 2002.
147. C. Dittmar and C. Uhle. Further steps towards drum transcription of poly-
phonic music. In Audio Engineering Society 116th Convention, Berlin, Ger-
many, May 2004.
148. S. Dixon. Automatic extraction of tempo and beat from expressive perfor-
mances. Journal of New Music Research, 30(1):39-58, 2001.
149. S. Dixon, F. Gouyon, and G. Widmer. Towards characterisation of music
via rhythmic patterns. In International Conference on Music Information Re-
trieval, pp. 509-516, Barcelona, Spain, 2004.
150. M. Dorfler. Gabor Analysis for a Class of Signals called Music. PhD thesis,
NuHAG, University of Vienna, 2002.
151. D. Dorran and R. Lawlor. An efficient audio time-scale modification algorithm
for use in a sub-band implementation. In International Conference on Digital
Audio Effects, pp. 339-343, 2003.
152. A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in
Practice. Springer, New York, USA, 2001.
153. B. Doval and X. Rodet. Estimation of fundamental frequency of musical sound
signals. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, pp. 3657-3660, Toronto, Canada, 1991.
154. J.S. Downie, J. Futrelle, and D. Tcheng. The international music information
systems evaluation laboratory: Governance, access and security. In Interna-
tional Conference on Music Information Retrieval, Barcelona, Spain, 2004.
155. C. Drake, A. Penel, and E. Bigand. Tapping in time with mechanical and
expressively performed music. Music Perception, 18(1):1-23, 2000.
156. J. Drish. Obtaining calibrated probability estimates from support vector ma-
chines. Technical report, University of California, San Diego, California, USA,
June 2001.
157. S. Dubnov. Extracting sound objects by independent subspace analysis. In
22nd International Audio Engineering Society Conference, Espoo, Finland,
June 2002.
158. C. Dubois and M. Davy. Harmonic tracking in spectrograms. IEEE Transac-
tions on Signal Processing, 2005. Submitted.
159. C. Dubois and M. Davy. Harmonic tracking using sequential Monte Carlo. In
IEEE Statistical Signal Processing Workshop, Bordeaux, France, July 2005.
160. C. Dubois and M. Davy. Suivi de trajectoires temps-frequence par filtrage par-
ticulaire. In 20th GRETSI colloquium, Louvain-La-Neuve, Belgium, September
2005. (in French).
161. R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley, New
York, USA, second edition, 2001.
162. C. Duxbury, J.P. Bello, M. Davies, and M. Sandler. A combined phase and
amplitude based approach to onset detection for audio segmentation. In 4th
European Workshop on Image Analysis for Multimedia Interactive Services,
London, UK, 2003.
163. C. Duxbury, J.P. Bello, M. Davies, and M. Sandler. Complex domain onset
detection for musical signals. In International Conference on Digital Audio
Effects, London, UK, 2003.
164. C. Duxbury, M. Sandler, and M.E. Davies. A hybrid approach to musical note
onset detection. In International Conference on Digital Audio Effects, Ham-
burg, Germany, September 2002.
165. D. Eck. A positive-evidence model for classifying rhythmical patterns. Techni-
cal Report IDSIA-09-00, Dalle Molle Institute for Artificial Intelligence, 2000.
166. D. Eck. A network of relaxation oscillators that finds downbeats in rhythms.
Technical Report IDSIA-06-01, Dalle Molle Institute for Artificial Intelligence,
2001.
167. J. Eggink and G.J. Brown. Application of missing feature theory to the recog-
nition of musical instruments in polyphonic audio. In International Conference
on Music Information Retrieval, pp. 125-131, Baltimore, USA, 2003.
168. J. Eggink and G.J. Brown. Application of missing feature theory to the recog-
nition of musical instruments in polyphonic audio. In IEEE International Con-
ference on Acoustics, Speech, and Signal Processing, pp. 553-556, Hong Kong,
China, 2003.
169. J. Eggink and G.J. Brown. Extracting melody lines from complex audio. In In-
ternational Conference on Music Information Retrieval, pp. 84-91, Barcelona,
Spain, October 2004.
170. J. Eggink and G.J. Brown. Instrument recognition in accompanied sonatas and
concertos. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, pp. 217-220, Montreal, Canada, 2004.
171. D.P.W. Ellis. Prediction-Driven Computational Auditory Scene Analysis. PhD
thesis, Massachusetts Institute of Technology, 1996.
172. D.P.W. Ellis. Using knowledge to organize sound: The prediction-driven
approach to computational auditory scene analysis, and its application to
speech/nonspeech mixtures. Speech Communication, 27:281-298, 1999.
173. D.P.W. Ellis and D.F. Rosenthal. Mid-level representations for computational
auditory scene analysis. In International Joint Conference on Artificial Intel-
ligence, Montreal, Quebec, 1995.
174. A. Eronen. Automatic musical instrument recognition. Master's thesis, Tam-
pere University of Technology, 2001.
175. A. Eronen. Musical instrument recognition using ICA-based transform of fea-
tures and discriminatively trained HMMs. In Seventh International Symposium
on Signal Processing and its Applications, pp. 133-136, Paris, France, 2003.
176. A. Eronen and A. Klapuri. Musical instrument recognition using cepstral coef-
ficients and temporal features. In IEEE International Conference on Acoustics,
Speech, and Signal Processing, pp. 753-756, Istanbul, Turkey, 2000.
177. S. Essid, G. Richard, and B. David. Musical instrument recognition based on
class pairwise feature selection. In International Conference on Music Infor-
mation Retrieval, Barcelona, Spain, 2004.
178. S. Essid, G. Richard, and B. David. Musical instrument recognition on solo
performance. In European Signal Processing Conference, Vienna, Austria, 2004.
220. M. Goto. A Study of Real-Time Beat Tracking for Musical Audio Signals. PhD
thesis, Waseda University, 1998.
221. M. Goto. An audio-based real-time beat tracking system for music with or
without drum-sounds. Journal of New Music Research, 30(2):159-171, 2001.
222. M. Goto. A predominant-F0 estimation method for CD recordings: MAP esti-
mation using EM algorithm for adaptive tone models. In IEEE International
Conference on Acoustics, Speech, and Signal Processing, Volume 5, pp. 3365-
3368, Salt Lake City, USA, May 2001.
223. M. Goto. A predominant-F0 estimation method for real-world musical audio
signals: MAP estimation for incorporating prior knowledge about F0s and tone
models. In Proc. Workshop on Consistent and Reliable Acoustic Cues for Sound
Analysis, Aalborg, Denmark, 2001.
224. M. Goto. A chorus-section detecting method for musical audio signals. In
IEEE International Conference on Acoustics, Speech, and Signal Processing,
Volume 5, pp. 437-440, Hong Kong, China, April 2003.
225. M. Goto. Music scene description project: Toward audio-based real-time music
understanding. In International Conference on Music Information Retrieval,
pp. 231-232, Baltimore, USA, October 2003.
226. M. Goto. SmartMusicKIOSK: Music listening station with chorus-search func-
tion. In ACM Symposium on User Interface Software and Technology, pp. 31-
40, Vancouver, British Columbia, Canada, 2003.
227. M. Goto. Development of the RWC music database. In the 18th International
Congress on Acoustics, Volume 1, pp. 553-556, Kyoto, Japan, 2004.
228. M. Goto. A real-time music scene description system: Predominant-F0 esti-
mation for detecting melody and bass lines in real-world audio signals. Speech
Communication, 43(4):311-329, 2004.
229. M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database:
Popular, classical, and jazz music databases. In International Conference on
Music Information Retrieval, pp. 287-288, Paris, France, October 2002.
230. M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC music database:
Music genre database and musical instrument sound database. In International
Conference on Music Information Retrieval, pp. 229-230, Baltimore, USA,
October 2003.
231. M. Goto and S. Hayamizu. A real-time music scene description system: Detect-
ing melody and bass lines in audio signals. In International Joint Conference
on Artificial Intelligence, pp. 31-40, Stockholm, Sweden, 1999.
232. M. Goto and S. Hayamizu. A real-time music scene description system: Detect-
ing melody and bass lines in audio signals. In Working Notes of the IJCAI-99
Workshop on Computational Auditory Scene Analysis, pp. 31-40, Stockholm,
Sweden, 1999.
233. M. Goto and K. Hirata. Invited review "Recent studies on music information
processing". Acoustical Science and Technology (edited by the Acoustical So-
ciety of Japan), 25(6):419-425, November 2004.
234. M. Goto and H. Muraoka. Issues in evaluating beat tracking systems. In Work-
ing Notes of the IJCAI-97 Workshop on Issues in AI and Music, pp. 9-17,
August 1997.
235. M. Goto and Y. Muraoka. A beat tracking system for acoustic signals of music.
In ACM International Conference on Multimedia, pp. 365-372, San Francisco,
California, October 1994.
236. M. Goto and Y. Muraoka. A sound source separation system for percussion in-
struments. Transactions of the Institute of Electronics, Information and Com-
munication Engineers D-II, J77-D-II(5):901-911, May 1994. (in Japanese)
237. M. Goto and Y. Muraoka. Music understanding at the beat level: Real-time
beat tracking for audio signals. In International Joint Conference on Artificial
Intelligence, pp. 68-75, Montreal, Quebec, 1995.
238. M. Goto and Y. Muraoka. A real-time beat tracking system for audio signals.
In International Computer Music Conference, Tokyo, Japan, 1995.
239. M. Goto and Y. Muraoka. Real-time rhythm tracking for drumless audio sig-
nals: Chord change detection for musical decisions. In Working Notes of the
IJCAI-97 Workshop on Computational Auditory Scene Analysis, pp. 135-144,
Nagoya, Japan, 1997.
240. M. Goto and Y. Muraoka. Real-time beat tracking for drumless audio signals:
Chord change detection for musical decisions. Speech Communication, 27(3-
4):311-335, 1999.
241. M. Goto, M. Tabuchi, and Y. Muraoka. An automatic transcription system
for percussion instruments. In the 46th National Convention of Information
Processing Society of Japan, 7Q-2, Tokyo, Japan, March 1993. (in Japanese)
242. F. Gouyon. Towards Automatic Rhythm Description of Musical Audio Sig-
nals: Representations, Computational Models and Applications. Master's the-
sis, UPF, Barcelona, 2003.
243. F. Gouyon and S. Dixon. A review of automatic rhythm description systems.
Computer Music Journal, 29(1), 2005.
244. F. Gouyon, L. Fabig, and J. Bonada. Rhythmic expressiveness transformations
of audio recordings: swing modifications. In International Conference on Digital
Audio Effects, 2003.
245. F. Gouyon and P. Herrera. Exploration of techniques for automatic labeling of
audio drum tracks' instruments. In MOSART Workshop on Current Research
Directions in Computer Music, Barcelona, Spain, 2001.
246. F. Gouyon and P. Herrera. Determination of the meter of musical audio signals:
Seeking recurrences in beat segment descriptors. In Audio Engineering Society
114th Convention, Amsterdam, Netherlands, March 2003.
247. F. Gouyon, P. Herrera, and P. Cano. Pulse-dependent analysis of percussive
music. In 22nd International Audio Engineering Society Conference, Espoo,
Finland, 2002.
248. F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and
P. Cano. An experimental comparison of audio tempo induction algorithms.
IEEE Transactions on Speech and Audio Processing, 2005. (in press).
249. F. Gouyon and B. Meudic. Towards rhythmic content processing of musical
signals: Fostering complementary approaches. Journal of New Music Research,
32(1), 2003.
250. F. Gouyon, F. Pachet, and O. Delerue. On the use of zero-crossing rate for an
application of classification of percussive sounds. In International Conference
on Digital Audio Effects, Verona, Italy, December 2000.
251. P.J. Green. On use of the EM algorithm for penalized likelihood estimation.
Journal of the Royal Statistical Society: Series B, 52:443-452, 1990.
252. P.J. Green. Penalized likelihood. In Encyclopaedia of Statistical Sciences, Vol-
ume 3, pp. 578-586, 1999.
253. J.M. Grey. Multidimensional perceptual scaling of musical timbres. Journal of
the Acoustical Society of America, 61:1270-1277, 1977.
254. J.M. Grey and J.A. Moorer. Perceptual evaluations of synthesized musical
instrument tones. Journal of the Acoustical Society of America, 62:454-462,
1977.
255. R. Gribonval. Approximations Non-lineaires pour l'Analyse des Signaux
Sonores. PhD thesis, Universite de Paris IX Dauphine, 1999.
256. R. Gribonval and E. Bacry. Harmonic decomposition of audio signals with
matching pursuit. IEEE Transactions on Signal Processing, 51(1):101-111,
2003.
257. R. Gribonval, E. Bacry, S. Mallat, Ph. Depalle, and X. Rodet. Analysis of
sound signals with high resolution matching pursuit. In IEEE Symposium on
Time-Frequency and Time-Scale Analysis, pp. 125-128, Paris, France, June
1996.
258. R. Gribonval, L. Benaroya, E. Vincent, and C. Fevotte. Proposals for perfor-
mance measurement in source separation. In the 4th International Symposium
on Independent Component Analysis and Blind Signal Separation, Nara, Japan,
2003.
259. D. Griffin and J. Lim. Signal estimation from modified short-time Fourier trans-
form. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32:236-
242, 1984.
260. T.D. Griffiths, J.D. Warren, S.K. Scott, I. Nelken, and A.J. King. Cortical
processing of complex sound: A way forward? Trends in Neurosciences, 27:181-
185, 2004.
261. K. Grochenig. Foundations of Time-frequency Analysis. Birkhauser, Boston,
MA, 2001.
262. I. Guyon and A. Elisseeff. An introduction to variable and feature selection.
Journal of Machine Learning Research, 3:1157-1182, 2003.
263. S.W. Hainsworth. Techniques for the Automated Analysis of Musical Audio.
PhD thesis, Department of Engineering, University of Cambridge, 2004.
264. S.W. Hainsworth and M.D. Macleod. Automatic bass line transcription from
polyphonic music. In International Computer Music Conference, pp. 431-434,
Havana, Cuba, 2001.
265. S.W. Hainsworth and M.D. Macleod. Onset detection in musical audio signals.
In International Computer Music Conference, Singapore, 2003.
266. S.W. Hainsworth and M.D. Macleod. Particle filtering applied to musical tempo
tracking. Journal of Applied Signal Processing, 15:2385-2395, 2004.
267. M.A. Hall. Correlation-Based Feature Selection for Machine Learning. PhD
thesis, Department of Computer Science, University of Waikato, Hamilton,
New Zealand, 1998.
268. M.A. Hall. Correlation-based feature selection for discrete and numeric class
machine learning. In Seventeenth International Conference on Machine Learn-
ing, Stanford, CA, USA, 2000.
269. K.N. Hamdy, A. Ali, and A.H. Tewfik. Low bit rate high quality audio coding
with combined harmonic and wavelet representations. In IEEE International
Conference on Acoustics, Speech, and Signal Processing, Volume 2, pp. 1045-
1048, Atlanta, USA, 1996.
270. S. Handel. Listening: An Introduction to the Perception of Auditory Events.
MIT Press, 1989.
271. S. Handel. Timbre perception and auditory object identification. In Moore
[474], pp. 425-460.
289. W.J. Hess. Pitch Determination of Speech Signals. Springer, Berlin Heidelberg,
1983.
290. W.J. Hess. Pitch and voicing determination. In Furui and Sondhi [203], pp. 3-48.
291. M.J. Hewitt and R. Meddis. An evaluation of eight computer models of mam-
malian inner hair-cell function. Journal of the Acoustical Society of America,
90(2):904-917, 1991.
292. T.W. Hilands and S.C.A. Thomopoulos. Nonlinear filtering methods for har-
monic retrieval and model order selection in Gaussian and non-Gaussian noise.
IEEE Transactions on Signal Processing, 45(4):982-995, April 1997.
293. J. Holland. Practical Percussion: A Guide to the Instruments and Their
Sources. Oxford University Press, 2001.
294. H. Honing. From time to time: The representation of timing and tempo. Com-
puter Music Journal, 25(3):50-61, 2001.
295. R.A. Horn and C.R. Johnson. Topics in Matrix Analysis. Cambridge University
Press, Cambridge, UK, 1994.
296. E. V. Hornbostel and C. Sachs. The classification of musical instruments. Galpin
Society Journal, pp. 3-29, 1961.
297. A.J.M. Houtsma. Pitch perception. In Moore [474], pp. 267-295.
298. A.J.M. Houtsma and J.L. Goldstein. The central origin of the pitch of complex
tones: Evidence from musical interval recognition. Journal of the Acoustical
Society of America, 51(2):520-529, 1972.
299. P. Hoyer. Non-negative sparse coding. In IEEE Workshop on Neural Networks for
Signal Processing XII, Martigny, Switzerland, 2002.
300. P.O. Hoyer. Non-negative matrix factorization with sparseness constraints.
Journal of Machine Learning Research, 5:1457-1469, 2004.
301. N. Hu and R.B. Dannenberg. A comparison of melodic database retrieval
techniques using sung queries. In ACM/IEEE Joint Conference on Digital Li-
braries, pp. 301-307, Oregon, USA, 2002.
302. A. Hyvarinen. Fast and robust fixed-point algorithms for independent compo-
nent analysis. IEEE Transactions on Neural Networks, 10(3):626-634, 1999.
303. A. Hyvarinen and P. Hoyer. Emergence of phase and shift invariant features
by decomposition of natural images into independent feature subspaces. Neural
Computation, 12(7):1705-1720, 2000.
304. A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis.
John Wiley & Sons, 2001.
305. Vamtech Enterprises Inc. Drumtrax 3.0. Buffalo, New York, USA, 1999.
306. H. Indefrey, W. Hess, and G. Seeser. Design and evaluation of double-transform
pitch determination algorithms with nonlinear distortion in the frequency
domain: preliminary results. In IEEE International Conference on Acoustics,
Speech, and Signal Processing, pp. 415-418, Tampa, Florida, 1985.
307. International Organization for Standardization. ISO/IEC 15938-4:2002 In-
formation Technology - Multimedia Content Description Interface - Part 4:
Audio. International Organization for Standardization, Geneva, Switzerland,
2002.
308. R. Irizarry. Local harmonic estimation in musical sound signals. Journal of the
American Statistical Association, 96(454):357-367, June 2001.
309. R. Irizarry. Weighted estimation of harmonic components in a musical sound
signal. Journal of Time Series Analysis, 23(1):29-48, 2002.
310. R.A. Irizarry. Statistics and Music: Fitting a Local Harmonic Model to Musical
Sound Signals. PhD thesis, University of California, Berkeley, 1998.
311. F. Jaillet. Representations et Traitement Temps-frequence des Signaux Au-
dionumeriques pour des Applications de Design Sonore. PhD thesis, LATP
and Universite de Provence, Marseille, 2005.
312. F. Jaillet and B. Torresani. Time-frequency jigsaw puzzle: Adaptive multiwin-
dow and multilayered Gabor expansions. Technical report, LATP and Univer-
site de Provence, Marseille, 2004. (submitted)
313. A.K. Jain, R.P.W. Duin, and J. Mao. Statistical pattern recognition: A review.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:4-37,
2000.
314. G.-J. Jang and T.-W. Lee. A maximum likelihood approach to single channel
source separation. Journal of Machine Learning Research, 23:1365-1392, 2003.
315. H. Jarvelainen, V. Valimaki, and M. Karjalainen. Audibility of the timbral ef-
fects of inharmonicity in stringed instrument tones. Acoustics Research Letters
Online, 2(3):79-84, 2001.
316. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge,
Massachusetts, 1997.
317. F. Jelinek and R.L. Mercer. Interpolated estimation of Markov source para-
meters from sparse data. In International Workshop on Pattern Recognition in
Practice, Amsterdam, The Netherlands, May 1980.
318. K. Jensen and T.H. Andersen. Beat estimation on the beat. In IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA,
2003.
319. K. Jensen and J. Arnspang. Binary decision tree classification of musical
sounds. In International Computer Music Conference, Beijing, China, 1999.
320. I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, USA,
1986.
321. D. Jurafsky and J.H. Martin. Speech and Language Processing. Prentice Hall,
New Jersey, USA, 2000.
322. C. Kaernbach and L. Demany. Psychophysical evidence against the autocorre-
lation theory of auditory temporal processing. Journal of the Acoustical Society
of America, 104(4):2298-2306, 1998.
323. T. Kageyama, K. Mochizuki, and Y. Takashima. Melody retrieval with hum-
ming. In International Computer Music Conference, pp. 349-351, Tokyo,
Japan, 1993.
324. H. Kameoka, T. Nishimoto, and S. Sagayama. Separation of harmonic struc-
tures based on tied Gaussian mixture model and information criterion for con-
current sounds. In IEEE International Conference on Acoustics, Speech, and
Signal Processing, Montreal, Canada, 2004.
325. I. Kaminskyj and T. Czaszejko. Automatic recognition of isolated monophonic
musical instrument sounds using KNNC. Journal of Intelligent Information
Systems, 24:199-221, 2005.
326. A. Kapur, M. Benning, and G. Tzanetakis. Query-by-beat-boxing: Music re-
trieval for the DJ. In International Conference on Music Information Retrieval,
pp. 170-177, Barcelona, Spain, October 2004.
327. M. Karjalainen and T. Tolonen. Multi-pitch and periodicity analysis model for
sound separation and auditory scene analysis. In IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing, Phoenix, USA, 1999.
364. B. Kostek and A. Czyzewski. Representing musical instrument sounds for their
automatic classification. Journal of the Audio Engineering Society, 49:768-785,
2001.
365. B. Kostek, M. Dziubinski, and P. Zwan. Further developments of methods for
searching optimum musical and rhythmic feature vectors. In Audio Engineering
Society 21st International Conference, St. Petersburg, Russia, 2002.
366. B. Kostek and R. Krolikowski. Application of artificial neural networks to the
recognition of musical sounds. Archives of Acoustics, 22:27-50, 1997.
367. B. Kostek, P. Szczuko, P. Zwan, and P. Dalka. Processing of musical data
employing rough sets and artificial neural networks. In Transactions on Rough
Sets III, pp. 112-133, 2005.
368. B. Kostek and A. Wieczorkowska. Study of parameter relations in musical
instrument patterns. In Audio Engineering Society 100th Convention, Copen-
hagen, Denmark, 1996.
369. B. Kostek and A. Wieczorkowska. Parametric representation of musical sounds.
Archives of Acoustics, 22:3-26, 1997.
370. B. Kostek and P. Zwan. Wavelet-based automatic recognition of musical instru-
ment classes. In 142nd Meeting of the Acoustical Society of America, Melville,
New York, 2001.
371. B. Kostek, P. Zwan, and M. Dziubinski. Statistical analysis of musical sound
features derived from wavelet representation. In Audio Engineering Society
112th Convention, Munich, Germany, 2002.
372. J.R. Koza. Genetic Programming: On the Programming of Computers by Means
of Natural Selection. MIT Press, 1992.
373. K. Kreutz-Delgado, J.F. Murray, B.D. Rao, K. Engan, T. Lee, and T.J. Se-
jnowski. Dictionary learning algorithms for sparse representation. Neural Com-
putation, 15:349-396, 2003.
374. J. Krimphoff, S. McAdams, and S. Winsberg. Caracterisation du timbre des
sons complexes. II: Analyses acoustiques et quantification psychophysique.
Journal de Physique, 4:625-628, 1994.
375. A. Krishna and T. Sreenivas. Music instrument recognition: From isolated notes
to solo phrases. In IEEE International Conference on Acoustics, Speech, and
Signal Processing, Montreal, Canada, 2004.
376. S. Krstulovic, R. Gribonval, P. Leveau, and L. Daudet. A comparison of two
extensions of the matching pursuit algorithm for the harmonic decomposition
of sounds. In IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics, New Paltz, USA, 2005.
377. C. Krumhansl. Cognitive Foundations of Musical Pitch. Oxford University
Press, 1990.
378. C.L. Krumhansl. Why is musical timbre so hard to understand? In
S. Nielzen and O. Olsson, editors, Structure and Perception of Electroa-
coustic Sound and Music, pp. 43-53. Elsevier Academic Press, Amsterdam,
1989.
379. L. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley,
2004.
380. N. Kunieda, T. Shimamura, and J. Suzuki. Robust method of measurement of
fundamental frequency by ACLOS: autocorrelation of log spectrum. In IEEE
International Conference on Acoustics, Speech, and Signal Processing, pp. 232-
235, Atlanta, USA, 1996.
381. J. Tin-Yau Kwok. Moderating the outputs of support vector machine classifiers.
IEEE Transactions on Neural Networks, 10(5):1018-1031, September 1999.
382. T.I. Laakso, V. Valimaki, M. Karjalainen, and U.K. Laine. Splitting the unit
delay: Tools for fractional delay filter design. IEEE Signal Processing Magazine,
13(1):30-60, 1996.
383. M. Lagrange, S. Marchand, and J.B. Rault. Using linear prediction to en-
hance the tracking of partials. In IEEE International Conference on Acoustics,
Speech, and Signal Processing, Montreal, Canada, 2004.
384. M. Lahat, R. Niederjohn, and D.A. Krubsack. Spectral autocorrelation method
for measurement of the fundamental frequency of noise-corrupted speech. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 6:741-750, June
1987.
385. S. Lakatos. A common perceptual space for harmonic and percussive timbres.
Perception and Psychophysics, 62:1426-1439, 1994.
386. T.L. Lam. Beat Tracking. Master's thesis, Department of Engineering, Univer-
sity of Cambridge, 2003.
387. D. Lang and N. de Freitas. Beat tracking the graphical model way. In Neural
Information Processing Systems, Vancouver, Canada, 2004.
388. K. Lange. Numerical Analysis for Statisticians. Springer, New York, USA,
1999.
389. E.W. Large. Dynamic Representation of Musical Structure. PhD thesis, Ohio
State Univ., 1994.
390. E.W. Large. Beat tracking with a nonlinear oscillator. In International Joint
Conference on Artificial Intelligence, pp. 24-31, Stockholm, Sweden, 1995.
391. E.W. Large and J.F. Kolen. Resonance and the perception of musical meter.
Connection Science, 6(1):177-208, 1994.
392. J. Laroche. Estimating tempo, swing and beat locations in audio recordings. In
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics,
pp. 135-138, 2001.
393. J. Laroche. Efficient tempo and beat tracking in audio recordings. Journal of
the Audio Engineering Society, 51(4):226-233, April 2003.
394. J. Laroche, Y. Stylianou, and E. Moulines. HNS: Speech modification based
on a harmonic + noise model. In IEEE International Conference on Acoustics,
Speech, and Signal Processing, Volume 2, pp. 550-553, Minneapolis, USA, April
1993.
395. H. Laurent and C. Doncarli. Stationarity index for abrupt changes detection
in the time-frequency plane. IEEE Signal Processing Letters, 5(2):43-45, 1998.
396. S. Lauritzen and D. Spiegelhalter. Local computations with probabilities on
graphical structures and their application to expert systems. Journal of the
Royal Statistical Society: Series B, 50(2):157-224, 1988.
397. C.L. Lawson and R.J. Hanson. Solving Least Squares Problems. Prentice Hall,
Englewood Cliffs, New Jersey, 1974.
398. C.S. Lee. The perception of metrical structure: Experimental evidence and
a model. In P. Howell, R. West, and I. Cross, editors, Representing Musical
Structure. Academic Press, London, 1991.
399. D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix
factorization. Nature, 401:788-791, October 1999.
400. D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. In
Neural Information Processing Systems, pp. 556-562, Denver, USA, 2001.
401. T. Lee and R. Orglmeister. A contextual blind separation of delayed and con-
volved sources. In IEEE International Conference on Acoustics, Speech, and
Signal Processing, pp. 1199-1202, Munich, Germany, 1997.
402. M. Leman. Music and Schema Theory. Springer, Heidelberg, 1995.
403. P. Lepain. Polyphonic pitch extraction from musical signals. Journal of New
Music Research, 28(4):296-309, 1999.
404. F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press,
1983.
405. M. Lesaffre, M. Leman, B. De Baets, and J.-P. Martens. Methodological con-
siderations concerning manual annotation of musical audio in function of al-
gorithm development. In International Conference on Music Information Re-
trieval, Barcelona, Spain, 2004.
406. M. Lesaffre, M. Leman, K. Tanghe, B. De Baets, H. De Meyer, and
J.P. Martens. User-dependent taxonomy of musical features as a conceptual
framework for musical audio-mining technology. In Stockholm Music Acoustics
Conference, Stockholm, Sweden, 2003.
407. V. Lesser, S.H. Nawab, I. Gallastegi, and F. Klassner. IPUS: An architecture for
integrated signal processing and signal interpretation in complex environments.
In 11th National Conference on Artificial Intelligence, pp. 249-255, 1993.
408. S. Levine. Audio Representations for Data Compression and Compressed Do-
main Processing. PhD thesis, Center for Computer Research in Music and
Acoustics, Stanford University, 1998.
409. J.C.R. Licklider. A duplex theory of pitch perception. Experientia, 7:128-133,
1951. Reproduced in Schubert, E.D. (Ed.): Psychological Acoustics (Benchmark
papers in acoustics, vol. 13), Stroudsburg, Pennsylvania: Dowden, Hutchinson
& Ross, Inc., pp. 155-160.
410. T.M. Little and F.J. Hills. Statistical Methods in Agricultural Research. Uni-
versity of California Press, 1972.
411. B. Liu, Y. Wu, and Y. Li. Linear hidden Markov model for music information
retrieval based on humming. In IEEE International Conference on Acoustics,
Speech, and Signal Processing, Volume 5, pp. 533-536, Hong Kong, China,
2003.
412. M. Liu and C. Wan. Feature selection for automatic classification of musi-
cal instrument sounds. In ACM/IEEE Joint Conference on Digital Libraries,
Roanoke, VA, USA, 2001.
413. A. Livshin, G. Peeters, and X. Rodet. Studies and improvements in automatic
classification of musical sound samples. In International Computer Music Con-
ference, Singapore, 2001.
414. A. Livshin and X. Rodet. The importance of cross database evaluation in
sound classification. In International Conference on Music Information Re-
trieval, Baltimore, USA, 2003.
415. A. Livshin and X. Rodet. Instrument recognition beyond separate notes: In-
dexing continuous recordings. In International Computer Music Conference,
Miami, Florida, USA, 2004.
416. A. Livshin and X. Rodet. Musical instrument identification in continuous
recordings. In International Conference on Digital Audio Effects, Naples, Italy,
2004.
417. B. Logan and S. Chu. Music summarization using key phrases. In IEEE In-
ternational Conference on Acoustics, Speech, and Signal Processing, Volume 2,
pp. 749-752, Istanbul, Turkey, 2000.
418. H.C. Longuet-Higgins and C.S. Lee. The perception of musical rhythms. Per-
ception, 11(2):115-128, 1982.
419. M.A. Loureiro, H.B. de Paula, and H.C. Yehia. Timbre classification of a sin-
gle musical instrument. In International Conference on Music Information Re-
trieval, Barcelona, Spain, 2004.
420. L. Lu, M. Wang, and H.-J. Zhang. Repeating pattern discovery and structure
analysis from acoustic music data. In ACM SIGMM International Workshop
on Multimedia Information Retrieval, pp. 275-282, 2004.
421. L. Lu, H. You, and H.-J. Zhang. A new approach to query by humming in
music retrieval. In IEEE International Conference on Multimedia and Expo,
pp. 776-779, Tokyo, Japan, 2001.
422. R.F. Lyon. Computational models of neural auditory processing. In IEEE In-
ternational Conference on Acoustics, Speech, and Signal Processing, pp. 36.1.1-
36.1.4, San Diego, California, 1984.
423. A. Madevska-Bogdanova and D. Nikolic. A geometrical modification of SVM
outputs for pattern recognition. In 22nd International Conference on Informa-
tion Technology Interfaces, pp. 165-170, Pula, Croatia, June 2000.
424. A. Madevska-Bogdanova and D. Nikolic. A new approach of modifying SVM
outputs. In International Joint Conference on Neural Networks, Volume 6,
pp. 395-398, Como, Italy, July 2000.
425. R. Maher and J. Beauchamp. An investigation of vocal vibrato for synthesis.
Applied Acoustics, 30:219-245, 1990.
426. R.C. Maher. An Approach for the Separation of Voices in Composite Music
Signals. PhD thesis, Univ. of Illinois, Urbana, 1989.
427. R.C. Maher. Evaluation of a method for separating digitized duet signals. Jour-
nal of the Audio Engineering Society, 38(12):956-979, 1990.
428. R.C. Maher and J.W. Beauchamp. Fundamental frequency estimation of mu-
sical signals using a two-way mismatch procedure. Journal of the Acoustical
Society of America, 95(4):2254-2263, April 1994.
429. S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.
430. S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries.
IEEE Transactions on Signal Processing, 41:3397-3415, 1993.
431. H.S. Malvar. Signal Processing with Lapped Transforms. Artech House, Nor-
wood, MA, 1992.
432. B.S. Manjunath, P. Salembier, and T. Sikora. Introduction to MPEG-7: Mul-
timedia Content Description Language. John Wiley & Sons, 2002.
433. M. Marolt. SONIC: Transcription of polyphonic piano music with neural net-
works. In MOSART Workshop on Current Research Directions in Computer
Music, Barcelona, Spain, November 2001.
434. M. Marolt. A connectionist approach to transcription of polyphonic piano mu-
sic. IEEE Transactions on Multimedia, 6(3):439-449, 2004. URL: lgm.fri.uni-
lj.si/SONIC.
435. M. Marolt. Gaussian mixture models for extraction of melodic lines from au-
dio recordings. In International Conference on Music Information Retrieval,
Barcelona, Spain, October 2004.
436. M. Marolt. On finding melodic lines in audio recordings. In International Con-
ference on Digital Audio Effects, Naples, Italy, 2004.
437. J. Marques and P.J. Moreno. A study of musical instrument classification using
Gaussian mixture models and support vector machines. Technical Report CRL
99/4, Compaq, 1999.
496. T.H. Park and P. Cook. Nearest centroid error clustering for radial/elliptical
basis function neural networks in timbre classification. In International Com-
puter Music Conference, pp. 833-866, Barcelona, Spain, 1998.
497. R. Parncutt. A perceptual model of pulse salience and metrical accent in mu-
sical rhythms. Music Perception, 11(4):409-464, 1994.
498. T.W. Parsons. Separation of speech from interfering speech by means of har-
monic selection. Journal of the Acoustical Society of America, 60(4), 1976.
499. R.D. Patterson. Auditory filter shapes derived with noise stimuli. Journal of
the Acoustical Society of America, 59(3):640-654, 1976.
500. R.D. Patterson. Auditory images: How complex sounds are represented in the
auditory system. Journal of the Acoustical Society of Japan (E), 21(4):183-190,
2000.
501. R.D. Patterson and M.H. Allerhand. Time-domain modeling of peripheral audi-
tory processing: A modular architecture and a software platform. Journal of the
Acoustical Society of America, 98(4):1890-1894, 1995. URL: http://www.mrc-
cbu.cam.ac.uk/cnbh/web2002/bodyframes/AIM.htm.
502. R.D. Patterson and J. Holdsworth. A functional model of neural activity pat-
terns and auditory images. In Ainsworth [13], pp. 551-567.
503. J. Paulus and A. Klapuri. Measuring the similarity of rhythmic patterns. In
International Conference on Music Information Retrieval, Paris, France, 2002.
504. J. Paulus and A. Klapuri. Model-based event labeling in the transcription of
percussive audio signals. In M. Davies, editor, International Conference on
Digital Audio Effects, pp. 73-77, London, UK, September 2003.
505. J. Paulus and T. Virtanen. Drum transcription with non-negative spectrogram
factorisation. In European Signal Processing Conference, Antalya, Turkey, Sep-
tember 2005.
506. J.K. Paulus and A.P. Klapuri. Conventional and periodic N-grams in the tran-
scription of drum sequences. In IEEE International Conference on Multimedia
and Expo, Volume 2, pp. 737-740, Baltimore, Maryland, USA, July 2003.
507. S. Pauws. CubyHum: A fully operational query by humming system. In
International Conference on Music Information Retrieval, pp. 187-196, Paris,
France, October 2002.
508. Z. Pawlak. Rough set elements. In L. Polkowski and A. Skowron, editors, Rough
Sets in Knowledge Discovery, pp. 10-30. Physica-Verlag, Heidelberg, 1998.
509. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann Publishers, 1988.
510. G. Peeters. Automatic classification of large musical instrument databases us-
ing hierarchical classifiers with inertia ratio maximization. In Audio Engineer-
ing Society 115th Convention, New York, NY, USA, 2003.
511. G. Peeters. A large set of audio features for sound description (similarity
and classification) in the CUIDADO project. Technical report, IRCAM, Paris,
France, April 2004.
512. G. Peeters, A. La Burthe, and X. Rodet. Toward automatic music audio sum-
mary generation from signal analysis. In International Conference on Music
Information Retrieval, pp. 94-100, Paris, France, October 2002.
513. G. Peeters, S. McAdams, and P. Herrera. Instrument sound description in the
context of MPEG-7. In International Computer Music Conference, pp. 166-
169, Berlin, Germany, 2000.
514. G. Peeters and X. Rodet. Automatically selecting signal descriptors for sound
classification. In International Computer Music Conference, Goteborg, Sweden,
September 2002.
515. G. Peeters and X. Rodet. Hierarchical Gaussian tree with inertia ratio maxi-
mization for the classification of large musical instrument databases. In Inter-
national Conference on Digital Audio Effects, London, UK, 2003.
516. G. Peeters and X. Rodet. Signal-based music structure discovery for music au-
dio summary generation. In International Computer Music Conference, pp. 15-
22, Singapore, 2003.
517. I. Peretz. Music perception and recognition. In B. Rapp, editor, The Handbook
of Cognitive Neuropsychology, pp. 519-540. Hove: Psychology Press, 2001.
518. I. Peretz and M. Coltheart. Modularity of music processing. Nature Neuro-
science, 6(7), 2003.
519. G. Peterschmitt, E. Gomez, and P. Herrera. Pitch-based solo location. In
MOSART Workshop on Current Research Directions in Computer Music,
Barcelona, Spain, 2001.
520. M. Piszczalski. A Computational Model of Music Transcription. PhD thesis,
Univ. of Michigan, Ann Arbor, 1986.
521. M. Piszczalski and B.A. Galler. Automatic music transcription. Computer Mu-
sic Journal, 1(4):24-31, 1977.
522. C.J. Plack, A.J. Oxenham, R.R. Fay and A.N. Popper, editors. Pitch. Springer,
New York, 2005.
523. C.J. Plack and R.P. Carlyon. Loudness perception and intensity coding. In
Moore [474], pp. 123-160.
524. J.C. Platt. Probabilistic outputs for support vector machines and comparisons
to regularized likelihood methods. In A.J. Smola, P. Bartlett, B. Scholkopf,
and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press,
1999.
525. M.D. Plumbley. Conditions for non-negative independent component analysis.
IEEE Signal Processing Letters, 9(6), 2002.
526. M.D. Plumbley and E. Oja. A 'non-negative PCA' algorithm for indepen-
dent component analysis. IEEE Transactions on Neural Networks, 15(1):66-67,
2004.
527. H.F. Pollard and E.V. Janson. A tristimulus method for the specification of
musical timbre. Acustica, 51:162-171, 1982.
528. E. Pollastri. A pitch tracking system dedicated to process singing voice for
musical retrieval. In IEEE International Conference on Multimedia and Expo,
Volume 1, pp. 341-344, Lausanne, Switzerland, 2002.
529. D-J. Povel and P. Essens. Perception of musical patterns. Music Perception,
2(4):411-440, Summer 1985.
530. E. Prame. Vibrato extent and intonation in professional Western lyric singing.
Journal of the Acoustical Society of America, 102(1):616-621, July 1997.
531. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical
Recipes in C/C++: The Art of Scientific Computing. Cambridge University
Press, Cambridge, UK, 2002.
532. H. Purnhagen and N. Meine. HILN: The MPEG-4 parametric audio coding
tools. In the IEEE International Symposium on Circuits and Systems (ISCAS
2000), Geneva, Switzerland, 2000.
533. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Pub-
lishers, San Francisco, CA, USA, 1993.
534. L.R. Rabiner. A tutorial on hidden Markov models and selected applications
in speech recognition. Proc. of IEEE, 77(2):257-289, February 1989.
535. L.R. Rabiner, M.J. Cheng, A.E. Rosenberg, and C.A. McGonegal. A compar-
ative performance study of several pitch detection algorithms. IEEE Transac-
tions on Acoustics, Speech, and Signal Processing, 24(5):399-418, 1976.
536. L.R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice
Hall, New Jersey, 1993.
537. C. Raphael. Automated rhythm transcription. In 2nd Annual International
Symposium on Music Information Retrieval, Bloomington, Indiana, USA, 2001.
538. C. Raphael. A probabilistic expert system for automatic musical accompani-
ment. Journal of Computational and Graphical Statistics, 10(3):486-512, 2001.
539. M. Reyes-Gomez, N. Jojic, and D. Ellis. Deformable spectrograms. In 10th
International Workshop on Artificial Intelligence and Statistics, pp. 285-292,
Barbados, 2005.
540. E. Rich and K. Knight. Artificial Intelligence. McGraw-Hill, New York, 1991.
541. S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an un-
known number of components. Journal of the Royal Statistical Society: Series
B, 59(4):731-792, 1997.
542. C. Roads. Research in music and artificial intelligence. ACM Computing Sur-
veys, 17(2):163-190, 1985.
543. C. Roads. The Computer Music Tutorial. MIT Press, Cambridge, USA, 1996.
544. C.P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, New
York, USA, 2000.
545. A. Ron and Z. Shen. Frames and stable bases for shift-invariant subspaces of
L2(R^d). Canadian Journal of Mathematics, 47:1051-1094, 1995.
546. D. Rosenthal. Emulation of human rhythm perception. Computer Music Jour-
nal, 16(1):64-72, Spring 1992.
547. D.F. Rosenthal. Machine Rhythm: Computer Emulation of Human Rhythm
Perception. PhD thesis, Massachusetts Institute of Technology, 1992.
548. J. Rosenthal. A First Look at Rigorous Probability Theory. World Scientific
Publishing Company, 2000.
549. M.J. Ross, H.L. Shaffer, A. Cohen, R. Freudberg, and H.J. Manley. Av-
erage magnitude difference function pitch extractor. IEEE Transactions on
Acoustics, Speech, and Signal Processing, 22:353-362, 1974.
550. T.D. Rossing. The Science of Sound. Addison Wesley, second edition, 1990.
551. T.D. Rossing. Science of Percussion Instruments. World Scientific Publishing
Company, 2000.
552. C. Rover, F. Klefenz, and C. Weihs. Identification of Musical Instruments by
Means of the Hough-Transformation. Springer, 2005.
553. R. Rowe. Machine Musicianship. MIT Press, Cambridge, Massachusetts, 2001.
554. S. Roweis. One microphone source separation. In T.K. Leen, T.G. Dietterich,
and V. Tresp, editors, Neural Information Processing Systems, pp. 793-799,
Denver, USA, 2000.
555. D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representa-
tions by error propagation. In D.E. Rumelhart, J.L. McClelland, and the PDP
Research Group, editors, Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, pp. 318-362. MIT Press, Cambridge, MA, 1986.
556. S. Rüping. A simple method for estimating conditional probabilities for SVMs.
Technical report, CS Department, Dortmund University, Dortmund, Germany,
December 2004.
576. X. Serra and J.O. Smith. Spectral modeling synthesis: A sound analy-
sis/synthesis system based on a deterministic plus stochastic decomposition.
Computer Music Journal, 14(4):12-24, Winter 1990.
577. W.A. Sethares, R.D. Morris, and J.C. Sethares. Beat tracking of musical per-
formances using low-level audio features. IEEE Transactions on Speech and
Audio Processing, 13(2):1063-1076, 2005.
578. W.A. Sethares and T.A. Staley. Meter and periodicity in musical performance.
Journal of New Music Research, 30(2), June 2001.
579. F. Sha and L.K. Saul. Real-time pitch determination of one or more voices by
nonnegative matrix factorization. In Neural Information Processing Systems,
Vancouver, Canada, 2004.
580. G. Shafer. A Mathematical Theory of Evidence. Princeton University Press,
1976.
581. R.V. Shannon, F.-G. Zeng, V. Kamath, J. Wygonski, and M. Ekelid. Speech
recognition with primarily temporal cues. Science, 270(5234):303-304, 1995.
582. J.M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients.
IEEE Transactions on Signal Processing, 41(12):3445-3462, 1993.
583. A. Sheh and D.P.W. Ellis. Chord segmentation and recognition using EM-
trained hidden Markov models. In International Conference on Music Infor-
mation Retrieval, pp. 183-189, Baltimore, USA, October 2003.
584. R.N. Shepard. Circularity in judgments of relative pitch. Journal of the Acousti-
cal Society of America, 36(12):2346-2353, 1964.
585. J. Shifrin, B. Pardo, C. Meek, and W. Birmingham. HMM-based musical query
retrieval. In ACM/IEEE Joint Conference on Digital Libraries, pp. 295-300,
Oregon, USA, 2002.
586. H. Shih, S.S. Narayanan, and C.-C.J. Kuo. A statistical multidimensional hum-
ming transcription using phone level hidden Markov models for query by hum-
ming systems. In IEEE International Conference on Multimedia and Expo,
Volume 1, pp. 61-64, Baltimore, Maryland, USA, 2003.
587. H.-H. Shih, S.S. Narayanan, and C.-C.J. Kuo. An HMM-based approach to
humming transcription. In IEEE International Conference on Multimedia and
Expo, Lausanne, Switzerland, 2002.
588. H.-H. Shih, S.S. Narayanan, and C.-C.J. Kuo. Multidimensional humming
transcription using a statistical approach for query by humming systems. In
IEEE International Conference on Acoustics, Speech, and Signal Processing,
Volume 5, pp. 541-544, Hong Kong, China, 2003.
589. J.I. Shonle and K.E. Horan. The pitch of vibrato tones. Journal of the Acousti-
cal Society of America, 67(1):246-252, January 1980.
590. J. Sillanpaa, A. Klapuri, J. Seppanen, and T. Virtanen. Recognition of acoustic
noise mixtures by combined bottom-up and top-down processing. In European
Signal Processing Conference, 2000.
591. M. Slaney. An efficient implementation of the Patterson-Holdsworth audi-
tory filter bank. Technical Report 35, Perception Group, Advanced Technology
Group, Apple Computer, 1993.
592. M. Slaney. A critique of pure audition. In International Joint Conference on
Artificial Intelligence, pp. 13-18, Montreal, Quebec, 1995.
593. M. Slaney. Mixtures of probability experts for audio retrieval and indexing.
In IEEE International Conference on Multimedia and Expo, Lausanne, Switzer-
land, 2002.
594. M. Slaney and R.F. Lyon. A perceptual pitch detector. In IEEE International
Conference on Acoustics, Speech, and Signal Processing, pp. 357-360, Albu-
querque, New Mexico, 1990.
595. M. Slaney and R.F. Lyon. On the importance of time - a temporal represen-
tation of sound. In M. Cooke, S. Beet, and M. Crawford, editors, Visual Rep-
resentations of Speech Signals, pp. 95-116. John Wiley & Sons, 1993.
596. D. Sleator and D. Temperley. Melisma music analyser code, 2001.
http://www.link.cs.cmu.edu/music-analysis/.
597. D. Slezak, P. Synak, A. Wieczorkowska, and J. Wroblewski. KDD-based ap-
proach to musical instrument sound recognition. In M.-S. Hacid, Z.W. Ras,
D.A. Zighed, and Y. Kodratoff, editors, International Symposium on Method-
ologies for Intelligent Systems, Volume 2366 of Lecture Notes in Artificial In-
telligence, pp. 28-36. Springer, 2002.
598. P. Smaragdis. Redundancy Reduction for Computational Audition, a Unifying
Approach. PhD thesis, Massachusetts Institute of Technology, 2001.
599. P. Smaragdis. Discovering auditory objects through non-negativity constraints.
In ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio
Processing, Jeju, Korea, 2004.
600. P. Smaragdis and J.C. Brown. Non-negative matrix factorization for poly-
phonic music transcription. In IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, New Paltz, USA, 2003.
601. L.M. Smith and P. Kovesi. A continuous time-frequency approach to repre-
senting rhythmic strata. In 4th Int. Conf. on Music Perception and Cognition,
Montreal, Canada, 1996.
602. B. Snyder. Music and Memory. MIT Press, Cambridge, Massachusetts, 2000.
603. J. Song, S.Y. Bae, and K. Yoon. Mid-level music melody representation of
polyphonic audio for query-by-humming system. In International Conference
on Music Information Retrieval, pp. 133-139, Paris, France, October 2002.
604. T. Sonoda, M. Goto, and Y. Muraoka. A WWW-based melody retrieval system.
In International Computer Music Conference, pp. 349-352, Michigan, USA,
1998.
605. T. Sonoda, T. Ikenaga, K. Shimizu, and Y. Muraoka. The design method of a
melody retrieval system on parallelized computers. In International Conference
on Web Delivering of Music, pp. 66-73, Darmstadt, Germany, December 2002.
606. A. Srinivasan, D. Sullivan, and I. Fujinaga. Recognition of isolated instrument
tones by conservatory students. In 7th Int. Conf. on Music Perception and
Cognition, pp. 720-723, Sydney, Australia, 2002.
607. M.J. Steedman. The perception of musical rhythm and metre. Perception,
6(5):555-569, 1977.
608. D. Van Steelant, K. Tanghe, S. Degroeve, B. De Baets, M. Leman, and
J.-P. Martens. Classification of percussive sounds using support vector ma-
chines. In Machine Learning Conference of Belgium and The Netherlands,
Brussels, Belgium, January 2004.
609. A. Sterian, M.H. Simoni, and G.H. Wakefield. Model-based musical transcrip-
tion. In International Computer Music Conference, Beijing, China, 1999.
610. A. Sterian and G.H. Wakefield. Music transcription systems: From sound to
symbol. In Workshop on AI and Music, 2000.
611. A.D. Sterian. Model-Based Segmentation of Time-Frequency Images for Mu-
sical Transcription. PhD thesis, MusEn Project, University of Michigan, Ann
Arbor, 1999.
612. J.V. Stone, J. Porrill, C. Buchel, and K. Friston. Spatial, temporal, and spa-
tiotemporal independent component analysis of fMRI data. In the 18th Leeds
Statistical Research Workshop on Spatial-Temporal Modelling and its Applica-
tions, 1999.
613. J. Sundberg. The Science of the Singing Voice. Northern Illinois University
Press, 1987.
614. J. Sundberg. The perception of singing. In D. Deutsch, editor, The Psychology
of Music, pp. 171-214. Academic Press, 1999.
615. P. Szczuko, P. Dalka, M. Dabrowski, and B. Kostek. MPEG-7-based low-level
descriptor effectiveness in the automatic musical sound classification. In Audio
Engineering Society 116th Convention, Berlin, Germany, 2004.
616. J. Tabrikian, S. Dubnov, and Y. Dickalov. Maximum a posteriori probability
pitch tracking in noisy environments using harmonic model. IEEE Transactions
on Speech and Audio Processing, 12(1):76-87, 2004.
617. H. Takeda, T. Nishimoto, and S. Sagayama. Rhythm and tempo recognition
of musical performance from a probabilistic approach. In International Con-
ference on Music Information Retrieval, Barcelona, Spain, October 2004.
618. D. Talkin. A robust algorithm for pitch tracking. In Kleijn and Paliwal [355],
pp. 495-517.
619. A.S. Tanguiane. Artificial Perception and Music Recognition. Springer, Berlin
Heidelberg, 1993.
620. T. Tarvainen. Automatic Drum Track Transcription from Polyphonic Mu-
sic. Master's thesis, Department of Computer Science, University of Helsinki,
Helsinki, Finland, May 2004.
621. D. Temperley. The Cognition of Basic Musical Structures. MIT Press, Cam-
bridge, Massachusetts, 2001.
622. D. Temperley and D. Sleator. Modeling meter and harmony: A preference-rule
approach. Computer Music Journal, 23(1):10-27, Spring 1999.
623. M. Tervaniemi and K. Hugdahl. Lateralization of auditory-cortex functions.
Brain Research Reviews, 43(3):231-46, 2003.
624. D. Thompson, editor. Concise Oxford English Dictionary. Clarendon Press, 9th
edition, 1995.
625. H.D. Thornburg, R.J. Leistikow, and J. Berger. Melody retrieval and musi-
cal onset detection from the STFT. IEEE Transactions on Speech and Audio
Processing, 2005. (submitted)
626. P. Toiviainen. An interactive MIDI accompanist. Computer Music Journal,
22(4):63-75, Winter 1998.
627. T. Tolonen and M. Karjalainen. A computationally efficient multipitch analy-
sis model. IEEE Transactions on Speech and Audio Processing, 8(6):708-716,
2000.
628. T. Tolonen, V. Valimaki, and M. Karjalainen. Evaluation of modern sound
synthesis methods. Technical Report 48, Helsinki University of Technology,
March 1998.
629. I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector
machine learning for interdependent and structured output spaces. In Interna-
tional Conference on Machine Learning, Banff, Canada, 2004.
630. G. Tzanetakis. Song-specific bootstrapping of singing voice structure. In IEEE
International Conference on Multimedia and Expo, Sorrento (Naples), Italy,
2004.
631. G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE
Transactions on Speech and Audio Processing, 10(5):293-302, July 2002.
632. G. Tzanetakis, G. Essl, and P. Cook. Automatic musical genre classification of
audio signals. In 2nd Annual International Symposium on Music Information
Retrieval, 2001.
633. G. Tzanetakis, G. Essl, and P. Cook. Human perception and computer extrac-
tion of beat strength. In International Conference on Digital Audio Effects,
pp. 257-261, 2002.
634. C. Uhle, C. Dittmar, and T. Sporer. Extraction of drum tracks from polyphonic
music using independent subspace analysis. In the 4th International Symposium
on Independent Component Analysis and Blind Signal Separation, Nara, Japan,
2003.
635. C. Uhle and J. Herre. Estimation of tempo, micro time and time signature
from percussive music. In International Conference on Digital Audio Effects,
London, UK, 2003.
636. M. Unoki and M. Akagi. A method of signal extraction from noisy signal based
on auditory scene analysis. Speech Communication, 27(3):261-279, 1999.
637. L. Van Immerseel and J.P. Martens. Pitch and voiced/unvoiced determina-
tion with an auditory model. Journal of the Acoustical Society of America,
91(6):3511-3526, June 1992.
638. S.V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction. John
Wiley & Sons, 1996.
639. S.V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction,
pp. 270-290. Wiley, 2nd edition, July 2000.
640. R. Ventura-Miravet, F. Murtagh, and J. Ming. Pattern recognition of musi-
cal instruments using hidden Markov models. In Stockholm Music Acoustics
Conference, Stockholm, Sweden, 2003.
641. T. Verma and T. Meng. Extending spectral modeling synthesis with transient
modeling synthesis. Computer Music Journal, 24(2):47-59, 2000.
642. M. Vetterli and J. Kovacevic. Wavelets and Subband Coding. Prentice Hall,
Englewood Cliffs, NJ, USA, 1995.
643. T. Viitaniemi, A. Klapuri, and A. Eronen. A probabilistic model for the tran-
scription of single-voice melodies. In 2003 Finnish Signal Processing Sympo-
sium, pp. 59-63, Tampere, Finland, May 2003.
644. E. Vincent. Modeles d'Instruments pour la Separation de Sources et la Tran-
scription d'Enregistrements Musicaux. PhD thesis, IRCAM and Universite
Pierre et Marie Curie, Paris, 2004.
645. E. Vincent. Musical source separation using time-frequency source priors. IEEE
Transactions on Acoustics, Speech, and Signal Processing, 2005. special issue
on Statistical and Perceptual Audio Processing, to appear.
646. E. Vincent and M.D. Plumbley. A prototype system for object coding of musical
audio. In IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, New Paltz, USA, October 2005.
647. E. Vincent and X. Rodet. Instrument identification in solo and ensemble music
using independent subspace analysis. In International Conference on Music
Information Retrieval, Barcelona, Spain, 2004.
648. E. Vincent and X. Rodet. Music transcription with ISA and HMM. In the
5th International Symposium on Independent Component Analysis and Blind
Signal Separation, 2004.
649. T. Virtanen. Audio Signal Modeling with Sinusoids Plus Noise. Technical re-
port, Tampere University of Technology, Department of Information Technol-
ogy, 2000. Master's thesis.
650. T. Virtanen. Sound source separation using sparse coding with temporal conti-
nuity objective. In International Computer Music Conference, Singapore, 2003.
651. T. Virtanen. Separation of sound sources by convolutive sparse coding. In
ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio
Processing, Jeju, Korea, 2004.
652. T. Virtanen and A. Klapuri. Separation of harmonic sound sources using sinu-
soidal modeling. In IEEE International Conference on Acoustics, Speech, and
Signal Processing, Volume 2, pp. 765-768, Istanbul, Turkey, 2000.
653. T. Virtanen and A. Klapuri. Separation of harmonic sounds using linear models
for the overtone series. In IEEE International Conference on Acoustics, Speech,
and Signal Processing, Orlando, USA, 2002.
654. A. Viterbi. Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm. IEEE Transactions on Information Theory,
13(2):260-269, April 1967.
655. G.H. Wakefield. Mathematical representation of joint time-chroma distribu-
tions. In SPIE Conference on Advanced Signal Processing Algorithms, Archi-
tectures, and Implementations, pp. 637-645, 1999.
656. P.J. Walmsley. Signal Separation of Musical Instruments: Simulation-Based
Methods for Musical Signal Decomposition and Transcription. PhD thesis, De-
partment of Engineering, University of Cambridge, September 2000.
657. P.J. Walmsley, S.J. Godsill, and P.J.W. Rayner. Multidimensional optimisa-
tion of harmonic signals. In European Signal Processing Conference, Island of
Rhodes, Greece, September 1998.
658. P.J. Walmsley, S.J. Godsill, and P.J.W. Rayner. Polyphonic pitch tracking us-
ing joint Bayesian estimation of multiple frame parameters. In IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA,
October 1999.
659. C. Wang, R. Lyu, and Y. Chiang. A robust singing melody tracker using adap-
tive round semitones (ARS). In 3rd International Symposium on Image and
Signal Processing and Analysis, pp. 549-554, Rome, Italy, 2003.
660. Y. Wang. A beat-pattern based error concealment scheme for music delivery
with burst packet loss. In IEEE International Conference on Multimedia and
Expo, 2001.
661. Y. Wang and M. Vilermo. A compressed domain beat detector using mp3 audio
bitstreams. In ACM International Multimedia Conference, Ottawa, Canada,
2001.
662. R.M. Warren. Perceptual restoration of missing speech sounds. Science,
167:392-393, 1970.
663. M. Weintraub. A computational model for separating two simultaneous talkers.
In IEEE International Conference on Acoustics, Speech, and Signal Processing,
pp. 81-84, Tokyo, Japan, 1986.
664. J. Wellhausen and H. Crysandt. Temporal audio segmentation using MPEG-7
descriptors. In SPIE Storage and Retrieval for Media Databases 2003, Vol-
ume 5021, pp. 380-387, 2003.
665. D. Wessel. Timbre space as a musical control structure. Computer Music Jour-
nal, 3:45-52, 1979.
683. K. Yoshii, M. Goto, and H.G. Okuno. Drum sound identification for polyphonic
music using template adaptation and matching methods. In ISCA Tutorial
and Research Workshop on Statistical and Perceptual Audio Processing, Jeju,
Korea, October 2004.
684. T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata, and H.G. Okuno. Automatic
chord transcription with concurrent recognition of chord symbols and bound-
aries. In International Conference on Music Information Retrieval, pp. 100-
105, Barcelona, Spain, October 2004.
685. S.J. Young, N.H. Russell, and J.H.S. Thornton. Token passing: A simple con-
ceptual model for connected speech recognition systems. Technical report, De-
partment of Engineering, University of Cambridge, July 1989.
686. B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from de-
cision trees and naive Bayesian classifiers. In International Conference on Ma-
chine Learning, pp. 202-209, Williamstown, Massachusetts, USA, June 2001.
687. R.J. Zatorre, P. Belin, and V.B. Penhune. Structure and function of auditory
cortex: Music and speech. TRENDS in Cognitive Sciences, 6(1):37-46, 2002.
688. T. Zhang. Instrument classification in polyphonic music based on timbre analy-
sis. In SPIE, Internet Multimedia Management Systems II, pp. 136-147, 2001.
689. Y. Zhu and M.S. Kankanhalli. Robust and efficient pitch tracking for query-
by-humming. In 2003 Joint Conference of the Fourth International Conference
on Information, Communications and Signal Processing, 2003 and the Fourth
Pacific Rim Conference on Multimedia, Volume 3, pp. 1586-1590, Singapore,
December 2003.
690. M. Zibulski and Y. Zeevi. Analysis of multiwindow Gabor-type schemes by
frame methods. Applied and Computational Harmonic Analysis, 4(2):188-212,
1997.
691. A. Zils. Extraction de Descripteurs Musicaux: Une Approche Évolutionniste.
PhD thesis, Sony CSL Paris and Laboratoire d'Informatique de l'Université
Paris 6, 2004.
692. A. Zils and F. Pachet. Automatic extraction of music descriptors from acoustic
signals using EDS. In Audio Engineering Society 116th Convention, Berlin, Ger-
many, 2004.
693. A. Zils, F. Pachet, O. Delerue, and F. Gouyon. Automatic extraction of drum
tracks from polyphonic music signals. In International Conference on Web
Delivering of Music, Darmstadt, Germany, December 2002.
694. E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer, 1999.
Index
Kalman filter, 51, 52, 117, 121, 221-223, 226, 324
Kernel, 57
  Gaussian, 57
  positive definite, 57
Kullback-Leibler divergence, 282
  symmetric, 287
Kullback-Leibler information, 335
Kurtosis, 275
  spectral, 136
Language model, 15
Laplace approximation, 225
Laplacian distribution, 280
Latent variable, 35
Law of large numbers, 30
Lazy learning, 185
Least-squares, 285
Leave-one-out cross-validation, 170
Lebesgue measure, 28
Legato, 9
  in singing, 367
Level adaptation in auditory model, 239, 246
Level compression, see Compression
Likelihood, 210, 219, 332
Likelihood function, 32
  degenerate, 33, 38
  penalized, 38, 41, 57
Linear discriminant analysis, 186, 191
Linear interpolation, 253
Linear prediction, 39
Linear programming, 286
Local cosine basis, 73
Localized source model, 142
Locally harmonic sources, 69
Log-Gaussian distribution, 224
Loss function, 55
Loudness, 8, 172
  of instrument sounds, 319
  of melody vs. accompaniment, 340
  of singing, 366
Mallet percussion instrument, 232
Marginal MMSE, 214
Marginal probability density function, 31
Marimba, 167, 232
Markov chain, see N-gram model
Markov chain Monte Carlo, 43, 117, 121, 215
Markov tree, 90
Masking, 236
Matching pursuit, 84
Matrix diagonalization, 54
Matrix factorization, see Non-negative matrix factorization
Maximum a posteriori estimation, 40, 121, 214, 279, 331
Maximum likelihood estimation, 33, 35
Mbira, 167
McNemar's test, 171
Mean, 29, 30
  empirical, 54, 61
Mean square error of an estimator, 34
Measure
  musical measure, see Bar line
  musical measure estimation, see Bar line estimation
Mechanical-to-neural transduction, 238
Mel frequency cepstral coefficients, 26
Mel frequency scale, 26
Mel-frequency cepstral coefficients, 63, 135, 174
  delta-MFCC, 175
mel-frequency cepstral coefficients, 270
Mel-frequency scale, 173
Mel-scale filterbank, 26
mel-scale filterbank, 173
Melodic phrase, see Phrase
Melody, 9, 12, 13, 329
  perceptual coherence, 15
  segregation of melodic lines, 247
  transcription, see Predominant F0 estimation
Membranophone, 131, 145, 147, 167
Memory for music, 11, 13
Message passing, 221
Metadata, see Annotation
Metre, 10, 105, 312, 329, 341
Metre analysis, 101, 134, 341, 388
Metre perception, 10, 11, 102
MFCC, see Mel frequency cepstral coefficients
Mid-level data representation, 12, 13, 65, 244, 248, 251, 256, 264
  desirable qualities, 14
  hybrid, see Hybrid representation