PhD Thesis
Computationally efficient methods
for polyphonic music transcription
Committee members:
Xavier Serra (Universitat Pompeu Fabra, Barcelona, Spain)
Gérard Assayag (IRCAM, Paris, France)
Anssi Klapuri (Queen Mary University, London, UK)
José Oncina (Universidad de Alicante, Spain)
Isabel Barbancho (Universidad de Málaga, Spain)
To Teima
Acknowledgments
First and foremost, I would like to thank all members of the computer music lab
from the University of Alicante for providing an excellent, inspiring, and pleasant
working atmosphere. I am especially grateful to the head of the group and supervisor of this
work, Prof. José Manuel Iñesta. His encouraging scientific spirit provides an
excellent framework for inspiring the new ideas that make us continuously
grow and advance. I owe this work to his advice, support, and help.
Carrying out a PhD is not an easy task, and it would not have been possible without the help of so many people.
First, I would like to thank all the wonderful staff of our GRFIA group, and
in general, all the DLSI department from the University of Alicante. My
research periods at the Audio Research Group from the Tampere University of
Technology, the Music Technology Group from the Universitat Pompeu Fabra,
and the Department of Software Technology and Interactive Systems from the
Vienna University of Technology, also contributed decisively to making this work
possible. I have learned much, as a scientist and as a person, from the wonderful
and nice researchers of all these labs.
I would also like to thank the people who directly contributed to this work.
I am grateful to Dr. Francisco Moreno for deferring some of my teaching
responsibilities while this work was in progress, and for supplying the kNN
algorithms code. I learned most of the signal processing techniques needed for
music transcription from Prof. Anssi Klapuri. I’ll always be very grateful for the
great period in Tampere and his kind hosting. He directly contributed to this
dissertation by providing the basis for the sinusoidal likeness measure code, as well as
the multiple f0 databases that allowed me to evaluate and improve the proposed
algorithms. Thanks must also go to one of my undergraduate students, Jasón
Box, who collaborated on this work by building the ODB database and migrating
the onset detection code from C++ to D2K.
I wish to express my gratitude to the referees of this dissertation, for kindly
agreeing to review it, and to the committee members.
This work would not have been possible without the primary support
provided by the Spanish PROSEMUS project1 and the Consolider Ingenio
2010 MIPRCV research program2 . It has also been funded by the Spanish
CICYT projects TAR3 and TIRIG4 , and partially supported by European
Union-FEDER funds and the Generalitat Valenciana projects GV04B-541 and
GV06/166.
1 Code TIN2006-14932-C02
2 Code CSD2007-00018
3 Code TIC2000-1703-CO3-02
4 Code TIC2003-08496-C04

Beyond research, I would like to thank my family and my friends (too many
to list here, you know who you are). Although they don't exactly know what
I am working on and will never read a boring technical report in English, their
permanent understanding and friendship have actively contributed to keeping my
mind alive during this period.
Finally, this dissertation is dedicated to the most important person in my
life, Teima, for her love, support, care and patience during this period.
vi
Contents

1 Introduction
2 Background
 2.1 Analysis of audio signals
  2.1.1 Fourier transform
  2.1.2 Time-frequency representations
  2.1.3 Filters in the frequency domain
 2.2 Analysis of musical signals
  2.2.1 Dynamics
  2.2.2 Timbre
  2.2.3 Taxonomy of musical instruments
  2.2.4 Pitched musical sounds
  2.2.5 Unpitched musical sounds
  2.2.6 Singing sounds
 2.3 Music background
  2.3.1 Tonal structure
  2.3.2 Rhythm
  2.3.3 Modern music notation
  2.3.4 Computer music notation
 2.4 Supervised learning
  2.4.1 Neural networks
  2.4.2 Nearest neighbors
3 Music transcription
 3.1 Human music transcription
 3.2 Multiple fundamental frequency estimation
  3.2.1 Harmonic overlap
  3.2.2 Beating
  3.2.3 Evaluation metrics
 3.3 Onset detection
  3.3.1 Evaluation metrics
5 Onset detection
 5.1 Methodology
  5.1.1 Preprocessing
  5.1.2 Onset detection functions
  5.1.3 Peak detection and thresholding
 5.2 Evaluation with the ODB database
  5.2.1 Results using o[t]
  5.2.2 Results using õ[t]
 5.3 MIREX evaluation
  5.3.1 Methods submitted to MIREX 2009
  5.3.2 MIREX 2009 onset detection results
 5.4 Conclusions
A Resumen
Bibliography
1. Introduction
Automatic music transcription is a music information retrieval (MIR) task which
involves many different disciplines, such as audio signal processing, machine
learning, computer science, psychoacoustics and music perception, music theory,
and music cognition.
The goal of automatic music transcription is to extract a human readable
and interpretable representation, like a musical score, from an audio signal. A
score is a guide to perform a piece of music, and it can be represented in different
ways. The most widespread score representation is the modern notation used in
Western tonal music. In order to extract a readable score from a signal, it is
necessary to estimate the pitches, onset times and durations of the notes, the
tempo, the meter and the tonality of a musical piece.
The most obvious application of automatic music transcription is to help
a musician to write down the music notation of a performance from an audio
recording, which is a time consuming task when it is done by hand. Besides this
application, automatic music transcription can also be useful for other MIR
tasks, like plagiarism detection, artist identification, genre classification, and
composition assistance by changing the instrumentation, the arrangement or
the loudness before resynthesizing new pieces. In general, music transcription
methods can also provide information about the notes to symbolic music
algorithms.
The transcription process can be separated into two main stages: to convert
an audio signal to a piano-roll representation, and to convert the estimated
piano-roll into musical notation.
As pointed out by Cemgil et al. (2003), most authors only consider automatic
music transcription as an audio to piano-roll conversion, whereas piano-roll to
score notation can be seen as a separate problem. This can be justified since
the processes involved in audio to piano-roll notation include pitch estimation
and temporal note segmentation, which constitute a challenging task in themselves.
The piano-roll to score process involves tasks like tempo estimation, rhythm
quantization, key detection or pitch spelling. This stage is more related to the
generation of human readable notation.
In general, a music transcription system cannot obtain the exact score
that the musician originally read. Musical audio signals are often expressive
performances, rather than simple mechanical translations of notes read on a
sheet. A particular score can be performed by a musician in different ways.
As scores are only guides for the performers to play musical pieces, converting
the notes present in an audio signal into staff notation is an ill-posed problem
without a unique solution.
However, the conversion of a musical audio signal into a piano-roll represen-
tation without rhythmic information only depends on the waveform. Rather
than a score-oriented representation, a piano-roll can be seen as a sound-
oriented representation which displays all the notes that are playing at each
time. The conversion from an audio file into a piano-roll representation is done
by a multiple fundamental frequency (f0 ) estimation method. This is the main
module of a music transcription system, as it estimates the number of notes
sounding at each time and their pitches.
For converting a piano-roll into a readable score, other harmonic and
rhythmic components must also be taken into account. The tonality is related
to the musical harmony, showing hierarchical pitch relationships based on a
key tonic. Source separation and timbre classification can be used to identify the
different instruments present in the signal, allowing the extraction of individual
scores for each instrument. The metrical structure refers to the hierarchical
temporal structure. It specifies how many beats are in each measure and what
note value constitutes one beat, so bars can be added to the score to make it
readable by a musician. The tempo is a measure that specifies how fast or slow
a musical piece is.
A music transcription example is shown in Fig. 1.1. The audio performance
of the score at the top of the figure was synthesized for simplicity, and
it contained neither temporal deviations nor pedal sustain. The piano-roll
inference was done without errors. The key, tempo, and meter estimates can be
inferred from the waveform or from the symbolic piano-roll representation. In
this example, these estimates were also correct (except for the anacrusis1 at
the beginning, which causes all the bars to be shifted). However, it can be seen
that the resulting score differs from the original one.
When a musician performs a score, the problem is even more challenging,
as there are frequent temporal deviations, and the onset and duration of the
notes must be adjusted (quantized) to obtain a readable score. Note that
quantizing temporal deviations implies that the synthesized waveform of the
1 The term anacrusis refers to the note or sequence of notes which precede the beginning of the first bar.
Figure 1.1: Music transcription example from Chopin (Nocturne, Op. 9, N. 2). The original score (Frédéric Chopin, Nocturne Op. 9 N. 2, "A M.me Marie Pleyel", Andante) is shown at the top; the score at the bottom was obtained from the synthesized audio through multiple f0 estimation. Score licensed under Creative Commons Attribution-ShareAlike 3.0.
resulting score would not exactly match the original audio times. This is the
reason why the piano-roll is considered a sound-oriented representation.
This dissertation is mainly focused on the multiple f0 estimation issue, which
is crucial for music transcription. This is an extremely challenging task which
has been addressed in several doctoral theses, such as Moorer (1975), Maher
(1989), Marolt (2002), Hainsworth (2003), Cemgil (2004), Bello (2004), Vincent
(2004), Klapuri (2004), Zhou (2006), Yeh (2008), Ryynänen (2008), and Emiya
(2008).
Most multiple f0 estimation methods are complex and have high computa-
tional costs. As discussed in chapter 3, the estimation of multiple simultaneous
pitches is a challenging task due to a number of theoretical issues.
The main contributions of this work are a set of novel efficient methods
proposed for multiple fundamental frequency estimation (chapters 6 and 7). The
proposed algorithms have been evaluated and compared with other approaches,
yielding satisfactory results.
The detection of the beginnings of musical events in audio signals, or onset
detection, is also addressed in this work. Onset times can be used for beat
tracking, for tempo estimation, and to refine the detection in a multiple f0
estimation system. A simple and efficient novel methodology for onset detection
is described in chapter 5.
The proposed methods have also been applied to other MIR tasks, like genre
classification, mood classification, and artist identification. The main idea was
to combine audio features with symbolic features extracted from transcribed
audio files, and then use a machine learning classification scheme to yield the
genre, mood or artist. These combined approaches have been published in (Lidy
et al., 2009, 2007, 2008) and they are beyond the scope of this PhD, which is
mainly focused on music transcription itself.
This work is organized as follows. The introductory chapters 2, 3, and 4
describe respectively the theoretical background, the multiple f0 problem, and
the state of the art for automatic music transcription. Then, novel contributions
are proposed for onset detection (5), and multiple fundamental frequency
estimation (6, 7), followed by the overall conclusions and future work (8).
Outline
2 - Background. This chapter introduces the theoretical background, defining
the signal processing, music theory, and machine learning concepts that
will be used in the scope of this work.
8 - Conclusions and future work. The conclusions and future work are
discussed in this chapter.
2. Background
This chapter describes the signal processing, music theory, and machine learning
concepts needed to understand the basis of this work.
Different techniques for the analysis of audio signals based on the Fourier
transform are first introduced. The properties of musical sounds are presented,
classifying instruments according to their method of sound production and
to their spectral characteristics. Music theory concepts are also addressed,
describing the harmonic and temporal structures of Western music, and how it
can be represented using written and computer notations. Finally, the machine
learning techniques used in this work (neural networks and nearest neighbors)
are also described.
\mathrm{DFT}_x(k) = X(k) = \sum_{n=-\infty}^{+\infty} x(n) \, e^{-j 2\pi k n} \qquad (2.2)
In the real world, signals have finite length. X[k] is defined in Eq. 2.3 for a
discrete finite signal x[n].
\mathrm{DFT}_x[k] = X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j \frac{2\pi}{N} k n}, \qquad k = 0, \ldots, N-1 \qquad (2.3)
Figure 2.1: Complex plane diagram. Magnitude and phase of the complex number z = a + jb are shown: |z| = \sqrt{a^2 + b^2}, φ(z) = arctan(b/a).
The Shannon theorem limits the number of useful frequencies of the discrete
Fourier transform to the Nyquist frequency (fs /2). The frequency of each
spectral bin k can be easily computed as fk = k(fs /N ) since the N bins are
equally distributed in the frequency domain of the transformed space. Therefore,
the frequency resolution of the DFT is ∆f = fs /N .
The equations above are described in terms of complex exponentials. The
Fourier transform can also be expressed as a combination of sine and cosine
functions, equivalent to the complex representation by Euler's formula.
If the number of samples N is a power of two, then the DFT can be efficiently
computed using a fast Fourier transform (FFT) algorithm. Usually, software
packages that compute the FFT, like FFTW3 from Frigo and Johnson (2005),
use Eq. 2.3, yielding an array of complex numbers.
Using complex exponentials, the radial position or magnitude |z|, and the
angular position or phase φ(z) can easily be obtained from the complex value
z = a + jb (see Fig. 2.1).
The energy spectral density (ESD) is the squared magnitude of the DFT of
a signal x[n]. It is often called simply the spectrum of a signal. A spectrum
can be represented as a two-dimensional diagram showing the energy of a signal
|X[k]|2 as a function of frequency (see Fig. 2.2). In the scope of this work, it will
be referred to as the power spectrum (PS), whereas the magnitude spectrum (MS)
will refer to the DFT magnitudes |X[k]| as a function of frequency.
Spectra are usually plotted with linear amplitude and linear frequency scales,
but they can also be represented using a logarithmic scale for amplitude,
frequency or both. A logarithmic magnitude widely used to represent the
amplitudes is the decibel.
\mathrm{dB}(|X[k]|) = 20 \log(|X[k]|) = 10 \log(|X[k]|^2) \qquad (2.4)
3 Fastest Fourier Transform in the West. http://www.fftw.org
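For illustration, the following Python sketch (not part of the original text; the sampling rate, frame length, and test signal are assumed values) computes the DFT of a frame with an FFT routine and derives the bin frequencies, magnitude, phase, power spectrum, and decibel values defined above.

```python
import numpy as np

fs = 44100                      # sampling rate (Hz), assumed
N = 4096                        # frame length (power of two, so an FFT can be used)
n = np.arange(N)
x = 0.5 * np.sin(2 * np.pi * 440.0 * n / fs)   # example signal: a 440 Hz sinusoid

X = np.fft.fft(x, N)            # DFT (Eq. 2.3), computed with an FFT algorithm
freqs = np.arange(N) * fs / N   # bin frequencies f_k = k fs / N
mag = np.abs(X)                 # magnitude spectrum |X[k]|
phase = np.angle(X)             # phase spectrum (radians)
power = mag ** 2                # power spectrum |X[k]|^2
db = 20 * np.log10(mag + 1e-12) # decibel values (Eq. 2.4); small offset avoids log(0)

# Only bins up to the Nyquist frequency (fs/2) are useful for a real signal.
k_peak = np.argmax(mag[: N // 2])
print(f"Largest peak at {freqs[k_peak]:.1f} Hz")
```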
[Figure 2.2: a waveform x[n] and its power spectrum |X[k]|^2 obtained with the DFT.]
The analysis of discrete and finite signals presents some limitations. First, the
continuous to discrete conversion process can produce aliasing and quantization
noise. The solution to the aliasing problem is to ensure that the sampling rate
is high enough to avoid any spectral overlap or to use an anti-aliasing filter.
The DFT also introduces drawbacks like spectral leakage and the picket
fence effect. Spectral leakage is an effect where, due to the finite nature of the
analyzed signal, small amounts of energy are observed in frequency components
that do not exist in the original waveform, forming a series of lobes in the
frequency domain.
The picket fence is an effect related to the discrete nature of the DFT
spectrum, which is analogous to looking at it through a sort of picket fence,
since we can observe the exact behavior only at discrete points. Therefore,
there may be peaks in a DFT spectrum that will be measured too low in level,
and valleys that will be measured too high, and the true frequencies where the
peaks and valleys are will not be exactly those indicated in the spectrum.
\mathrm{STFT}^w_x[k, m] = \sum_{n=0}^{N-1} x[n] \, w[n - mI] \, e^{-j \frac{2\pi}{N} k n}, \qquad k = 0, \ldots, N-1 \qquad (2.5)
where w is the window function, m is the window position index, and I is the
hop size.
Figure 2.3: Magnitude spectrogram for the beginning section of a piano note.
Only the first 60 spectral bins and the first 5 frames are shown. Spectrum at
each frame is projected into a plane.
The hop size of the STFT determines how much the analysis starting time
advances from frame to frame. Like the frame length (window size), the choice
of the hop size depends on the purposes of the analysis. In general, a small hop
produces more analysis points and therefore, smoother results across time, but
the computational cost is proportionately increased.
Choosing a short frame duration in the STFT leads to a good time resolution
and a bad frequency resolution, and a long frame duration results in a good
frequency resolution but a bad time resolution. Time and frequency resolutions
are conjugate magnitudes, which means that ∆f ∝ 1/∆t; therefore, they cannot
simultaneously be measured with arbitrary precision. The decision about the length
of the frames in the STFT to get an appropriate balance between temporal and
frequency resolution depends on the application.
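A minimal sketch of the STFT of Eq. 2.5, using a Hann window and assumed values for the frame length and hop size (illustrative only, not the analysis front-end used in this work):

```python
import numpy as np

def stft(x, N=2048, I=512):
    """Magnitude spectrogram |STFT_x[k, m]| (Eq. 2.5) with a Hann window."""
    w = np.hanning(N)
    n_frames = 1 + (len(x) - N) // I            # number of complete frames
    frames = np.empty((n_frames, N // 2 + 1))
    for m in range(n_frames):
        segment = x[m * I : m * I + N] * w      # windowed frame starting at sample mI
        frames[m] = np.abs(np.fft.rfft(segment))  # keep bins up to the Nyquist frequency
    return frames

# Example: spectrogram of a test signal sampled at fs = 44100 Hz (assumed values).
fs = 44100
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * (220 + 200 * t) * t)     # a slowly rising tone
S = stft(x)
print(S.shape)                                  # (frames, bins)
```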
y[n] = (x * g)[n] = \sum_{k=-\infty}^{\infty} x[k] \, g[n-k] \qquad (2.6)
Figure 2.5: Time-frequency resolution grids without overlap for STFT, DWT,
constant Q transform from Brown (1991), and a filter bank with 6 bands.
Constant Q transform
Using the Fourier transform, all the spectral bins obtained are equally spaced
by a constant amount ∆f = fs/N. However, the frequencies of the musical notes
(see section 2.3) are geometrically spaced in a logarithmic scale4 .
The constant Q transform is a calculation similar to the Fourier transform,
but with a constant ratio of frequency to resolution Q. This means that each
spectral component k is separated by a variable frequency resolution ∆fk =
fk /Q.
Brown (1991) proposed a constant Q transform in which the center
frequencies fk can be specified as fk = 2^{k/b} fmin, where b is the number of
filters per octave and fmin is the minimum central frequency considered. The
transform using Q = 34 is similar (although not equivalent) to a 1/24 octave
filter bank. The constant Q transform for the k-th spectral component is:
X_Q[k] = \frac{1}{N[k]} \sum_{n=0}^{N[k]-1} w[k,n] \, x[n] \, e^{-j \frac{2\pi Q}{N[k]} n} \qquad (2.7)
Figure 2.6: Example of a filter bank with triangular shaped bands arranged in
a logarithmic frequency scale.
where N [k] is the window size (in samples) used to compute the transform of
the frequency k:
N[k] = f_s / \Delta f_k = (f_s / f_k) \, Q \qquad (2.8)
The window function w[k, n] used to minimize spectral leakage has the same
shape but a different length for each component. An efficient implementation
of the constant Q transform was described by Brown and Puckette (1992).
The main drawback with this method is that it does not take advantage of
the greater time resolution that can be obtained using shorter windows at high
frequencies, losing coverage in the time-frequency plane (see Fig. 2.5(c)).
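The relations above can be illustrated with a short sketch that computes the geometrically spaced center frequencies fk = 2^{k/b} fmin, the constant ratio Q, and the per-bin window lengths N[k] of Eq. 2.8 (fs, fmin, and b are assumed values; this is not the efficient implementation of Brown and Puckette, 1992):

```python
import numpy as np

fs = 44100          # sampling rate (Hz), assumed
fmin = 27.5         # minimum center frequency (A0), assumed
b = 24              # filters per octave (quarter-tone resolution), assumed
n_bins = 8 * b      # cover eight octaves

Q = 1.0 / (2 ** (1.0 / b) - 1)            # constant ratio of frequency to resolution
k = np.arange(n_bins)
fk = fmin * 2 ** (k / b)                  # center frequencies, geometrically spaced
Nk = np.round(Q * fs / fk).astype(int)    # window length per bin (Eq. 2.8)

print(f"Q = {Q:.1f}")                     # about 34 for b = 24, as mentioned in the text
print(fk[:3], Nk[:3])
```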
b_i = \sum_{k=0}^{K-1} (|X[k]| \cdot |H_i[k]|)^2 \qquad (2.9)
\mathrm{Mel}(f) = 2595 \log_{10}\!\left( \frac{f}{700} + 1 \right) \qquad (2.10)
As Huang et al. (2001) point out, one Mel represents one-thousandth of the
pitch of 1 kHz, and a doubling of Mels produces a perceptual doubling of pitch.
Another psychoacoustic scale is the Bark scale, introduced by Zwicker et al. (1957),
which partitions the hearing bandwidth into perceptually equal frequency bands
(critical bands). If the distance between two spectral components is less than
the critical bandwidth, then one masks the other.
The Bark scale, also called critical band rate (CBR), is defined so that the
critical bands of human hearing have a width of one bark. This partitioning,
based on the results of psychoacoustic experiments, simulates the spectral
analysis performed by the basilar membrane, in such a way that each point on
the basilar membrane can be considered as a bandpass filter having a bandwidth
equal to one critical bandwidth, or one bark.
A CBR filter bank is composed of a set of critical band filters, each one
corresponding to one bark. The center frequencies to build the filter bank
are described by Zwicker (1961). The Mel filter bank is composed of a set of
filters with a triangular shape and equally spaced in terms of Mel frequencies.
Shannon and Paliwal (2003) showed that Bark and Mel filter banks have similar
performance in speech recognition tasks.
The Mel frequency cepstral coefficients (MFCC) have been extensively used
in tasks such as automatic speech recognition and music processing. To compute
the MFCC features, the power spectrum of the signal is first computed and
apportioned through a Mel filter bank. The logarithm of the energy for each
filter is calculated before applying a Discrete Cosine Transform (DCT, see
Ahmed et al., 1974) to produce the MFCC feature vector. The DCT of a
discrete signal x[n] with a length N is defined as:
\mathrm{DCT}_x[i] = \sum_{n=0}^{N-1} x[n] \cos\!\left[ \frac{\pi}{N} \, i \left( n + \frac{1}{2} \right) \right] \qquad (2.11)
[Figure: the Mel scale as a function of the Hertz scale.]
The MFCC are the obtained DCT amplitudes. In most applications, the
dimensionality of the MFCC representation is usually reduced by selecting only
certain coefficients.
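The MFCC computation just described can be sketched as follows (an illustrative implementation with assumed parameter values, not the feature extractor used in this work): magnitude spectrum, Mel-spaced triangular filter bank (Eqs. 2.9 and 2.10), logarithm of the band energies, and DCT (Eq. 2.11).

```python
import numpy as np

def mel(f):
    """Hz to Mel (Eq. 2.10)."""
    return 2595.0 * np.log10(f / 700.0 + 1.0)

def inv_mel(m):
    """Mel to Hz (inverse of Eq. 2.10)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=44100, n_filters=40, n_coeffs=13):
    N = len(frame)
    mag = np.abs(np.fft.rfft(frame))                    # magnitude spectrum |X[k]|
    freqs = np.arange(N // 2 + 1) * fs / N
    # Triangular filters |H_i[k]| equally spaced on the Mel scale up to fs/2.
    edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (mid - lo)                  # rising slope
        down = (hi - freqs) / (hi - mid)                # falling slope
        H = np.clip(np.minimum(up, down), 0.0, None)    # triangular response
        energies[i] = np.sum((mag * H) ** 2) + 1e-12    # band energy b_i (Eq. 2.9)
    log_e = np.log(energies)
    # DCT of the log band energies (Eq. 2.11), keeping only the first coefficients.
    n = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi / n_filters * i * (n + 0.5)))
                     for i in range(n_coeffs)])

print(mfcc(np.random.randn(2048))[:5])                  # MFCCs of a random frame
```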
The bandwidth of a filter can be expressed using an equivalent rectangular
bandwidth (ERB) measure. The ERB of a filter is defined as the bandwidth of
a perfectly rectangular filter with a unity magnitude response and same area as
that filter. According to Moore (1995), the ERB bandwidths bc of the auditory
filter at the channel c obey this equation:
b_c = 0.108 f_c + 24.7 \ \mathrm{Hz} \qquad (2.12)
where f_c is the center frequency of the filter.
A filter with a triangular shape can be useful for some applications, but
other shapes are needed to model the auditory responses. Filter frequency
responses can be expressed in terms of a gaussian function (Patterson, 1982), a
rounded exponential (Patterson et al., 1982), and a gammatone or “Patterson-
Holdsworth” filter (Patterson et al., 1995). Gammatone filters are frequently
used in music analysis, and a description of their design and implementation
can be found in (Slaney, 1993). Auditory filter banks have been used to model
the cochlear processing, using a set of gammatone filters uniformly distributed
in the critical-band scale.
[Figure: RMS envelope of a signal as a function of time.]
2.2.1 Dynamics
Musical instruments produce sounds that evolve in time. The beginning of a
sound is known as the onset time, and its temporal end is the offset time. The
amplitude envelope refers to a temporally smoothed curve of the sound intensity
as a function of time, which evolves from the onset to the offset times.
The envelope of a signal is usually calculated in the time domain by lowpass
filtering (with a 30 Hz cut-off frequency) the root mean square (RMS) levels of
a signal. The RMS levels E[n] can be obtained as:
E[n] = \sqrt{ \frac{1}{N} \sum_{i=0}^{N-1} x^2[n+i] } \qquad (2.13)
where N is the size of the frame. Real sounds have a temporal envelope with
attack and release stages (like percussion or plucked strings), or attack, sustain
and decay segments (like woodwind instruments)5 . The automatic estimation of
the intra-note segment boundaries is an open problem, and it has been addressed
by some authors like Jensen (1999), Peeters (2004), and Maestre and Gómez
(2005).
5 Synthesizers generate amplitude envelopes using attack, decay, sustain and release
(ADSR), but this segmentation is not achievable in real signals, since the decay part is often
not clearly present, and some instruments do not have defined sustain or release parts.
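A sketch of this envelope computation (the frame size, hop, and the one-pole smoother standing in for the 30 Hz lowpass filter are assumptions; illustrative only):

```python
import numpy as np

def rms_envelope(x, fs, frame_size=1024, hop=512, cutoff=30.0):
    """Frame-wise RMS levels (Eq. 2.13) smoothed with a one-pole lowpass filter.

    The one-pole smoother is a simple stand-in for the 30 Hz lowpass mentioned
    in the text; frame_size and hop are assumed values.
    """
    n_frames = 1 + (len(x) - frame_size) // hop
    rms = np.empty(n_frames)
    for m in range(n_frames):
        frame = x[m * hop : m * hop + frame_size]
        rms[m] = np.sqrt(np.mean(frame ** 2))           # Eq. 2.13
    frame_rate = fs / hop                               # frames per second
    alpha = np.exp(-2.0 * np.pi * cutoff / frame_rate)  # one-pole coefficient
    env = np.empty_like(rms)
    env[0] = rms[0]
    for m in range(1, n_frames):
        env[m] = alpha * env[m - 1] + (1.0 - alpha) * rms[m]
    return env

fs = 44100
t = np.arange(0, 0.5, 1.0 / fs)
x = np.exp(-5 * t) * np.sin(2 * np.pi * 220 * t)        # a decaying tone
print(rms_envelope(x, fs)[:5])
```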
The attack of a sound is formally defined as the initial interval during which
the amplitude envelope increases. For real sounds, Peeters (2004) considers
attack as the initial interval between the 20% and the 80% of the maximum
value in the signal, to take into account the possible presence of noise.
Transients are fast varying features characterized by sudden bursts of noise,
or fast changes of the local spectral content. During a transient, the signal
evolves in a relatively unpredictable way. A transient period is usually present
during the initial stage of the sound, and it often corresponds to the period
during which the instrument excitation is applied, though in some sounds a
transient can also be present in the release stage.
A vibrato is a periodic oscillation of the fundamental frequency, whereas
tremolo refers to a periodic oscillation in the signal amplitude. In both cases,
this oscillation is of subsonic frequency.
2.2.2 Timbre
In music, timbre is the quality that distinguishes musical instruments. The
American Standards Association (1960) defines timbre as that attribute of
sensation in terms of which a listener can judge that two sounds having the
same loudness and pitch are dissimilar.
From an audio analysis point of view, it is convenient to understand the
characteristics that make an instrument different from others. Timbral features
extracted from the waveform or the spectrum of an instrument are the basis for
automatic instrument classification.
It was shown by Grey (1978) and Wessel (1979) that important timbre
characteristics of the orchestral sounds are attack quality (temporal envelope),
spectral flux (evolution of the spectral distribution over time), and brightness
(spectral centroid).
The spectral flux (SF) is a measure of local spectrum change, defined as:
\mathrm{SF}(t) = \sum_{k=0}^{K-1} \left( \tilde{X}_t[k] - \tilde{X}_{t-1}[k] \right)^2 \qquad (2.14)
where X̃t [k] and X̃t−1 [k] are the energy normalized Fourier spectra in the current
and previous frames, respectively:
\tilde{X}[k] = \frac{|X[k]|}{\sum_{k=0}^{K-1} |X[k]|} \qquad (2.15)
The spectral centroid (SC) indicates the position of the sound spectral center
of mass, and it is related to the perceptual brightness of the sound. It is
calculated as the weighted mean of the frequencies present in the signal, and
the weights are their magnitudes.
\mathrm{SC}_X = \sum_{k=0}^{K-1} k \, \tilde{X}[k] \qquad (2.16)
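A sketch of these two descriptors (Eqs. 2.14 to 2.16), computed from the magnitude spectra of two consecutive frames with assumed parameter values:

```python
import numpy as np

def normalize(mag):
    """Energy-normalized spectrum of Eq. 2.15."""
    return mag / (np.sum(mag) + 1e-12)

def spectral_flux(mag_t, mag_prev):
    """Local spectrum change between consecutive frames (Eq. 2.14)."""
    return np.sum((normalize(mag_t) - normalize(mag_prev)) ** 2)

def spectral_centroid(mag):
    """Spectral center of mass, in bins (Eq. 2.16)."""
    k = np.arange(len(mag))
    return np.sum(k * normalize(mag))

# Example on two synthetic frames (assumed fs and frame length).
fs, N = 44100, 2048
n = np.arange(N)
frame_prev = np.sin(2 * np.pi * 440 * n / fs)
frame_t = np.sin(2 * np.pi * 440 * n / fs) + 0.5 * np.sin(2 * np.pi * 880 * n / fs)
mag_prev = np.abs(np.fft.rfft(frame_prev))
mag_t = np.abs(np.fft.rfft(frame_t))
print(spectral_flux(mag_t, mag_prev), spectral_centroid(mag_t) * fs / N)  # centroid in Hz
```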
[Figure: waveform of a pitched sound with period T0 and its DFT, where the fundamental frequency is f0 = 1/T0.]
Instruments are classified in the families above depending on their exciter, the
vibrating element that transforms the energy supplied by the player into sound.
However, a complementary taxonomy can be assumed, dividing musical sounds
into two main categories: pitched and unpitched sounds.
Harmonic sounds
from 40 Hz for low-pitched male voices to 600 Hz for children or high-pitched female voices.
whereas a thinner string under higher tension (such as a treble string in a piano)
or a more flexible string (such as a nylon string used on a guitar or harp) exhibits
less inharmonicity.
According to Fletcher and Rossing (1988), the harmonic frequencies in a
piano string approximately obey this formula:
f_h = h f_0 \sqrt{1 + B h^2} \qquad (2.17)
A typical value of the inharmonicity factor for the middle pitch range of a
piano is B = 0.0004, which is sufficient to shift the 17th partial to the ideal
frequency of the 18th partial.
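The effect of Eq. 2.17 can be checked numerically with a short sketch (the f0 value is assumed; B = 0.0004 as mentioned in the text):

```python
import numpy as np

def partial_frequencies(f0, n_partials, B=0.0):
    """Partial frequencies of a stiff string, f_h = h f0 sqrt(1 + B h^2) (Eq. 2.17)."""
    h = np.arange(1, n_partials + 1)
    return h * f0 * np.sqrt(1.0 + B * h ** 2)

f0 = 261.6                                     # middle C, an assumed example pitch
ideal = partial_frequencies(f0, 18)            # B = 0: perfectly harmonic
stiff = partial_frequencies(f0, 18, B=0.0004)  # typical piano value from the text
# The 17th partial of the stiff string lands close to the ideal 18th partial.
print(stiff[16], ideal[17])
```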
In some cases, there are short unpitched excerpts in pitched sounds, mainly
in the initial part of the signal. For instance, during the attack stage of wind
instruments, the initial breath noise is present before the pitch is perceived.
Inharmonic sounds are also produced by the clicking of the keys of a clarinet,
the scratching of the bow of violin, or the sound of the hammer of a piano
hitting the string, for instance.
Additive synthesis, which was first extensively described by Moorer (1977),
is the basis of the original harmonic spectrum model, which approximates a
harmonic signal by a sum of sinusoids. A harmonic sound can be expressed as
a sum of H sinusoids with an error model ε:
x[n] = \sum_{h=1}^{H} A_h[n] \cos(2\pi f_h n + \phi_h(0)) + \epsilon[n] \qquad (2.18)
subtracted from the original sound and the remaining residual is represented as
a time varying filtered white noise component.
Recent parametric models, like the ones proposed by Verma and Meng (2000)
and Masri and Bateman (1996), extend the SMS model to consider transients.
When sharp onsets occur, the frames prior to an attack transient are similar,
and also the frames following its onset, but the central frame spanning both
regions is an average of both spectra that can be difficult to analyze.
Without considering noise or transients, in a very basic form, a harmonic
sound can be described with the relative amplitudes of its harmonics and their
evolution over time. This is also known as the harmonic pattern (or spectral
pattern). Considering only the spectral magnitude of the harmonics at a given
time frame, a spectral pattern p can be defined as a vector containing the
magnitude ph of each harmonic h:
p = {p1 , p2 , ..., ph , ..., pH } 2.19
This partial to partial amplitude profile is also referred to as the spectrum
envelope. Adding the temporal dimension to obtain the spectral evolution in
time, a harmonic pattern can be written in matrix notation:
P = {p1 , p2 , ..., pt , ..., pT } 2.20
In most musical sounds, the first harmonics contain most of the energy of
the signal, and sometimes their spectral envelope can be approximated using a
smooth curve.
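As an illustration of Eq. 2.19, the following sketch extracts a spectral pattern p from the magnitude spectrum of a frame given a known f0, taking the largest magnitude in a narrow band around each ideal harmonic position h·f0 (the search width and all parameter values are assumptions; this naive peak picking is not the partial selection used in this work):

```python
import numpy as np

def harmonic_pattern(mag, fs, f0, n_harmonics=10, tol=0.03):
    """Magnitudes p_h around the ideal harmonic positions h*f0 (Eq. 2.19).

    tol is the relative search width around each harmonic; a naive stand-in
    for proper partial tracking.
    """
    N = 2 * (len(mag) - 1)                     # original frame length (rfft output)
    pattern = np.zeros(n_harmonics)
    for h in range(1, n_harmonics + 1):
        fh = h * f0
        lo = int(np.floor(fh * (1 - tol) * N / fs))
        hi = int(np.ceil(fh * (1 + tol) * N / fs)) + 1
        if lo < len(mag):
            pattern[h - 1] = np.max(mag[lo:min(hi, len(mag))])
    return pattern

# Example: a synthetic harmonic tone with decaying partial amplitudes.
fs, N, f0 = 44100, 4096, 220.0
n = np.arange(N)
x = sum((1.0 / h) * np.sin(2 * np.pi * h * f0 * n / fs) for h in range(1, 9))
p = harmonic_pattern(np.abs(np.fft.rfft(x * np.hanning(N))), fs, f0, n_harmonics=8)
print(p / p[0])                                # roughly 1, 1/2, 1/3, ... as synthesized
```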
Inharmonic sounds
The pitched inharmonic sounds have a period in the time domain and a pitch,
but their overtone partials are not approximately integer multiples of the f0 .
Usually, a vibrating bar is the sound source of these instruments, belonging to
the idiophones family. The most common are the marimba, vibraphone (see
Fig. 2.11), xylophone and glockenspiel.
As the analysis of inharmonic pitched sounds is complex and these instru-
ments are less commonly used, most f0 estimation systems that analyze the
signal in the frequency domain do not handle them appropriately.
Most of these sounds are characterized by a sharp attack stage, that usually
shows a broad frequency dispersion (see Fig. 2.12). Interestingly, Fitzgerald and
Paulus (2006) comment that although synthetic7 drum sounds tend to mimic
real drums, their spectral characteristics differ considerably from those in real
drums.
Spectral centroid, bandwidth of the spectrum and spectral kurtosis are
features commonly used in unpitched sound classification.
The transcription of unpitched instruments refers to the identification
of the timbre class and the onset and offset times, as no pitch is present. This
task will not be addressed in the scope of this thesis, which is mainly focused on
the transcription of pitched sounds. For a review of this topic, see (FitzGerald,
2004) and (Fitzgerald and Paulus, 2006).
Figure 2.13: Example waveform and spectrogram of a singing male voice, vowel
A (file I-471TNA1M from Goto (2003), RWC database).
piece follows basic melodic, harmonic and rhythmic rules to be pleasing to most
listeners.
In this section, some terms related to the music structure in time and
frequency are described, followed by a brief explanation for understanding a
musical score and its symbols. Different score formats and representations
commonly used in computer music are also introduced.
Figure 2.14: Western note names in a piano keyboard. Only one octave is
labeled.
Musical temperaments
A musical note can be identified using a letter (see Fig. 2.14), and an octave
number. For instance, C3 refers to the note C from the third octave. Notes
separated by an octave are given the same note name. The twelve notes in each
octave are called pitch classes. For example, the note C3 belongs to the same
pitch class as C4.
8 An unison is an interval with a frequency ratio 1:1.
9 This frequency reference for tuning instruments was first adopted as the USA Standard
Pitch in 1925, and it was set as the modern concert pitch in May 1939. Before, a variety of
standard frequencies were used. For example, in the time of Mozart, the pitch A had a value
close to 422 Hz.
Figure 2.15: Musical major keys (uppercase), and minor keys (lowercase).
The number of alterations and the staff representation are shown. Fig. from
http://en.wikipedia.org/wiki/File:Circle_of_fifths_deluxe_4.svg.
There are 12 pitch classes, but only 7 note names (C, D, E, F, G, A, B). Each
note name is separated from the next by one tone, except F from E and C from B,
which are separated by a semitone. This is because modern music theory is based on the
diatonic scale.
The diatonic scale is a seven note musical scale comprising five whole steps and
two half steps, with the pattern repeating at the octave. The major scale is a
diatonic scale whose pattern of intervals in semitones is 2-2-1-2-2-2-1, starting
from a root note10 . For instance, the major diatonic scale with root note C is
built using the white keys of the piano. The natural minor scale has a pattern
of intervals 2-1-2-2-1-2-2.
In tonal music, a scale is an ordered set of notes typically used in a tonality
(also referred to as key). The tonality is the harmonic center of gravity of a musical
excerpt. Intervals in the major and minor scales are consonant intervals relative
to the tonic, therefore they are more frequent within a given key context.
Figure 2.16: Harmonics and intervals. The first nine harmonics of middle
C. Their frequencies and nearest pitches are indicated, as well as the Western
tonal-harmonic music intervals. Fig. from Krumhansl (2004).
A musical excerpt can be arranged in a major or a minor key (see Fig. 2.15).
Major and minor keys which share the same signature are called relative.
Therefore, C major is the relative major of A minor, whereas C minor is
the relative minor of E♭ major. The key is established by particular chord
progressions.
and minor sixth (8:5) are next most consonant. The least consonant intervals
in western harmony are the minor second (16:15), the major seventh (15:8) and
the tritone (45:32).
In music, consonant intervals are more frequent than dissonant intervals.
According to Kosuke et al. (2003), trained musicians find it more difficult to
identify pitches of dissonant intervals than those of consonant intervals.
It is hard to separate melody from harmony in practice (Krumhansl, 2004),
but harmonic and melodic intervals are not equivalent. For example, two notes
separated by one octave play the same harmonic role, although they are not
interchangeable in a melodic line.
The most elemental chord in harmony is the triad, which is a three note
chord with a root, a third degree (major or minor third above the root), and a
fifth degree (major or minor third above the third).
2.3.2 Rhythm
A coherent temporal structure is pleasing to most listeners. Music has a
rhythmic dimension, related to the placement of sounds at given instants and
their accents13 .
In the literature, there exist some discrepancies about the terminology used
to describe rhythm14 . An excellent study of the semantics of the terms used in
computational rhythm can be found in (Gouyon, 2008).
According to Fraisse (1998), a precise, generally accepted definition of
rhythm does not exist. However, as Honing (2001) points out, there seems
to be agreement that the metrical structure, the tempo (tactus) and the timing
are three main rhythmic concepts.
The metrical structure refers to the hierarchical temporal structure, the
tempo indicates how fast or slow a musical piece is, and the timing deviations
that occur in expressive performances are related to the temporal discrepancies
around the metrical grid.
Metrical structure
Figure 2.17: Diagram of relationships between metrical levels and timing. Fig.
from Hainsworth (2003).
to the preferred human foot tapping rate (Klapuri et al., 2006), or to the dance
movements when listening to a musical piece.
A measure constitutes a temporal pattern and it is composed of a number
of beats. In Western music, rhythms are usually arranged with respect to a
time signature. The time signature (also known as meter signature) specifies
how many beats are in each measure and what note value constitutes one beat.
One beat usually corresponds to the duration of a quarter note15 (or crochet) or
an eighth note (or quaver) in musical notation. A measure is usually 2, 3, or 4
beats long (duple, triple, or quadruple), and each beat is normally divided into
2 or 3 basic subdivisions (simple, or compound). Bar division is closely related
to harmonic progressions.
Unfortunately, the perceived beat does not always correspond with the one
written in a time signature. According to Hainsworth (2003), in fast jazz music,
the beat is often felt as a half note (or minim), i.e., double its written rate,
whereas hymns are often notated with the beat given in minims, the double of
the perceived rate.
The tatum16 , first defined by Bilmes (1993), is the lowest level of the metric
musical hierarchy. It is a high frequency pulse that we keep in mind when
perceiving or performing music. An intuitive definition of the tatum proposed
by Klapuri (2003b) describes it as the shortest durational value in music that is
still more than incidentally encountered, i.e., the shortest commonly occurring
time interval. It frequently corresponds to a binary, ternary, or quaternary
subdivision of the musical beat. The duration values of the other notes, with
few exceptions, are integer multiples of the tatum. The tatum is not written
in a modern musical score, but it is a perceptual component of the metrical
structure.
15 Note durations are shown in Table 2.1.
16 In honor of Art Tatum.
Tempo
The tempo (also referred to as tactus) indicates the speed of the underlying beat.
It is usually measured in bpm (number of beats per minute), and it is inversely
proportional to the beat period. Having a beat period Tb expressed in seconds,
the tempo can be computed as T = 60/Tb . Like other rhythmic components, it
can vary along a piece of music.
Timing
Usually, when a score is performed by a musician, the onset times of the
played notes do not exactly correspond with those indicated in the score. This
temporal deviation, known as timing deviation (see Fig. 2.17) is frequent in
musical signals. It can be produced either by slight involuntary deviations in
the performance, or by deliberate expressive rhythm alterations, like swing.
In music psychology, the emotional component of music is strongly associated
with music expressivity (Juslin et al., 2006). A musical piece can be performed
to produce different emotions in the listener (it can be passionate, sweet,
aggressive, humorous, etc). In the literature, there exist a variety of approaches
to map a song into a psychologically based emotion space, classifying it
according to its mood. Timing and metrical accents are frequently affected
by mood alterations in the performance.
American name    British name
Half             Minim
Quarter          Crotchet
Eighth           Quaver
Sixteenth        Semiquaver
Thirty-second    Demisemiquaver
Sixty-fourth     Hemidemisemiquaver

Table 2.1: Symbols used to represent the most frequent note and rest durations.
or decrease the pitch by one semitone, respectively. Notes with a pitch outside
of the range of the five line staff can be represented using ledger lines, which
provide a single note with additional lines and spaces.
Duration is shown with different note figures (see Table 2.1), and additional
symbols such as dots (·) and ties (^). Notation is read from left to right.
A staff begins with a clef, which indicates the pitch of the written notes.
Following the clef, the key signature indicates the key by specifying certain
notes to be flat or sharp throughout the piece, unless otherwise indicated.
The time signature appears after the key signature. Measures (bars) divide
the piece into regular groupings of beats, and the time signatures specify those
groupings.
Directions to the performer regarding tempo and dynamics are added above
or below the staff. In written notation, the term dynamics usually refers to the
intensity of the notes17 . The two basic dynamic indications in music are p or
piano, meaning soft, and f or forte, meaning loud or strong.
In modern notation, lyrics can be written for vocal music. Besides this
notation, others can be used to represent unpitched instruments (percussion
notation) or chord progressions (e.g., tablatures).
and represented in different ways. Musical software can decode symbolic data
and represent them in modern notation. Software like sequencers can also play
musical pieces in symbolic formats using a synthesizer.
Symbolic formats
Figure 2.19: Equal temperament system, showing the position of each note on the staff, its frequency, note name and MIDI note number. Fig. from Joe Wolfe, University of New South Wales (UNSW), http://www.phys.unsw.edu.au/jw/notes.html.
MIDI messages, along with timing information, can be collected and stored
in a standard MIDI file (SMF). This is the most extended symbolic file
format in computer music. The SMF specification was developed by the MIDI
Manufacturers Association (MMA). Large collections of files in this format can
be found on the web.
The main limitation of MIDI is that there exist musical symbols in modern
notation that cannot be explicitly encoded using this format. For example, pitch
names have a different meaning in music, but there is no difference between C♯
and D♭ in MIDI, as they share the same pitch number. In the literature, there
are a number of pitch spelling algorithms to assign contextually consistent letter
names to pitch numbers according to the local key context. A comparison of
these methods can be found in the Meredith and Wiggins (2005) review.
This is because MIDI is a sound-oriented code. It was designed as a protocol
to control electronic instruments that produce sound, and not initially intended
to represent musical scores. Another example of a sound related code is CSound
score format, which was also designed for the control and generation of sounds.
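A small sketch of the relation shown in Fig. 2.19 between MIDI note numbers, frequencies, and note names, assuming twelve-tone equal temperament with A4 = 440 Hz (MIDI note 69) and the common convention that MIDI note 60 is C4; the naive naming function also illustrates the pitch spelling limitation, since it always prints C♯ and never D♭:

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_freq(m, a4=440.0):
    """Frequency of MIDI note m in 12-tone equal temperament (A4 = 440 Hz, MIDI 69)."""
    return a4 * 2.0 ** ((m - 69) / 12.0)

def midi_to_name(m):
    """Naive spelling: always uses sharps, so C#4 and Db4 get the same label."""
    return NOTE_NAMES[m % 12] + str(m // 12 - 1)

for m in (60, 61, 69):
    print(m, midi_to_name(m), f"{midi_to_freq(m):.2f} Hz")
# 60 C4 261.63 Hz, 61 C#4 277.18 Hz, 69 A4 440.00 Hz
```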
Figure 2.20: Example (top) of a piano-roll and score representation (bottom) for an excerpt of the MIDI file RWC-MDB-C-2001-27 from Goto (2003), RWC database, W.A. Mozart variations on 'Ah Vous Dirai-je Maman', K.265/300e. Figs. obtained using Logic Pro 8.

2.4 Supervised learning

Supervised learning methods attempt to deduce a function from a training
set. The training data consist of pairs of input data (usually, vectors) and
desired outputs. After the learning stage, a supervised learning algorithm can
predict the value of the function for any valid input data. The basic concepts
to understand the supervised learning methods used in Chapter 6 for multiple
f0 estimation are briefly described next.

2.4.1 Neural networks

The multilayer perceptron (MLP) or multilayer neural network architecture
was first introduced by Minsky and Papert (1969). Citing Duda et al. (2000),
multilayer neural networks implement linear discriminants, but in a space where
the inputs have been nonlinearly mapped. The key power provided by such
networks is that they admit fairly simple algorithms where the form of the
nonlinearity can be learned from training data.

Fig. 2.21 shows an example of an MLP. This sample neural network is
composed of three layers: the input layer (3 neurons), a hidden layer (2 neurons)
and the output layer (3 neurons), connected by weighted edges.
[Figure 2.21: A multilayer perceptron, with input values feeding an input layer, a hidden layer, and an output layer connected by weight matrices 1 and 2.]
As pointed out by Hush and Horne (1993), time-delay neural networks (Waibel,
1989) are considered non-recurrent dynamic networks, although they are essentially
like static nets traversing temporal series. This kind of net can model
systems where the output y(t) depends on a limited time interval of the
input u(t):
y(t) = f[u(t − m), ..., u(t), ..., u(t + n)] \qquad (2.23)
Using this network, time series can be processed as a collection of static
input-output patterns, related in short-term as a function of the width of the
input window. The TDNN architecture is very similar to a MLP, but the main
difference is that the input layer is also fed with information about adjacent
time frames. Each hidden unit accepts input from a restricted (spatial) range
26 Also called activation function.
of positions in the input layer (see Fig. 2.22). A TDNN can be trained with the
same standard backpropagation algorithm used for a MLP.
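A minimal sketch of the forward pass of such a network, using the 3-2-3 topology of Fig. 2.21 with random weights and a sigmoid activation (illustrative only; training by backpropagation is not shown):

```python
import numpy as np

def sigmoid(a):
    """Logistic activation function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 3))   # weight matrix 1: input (3) -> hidden (2)
b1 = np.zeros(2)
W2 = rng.standard_normal((3, 2))   # weight matrix 2: hidden (2) -> output (3)
b2 = np.zeros(3)

def mlp_forward(x):
    """Forward pass: each layer is a linear map followed by the nonlinearity."""
    h = sigmoid(W1 @ x + b1)       # hidden layer activations
    y = sigmoid(W2 @ h + b2)       # output layer activations
    return y

print(mlp_forward(np.array([0.2, -0.5, 1.0])))
```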
[Figure: nearest neighbor example, marking the 1NN, 2NN, 3NN, and 4NN of a query point.]
27 More information about different metrics used for NN classification can be found in (Duda et al., 2000).
3. Music transcription
This chapter briefly addresses the human music transcription process, followed
by an analysis of the theoretical issues in automatic transcription from a signal
processing point of view. Finally, the onset detection task is also introduced.
The different metrics used for the evaluation of multiple f0 estimation and onset
detection methods are also discussed.
following events. We can identify relative pitch differences better than absolute
pitches. Another proof of the importance of the musical context is that pitches
can be very hard to identify in a confusing context like, for instance, when two
different songs are heard simultaneously.
Klapuri et al. (2000) performed a test to measure the pitch identification
ability of trained musicians using isolated chords. The results were compared
with those obtained using an automatic transcription system, and only the
two most skilled subjects performed better than the computational approach,
showing that it is not easy to analyze notes out of context.
Hainsworth (2003) proposed a test where trained musicians were asked to
describe how they performed transcription. A common pattern was found. The
first step was to do a rough structural analysis of the piece, breaking the song
into sections, finding repetitions, and in some cases marking key phrases. Then,
a chord scheme or the bass line was detected, followed by the melody. Finally,
the inner harmonies were heard by repeated listening, building up a mental
representation. According to Hainsworth (2003), “no-one transcribes anything
but simple music in a single pass”.
Auditory scene analysis (ASA) is a term proposed by Bregman (1990)
to describe the process by which the human auditory system organizes sound
into perceptually meaningful elements. In computational analysis, the related
concept is called computational auditory scene analysis (CASA), which is closely
related to source separation and blind signal separation. The key aspects of
the ASA model are segmentation, integration, and segregation. The grouping
principles of ASA can be categorized into sequential grouping cues (those that
operate across time, or segregated) and simultaneous grouping cues (those
that operate across frequency, or integrated). In addition, schemas (learned
patterns) play an important role. Mathematical formalizations to the field of
computational auditory perception have also been proposed, for instance, by
Smaragdis (2001) and Cont (2008).
The main advantage for humans when transcribing music is our unique
capability to identify patterns and our memory, which allows us to predict
future events. Using memory in computational transcription usually implies a
huge computational cost. It is not a problem to include short-term memory, but
finding long-term repetitions means keeping many hypotheses alive for various
frames, which is a very costly task. Resolving certain ambiguities that humans
can solve using long-term memory remains a challenge. An excellent analysis of
prediction, expectation and anticipation of musical events from a psychological
point of view is done by Cont (2008), who also proposes computational
anticipatory models to address several aspects of musical anticipation. Symbolic
music sequences can also be modeled and predicted to some degree (Paiement
et al., 2008), as they are typically composed of repetitive patterns.
Within a short context, a trained musician cannot identify any musical
information when listening to a 50 ms isolated segment of music. With a 100
ms long segment, some rough pitch estimation can be done, but it is still difficult
to identify the instrument. Using longer windows, the timbre becomes apparent.
However, multiple f0 estimation systems that perform the STFT can estimate
the pitches using short frames1. Therefore, computers can probably produce better
estimates in isolated time frames, but humans can transcribe better within a
wider context.
x[n] \approx \sum_{h=1}^{H} A_h \cos(h \omega_0 n + \phi_h) + z[n] \qquad (3.1)
where ω0 = 2πf0 . The relation in Eq. 3.1 is approximated for practical use, as
the signal can have some harmonic deviations.
A multiple f0 estimation method assumes that there can be more than one
harmonic source in the input signal. Formally, the sum of M harmonic sources
can be expressed as:
x[n] \approx \sum_{m=1}^{M} \sum_{h_m=1}^{H_m} A_{m,h_m} \cos(h_m \omega_m n + \phi_m[n]) + \bar{z}[n] \qquad (3.2)
Noise suppression techniques have been proposed in the literature2 to allow the
subtraction of additive noise from the mixture. And the third major issue is
that, besides the source and noise models, in multiple f0 estimation there is a
third model, which is probably the most complex: the interaction between the
sources.
For instance, consider two notes playing simultaneously within an octave
interval. As their spectrum shows the same harmonic locations as the lowest
note playing alone, some other information (such as the energy expected in
each harmonic for a particular instrument) is needed to infer the presence of
two notes. This issue is usually called octave ambiguity.
According to Klapuri (2004), in contrast to speech, the pitch range is wide3
in music, and the sounds produced by different musical instruments vary a lot in
their spectral content. The harmonic pattern of an instrument is also different
from low to high notes. Transients and the interference of unpitched content
in real music have to be addressed too.
On the other hand, in music the f0 values are temporally more stable than
in speech. Citing Klapuri (2004), it is more difficult to track the f0 of four
simultaneous speakers than to perform music transcription of four-voice vocal
music.
As previously discussed in Sec. 2.3.1, consonant intervals are more frequent
than dissonant ones in western music. Therefore, pleasant chords include
harmonic components of different sounds which coincide in frequency (harmonic
overlaps). Harmonic overlaps and beating are the main effects produced by the
interaction model, and they are described below.
A \cos(\omega n + \phi) = \sum_{k=1}^{K} A_k \cos(\omega n + \phi_k) \qquad (3.3)
Using trigonometric identities, the resulting amplitude (Yeh and Roebel, 2009)
can be calculated as:
A = \sqrt{\left[\sum_{k=1}^{K} A_k \cos(\phi_k)\right]^2 + \left[\sum_{k=1}^{K} A_k \sin(\phi_k)\right]^2} \qquad (3.4)
From this, the estimated amplitude A of two overlapping partials with the
same frequency, different amplitudes, and phase difference φ∆ is:
A = \sqrt{A_1^2 + A_2^2 + 2 A_1 A_2 \cos(\phi_\Delta)} \qquad (3.5)
As pointed out by Yeh and Roebel (2009), two assumptions are usually
made for analyzing overlapping partials: the additivity of the linear spectrum and
the additivity of the power spectrum. The additivity of the linear spectrum, A = A1 + A2,
assumes that the two sinusoids are in phase, i.e., cos(φ∆) = 1. The additivity of the
power spectrum, A = \sqrt{A_1^2 + A_2^2}, holds when cos(φ∆) = 0.
According to Klapuri (2003a), if one of the partial amplitudes is significantly
greater than the other, as is usually the case, A approaches the maximum of
the two. Looking at Eq. 3.5, this assumption is closely related to the additivity
of power spectrum, which experimentally (see Yeh and Roebel, 2009) obtains
better amplitude estimates than considering cos(φ∆ ) = 1.
Recently, Yeh and Roebel (2009) proposed an expected overlap model to get
a better estimation of the amplitude when two partials overlap, assuming that
the phase difference is uniformly distributed.
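As a quick numerical illustration (not from the thesis; the amplitudes and the phase difference below are arbitrary), the following C++ sketch evaluates Eq. 3.5 and compares it with the linear additivity, power additivity, and maximum-amplitude approximations discussed above.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Resulting amplitude of two overlapping partials with phase difference
// phiDelta, following Eq. 3.5.
double overlappedAmplitude(double A1, double A2, double phiDelta) {
    return std::sqrt(A1 * A1 + A2 * A2 + 2.0 * A1 * A2 * std::cos(phiDelta));
}

int main() {
    const double A1 = 1.0, A2 = 0.3;            // assumed partial amplitudes
    const double phiDelta = M_PI / 3.0;         // assumed phase difference
    std::printf("exact (Eq. 3.5):   %.4f\n", overlappedAmplitude(A1, A2, phiDelta));
    std::printf("linear additivity: %.4f\n", A1 + A2);                  // cos = 1
    std::printf("power additivity:  %.4f\n", std::sqrt(A1*A1 + A2*A2)); // cos = 0
    std::printf("max of the two:    %.4f\n", std::max(A1, A2));
    return 0;
}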
3.2.2 Beating
If two harmonics of different sources do not coincide in frequency4 , but they
have similar amplitude and small frequency difference, interference beats can
be perceived (see Fig. 3.1).
As pointed out by Wood (2008), p. 158, the physical explanation of
dissonance is that we hear unpleasant beats. Beats are periodic variations of
loudness, and their frequency depends on the frequency difference between the
two tones; for instance, two tones at 440 Hz and 442 Hz produce two beats per second.
Even when the frequency difference between two partials is not small enough
to produce perceptible beating, some amount of beating is always produced
between a pair of harmonics, even when they belong to the same source.
4 For instance, due to slight harmonic deviations.
Figure 3.1: Interference tones of two sinusoidal signals s1(t) and s2(t) of close frequencies.
Figure extracted from http://www.phys.unsw.edu.au/jw/beats.html
[Figure: magnitude spectrum (frequency in Hz) of a mixture of the notes C3 and G3, showing the beating frequency produced by two close partials.]
Frame-based evaluation
Within a frame level, a false positive (FP) is a detected pitch which is not
present in the signal, and a false negative (FN) is a missing pitch. Correctly
detected pitches (OK) are those estimated pitches that are also present in the
ground-truth.
A commonly used metric for frame-based evaluation is the accuracy, which
can be defined as:
Acc = \frac{\Sigma OK}{\Sigma OK + \Sigma FP + \Sigma FN} \qquad (3.6)
5 This term refers to the tracking of the f0 estimates along consecutive frames in order to
add a temporal continuity to the detection.
Prec = \frac{\Sigma OK}{\Sigma OK + \Sigma FP} \qquad (3.7)

Rec = \frac{\Sigma OK}{\Sigma OK + \Sigma FN} \qquad (3.8)
The balance between precision and recall, or F-measure, which is commonly
used in string comparison, can be computed as:

F\text{-}m = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec} \qquad (3.9)
The substitution, miss and false alarm errors are defined as follows:
E_{subs} = \frac{\sum_{t=1}^{T} \left[\min(N_{ref}(t), N_{sys}(t)) - N_{corr}(t)\right]}{\sum_{t=1}^{T} N_{ref}(t)} \qquad (3.11)
6 National Institute of Standards and Technology.
E_{miss} = \frac{\sum_{t=1}^{T} \max(0, N_{ref}(t) - N_{sys}(t))}{\sum_{t=1}^{T} N_{ref}(t)} \qquad (3.12)

E_{fa} = \frac{\sum_{t=1}^{T} \max(0, N_{sys}(t) - N_{ref}(t))}{\sum_{t=1}^{T} N_{ref}(t)} \qquad (3.13)
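A minimal C++ sketch of these frame-level metrics is given below (not from the thesis; the per-frame counts are hypothetical inputs). It derives accuracy, precision, recall and F-measure from the correct/false positive/false negative totals, and the substitution, miss and false alarm errors from the per-frame reference and system pitch counts.

#include <algorithm>
#include <cstdio>
#include <vector>

// Per-frame counts: reference pitches, system (estimated) pitches,
// and correctly detected pitches.
struct FrameCounts { int nRef, nSys, nCorr; };

int main() {
    // Hypothetical example data for T frames.
    std::vector<FrameCounts> frames = {{3,3,2},{2,3,2},{4,3,3},{1,1,1}};

    double ok = 0, fp = 0, fn = 0, subs = 0, miss = 0, fa = 0, ref = 0;
    for (const FrameCounts& f : frames) {
        ok   += f.nCorr;
        fp   += f.nSys - f.nCorr;
        fn   += f.nRef - f.nCorr;
        subs += std::min(f.nRef, f.nSys) - f.nCorr;      // Eq. 3.11 numerator
        miss += std::max(0, f.nRef - f.nSys);            // Eq. 3.12 numerator
        fa   += std::max(0, f.nSys - f.nRef);            // Eq. 3.13 numerator
        ref  += f.nRef;
    }
    const double acc  = ok / (ok + fp + fn);             // Eq. 3.6
    const double prec = ok / (ok + fp);                  // Eq. 3.7
    const double rec  = ok / (ok + fn);                  // Eq. 3.8
    const double fm   = 2.0 * prec * rec / (prec + rec); // Eq. 3.9
    std::printf("Acc=%.3f Prec=%.3f Rec=%.3f F-m=%.3f\n", acc, prec, rec, fm);
    std::printf("Esubs=%.3f Emiss=%.3f Efa=%.3f\n", subs/ref, miss/ref, fa/ref);
    return 0;
}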
Poliner and Ellis (2007a) suggest that, following the universal practice in the
speech recognition community, this is probably the most adequate measure,
since it gives a direct feel for the quantity of errors that will occur as a proportion
of the total quantity of notes present.
To summarize, three alternative metrics are used in the literature to evaluate
multiple f0 estimation systems within a frame level: accuracy (Eq. 3.6), F-
measure (Eq. 3.9), and total error (Eq. 3.10).
The accuracy is the most widely used metric for frame-by-frame evaluation.
The main reason for using accuracy instead of F-measure is that an equal
balance between precision and recall is probably less adequate for this task.
Typically, multiple f0 estimation methods obtain higher precision than recall.
This occurs because some analyzed mixtures contain many pitches with
overlapped harmonics that can be masked by the other components. An
experiment was carried out by Huron (1989) to study the limitations in
listeners' abilities to identify the number of concurrently sounding voices. The
most frequent type of confusion was the underestimation of the number of
sounds. Some pitches can be present in the signal yet be almost inaudible,
and they are also very difficult to detect analytically. For instance, when
listening to an isolated 93 ms frame with 6 simultaneous pitches, we usually
tend to underestimate the number of sources.
Note-based evaluation
Instead of counting the errors at each frame and summing the result for all
the frames, alternative metrics have been proposed to evaluate the temporal
continuity of the estimate. Precision, recall, F-measure and accuracy are also
frequently used for note level evaluation. However, it is not trivial to define
what is a correctly detected note, a false positive, and a false negative.
The note-based metric proposed by Ryynänen and Klapuri (2005) considers
a reference note correctly transcribed when its pitch equals that of the
transcribed note, the absolute difference between their onset times is smaller
than a given onset interval, and the transcribed note is not already associated
with another reference note. Results are reported using precision, recall, and
the mean overlap ratio, which measures the degree of temporal overlap between
the reference and transcribed notes.
Figure 3.3: Example of a guitar sound waveform. The actual onsets are marked
with dashed vertical lines.
Onsets can be roughly classified into those produced by unpitched percussive
sounds like drums, pitched percussive onsets like pianos, and pitched non-percussive
onsets like bowed strings. Unpitched and pitched percussive sounds
produce hard onsets, whereas pitched non-percussive timbres usually generate
soft onsets.
4 State of the art
This chapter presents an overview of previous studies for single f0 estimation,
followed by a deeper review and discussion of different multiple f0 estimation
and onset detection methods.
ACF_x[\tau] = \sum_{k=t}^{t+W-1} x[k]\, x[k+\tau] \qquad (4.1)
where τ represents the lag value. The peaks of this function correspond to
multiples of the fundamental period. Usually, autocorrelation methods select
the highest non-zero lag peak over a given threshold within a range of lags.
However, this technique is sensitive to formant structures, producing octave
errors. As Hess (1983) points out, some methods like center clipping (Dubnowski
1 Detecting the fundamental frequency in speech signals is useful, for instance, for prosody
analysis. Prosody refers to the rhythm, stress, and intonation of connected speech.
et al., 1976), or spectral flattening (Sondhi, 1968) can be used to attenuate these
effects.
The squared difference function (SDF) is a similar approach to measure
dissimilarities, and it has been used by de Cheveigné and Kawahara (2002) for
the YIN algorithm.
SDF_x[\tau] = \sum_{k=t}^{t+W-1} \left(x[k] - x[k+\tau]\right)^2 \qquad (4.2)
The main advantage of the normalized function SDF′ is that it tends to remain
large at low lags, dropping below 1 only where SDF falls below its average.
Basically, it removes dips and lags near zero, avoiding super-harmonic errors,
and the normalization makes the function independent of the absolute signal level.
An absolute threshold is set, choosing the first local minimum of SDF′ below
that threshold. If none is found, the global minimum is chosen instead. Once
the lag value τ is selected, a parabolic interpolation of the immediate neighbors is
done to increase the accuracy of the estimate, obtaining τ′, and the detected
fundamental frequency is finally set as f0 = fs/τ′.
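A condensed C++ sketch of this pipeline is shown below. It is not the reference YIN implementation; the cumulative-mean normalization, the threshold value and the window size are assumptions made for illustration, following the steps described above.

#include <cmath>
#include <vector>

// YIN-style f0 estimate for one frame: squared difference (Eq. 4.2),
// cumulative-mean normalization, absolute threshold, first local minimum,
// parabolic interpolation, and f0 = fs / tau'.
double estimateF0(const std::vector<double>& x, double fs,
                  int maxLag, double threshold = 0.1) {
    const int W = static_cast<int>(x.size()) - maxLag;   // assumed window size
    std::vector<double> d(maxLag + 1, 0.0), dn(maxLag + 1, 1.0);
    double cumSum = 0.0;
    for (int tau = 1; tau <= maxLag; ++tau) {
        for (int k = 0; k < W; ++k) {
            const double diff = x[k] - x[k + tau];
            d[tau] += diff * diff;                        // Eq. 4.2
        }
        cumSum += d[tau];
        dn[tau] = d[tau] * tau / (cumSum + 1e-12);        // normalized SDF'
    }
    int best = -1;
    for (int tau = 2; tau < maxLag; ++tau)
        if (dn[tau] < threshold && dn[tau] <= dn[tau - 1] && dn[tau] <= dn[tau + 1]) {
            best = tau;                                   // first local minimum
            break;
        }
    if (best < 0) {                                       // fallback: global minimum
        best = 1;
        for (int tau = 2; tau <= maxLag; ++tau)
            if (dn[tau] < dn[best]) best = tau;
    }
    double tauPrime = best;                               // parabolic interpolation
    if (best > 1 && best < maxLag) {
        const double a = dn[best - 1], b = dn[best], c = dn[best + 1];
        const double denom = a - 2.0 * b + c;
        if (std::fabs(denom) > 1e-12) tauPrime = best + 0.5 * (a - c) / denom;
    }
    return fs / tauPrime;
}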
YIN is a robust and reliable algorithm that has been successfully used as the
basis for singing voice transcription methods, like the one proposed by Ryynänen
and Klapuri (2004).
Cepstrum
The real cepstrum of a signal is the inverse Fourier transform of the logarithm
of the magnitude spectrum.
CEP_x[\tau] = IDFT\{\log(|DFT(x[n])|)\} \qquad (4.4)
It was introduced for fundamental frequency estimation of speech signals by
Noll and Schroeder (1964), who gave a complete methodology in (Noll, 1967).
ACF_x[\tau] = IDFT\{|DFT(x[n])|^2\} = \frac{1}{K} \sum_{k=0}^{K-1} \cos\!\left(\frac{2\pi\tau k}{K}\right) |X[k]|^2 \qquad (4.5)
Note that the cosine factor emphasizes the partial amplitudes at the
harmonic positions which are multiples of τ. The main difference between autocorrelation
and cepstrum is that autocorrelation uses the square of the DFT, whereas the
cepstrum takes its logarithm. Squaring the DFT raises the spectral
peaks but also the noise. Taking the logarithm flattens the spectrum, reducing the noise
but also the harmonic amplitudes.
Therefore, as pointed out by Rabiner et al. (1976), the cepstrum performs
a dynamic compression of the spectrum, flattening unwanted components
and increasing the robustness to formants but raising the noise level, whereas
autocorrelation emphasizes spectral peaks with respect to noise, but raises the
strength of spurious components.
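The following self-contained C++ sketch illustrates the cepstral approach (illustrative only; it uses a naive O(N²) DFT instead of the FFT, and the lag search range is an assumption). It computes the real cepstrum of Eq. 4.4 and picks the highest cepstral peak as the fundamental period.

#include <cmath>
#include <complex>
#include <vector>

// Real cepstrum of Eq. 4.4 via a naive DFT: IDFT{ log |DFT(x)| }.
std::vector<double> realCepstrum(const std::vector<double>& x) {
    const int N = static_cast<int>(x.size());
    std::vector<double> logMag(N), cep(N, 0.0);
    for (int k = 0; k < N; ++k) {
        std::complex<double> X(0.0, 0.0);
        for (int n = 0; n < N; ++n)
            X += x[n] * std::polar(1.0, -2.0 * M_PI * k * n / N);
        logMag[k] = std::log(std::abs(X) + 1e-12);        // avoid log(0)
    }
    for (int tau = 0; tau < N; ++tau)                      // inverse DFT (real part)
        for (int k = 0; k < N; ++k)
            cep[tau] += logMag[k] * std::cos(2.0 * M_PI * k * tau / N) / N;
    return cep;
}

// Pick the f0 as the highest cepstral peak within an assumed lag range.
double cepstrumF0(const std::vector<double>& x, double fs,
                  double fmin = 50.0, double fmax = 2000.0) {
    std::vector<double> cep = realCepstrum(x);
    const int tauMin = static_cast<int>(fs / fmax);
    const int tauMax = static_cast<int>(fs / fmin);
    int best = tauMin;
    for (int tau = tauMin; tau <= tauMax && tau < (int)cep.size(); ++tau)
        if (cep[tau] > cep[best]) best = tau;
    return fs / best;
}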
Both ACF and cepstrum-based methods can be classified as spectral location
f0 estimators.
Spectral autocorrelation
The main drawback of the spectral location f0 estimators is that they are
very sensitive to harmonic deviations from their ideal position. Some methods,
like the one proposed by Lahat et al. (1987), perform autocorrelation over the
spectrum.
ACFS_X[\tau] = \frac{2}{K} \sum_{k=0}^{K/2-\tau-1} |X[k]|\,|X[k+\tau]| \qquad (4.6)
signal and the model, via its parameters. The expectation maximization (EM) algorithm by
Moon (1996) can be used for maximum likelihood parameter estimation.
Figure 4.2: Two way mismatch procedure from Maher and Beauchamp (1994).
The first error measures the mismatch between each measured partial and its nearest harmonic neighbor in the
predicted sequence, whereas the second measures the mismatch between each
predicted harmonic and its nearest neighbor in the measured sequence. Each
match is weighted by the amplitudes of the observed peaks. This method tries to
reduce octave errors by applying a penalty to missing and extra harmonics relative
to the predicted pattern. The methodology was also used for duet3 separation4.
Cano (1998) introduced some modifications over the TWM to improve the
original SMS analysis developed by Serra (1997). These modifications include
a pitch-dependent analysis window with adaptive length, a more restrictive
selection of the spectral peaks to be considered, f0 tracking using short-term
history to choose between candidates with similar TWM error, a restriction of
the frequency range of possible candidates, and a discrimination between pitched
and unpitched parts.
Figure 4.3: Combinations of note models and the musicological model from
Ryynänen and Klapuri (2004).
7 The frequency difference in semitones between the measured f0 and the nominal pitch of
the modeled note.
Finally, the two models are combined into a network (see Fig. 4.3), and
the most probable path is found according to the likelihoods given by the note
models and the musicological model. The system obtained half the errors of
a simple f0 estimation rounded to MIDI pitch, proving the capability of
probabilistic models for this task.
2001) before subtracting them from the mixture. The weights of each candidate
are calculated again after smoothing, and the highest recalculated global weight
determines the resulting f0 . The process stops when the maximum weight
related to the signal-to-noise ratio (SNR) is below a fixed threshold.
Klapuri (2005) proposed an alternative method using an auditory filter
bank. The signal at each subband is compressed, half-wave rectified and low-pass
filtered. Then, similarly to the SACF, the results are combined across
channels, but in this method the magnitude spectra are summed across channels
to obtain a summary spectrum. The most salient f0 is computed using an
approximate 1/h spectral envelope model8, to remove the source from the
mixture while keeping in the residual most of the energy of the higher partials.
This model is similar to (Klapuri, 2006b), where the input signal is flattened
(whitened) to reduce timbre-dependent information, and the salience of each
f0 candidate is computed as a 1/h weighted sum of its partials. The same partial
weighting scheme is performed by Klapuri (2008) using a computationally
efficient auditory model.
The system introduced by Ryynänen and Klapuri (2005) embeds the multiple
f0 estimator from Klapuri (2005) into a probabilistic framework (see Fig.
4.5), similarly to the method from Ryynänen and Klapuri (2004) for singing
transcription. As in the latter work, a note event model and a musicological
model, plus a silence model are used, and note events are described using a
HMM for each pitch. The HMM inputs are the pitch difference between the
measured f0 and the nominal pitch of the modeled note, the f0 salience, and the
onset strength (positive changes in the estimated strengths of f0 values). The
musicological model controls transitions between note HMMs and the silence
model using note bigram probabilities which are dependent on the estimated
8 Spectral amplitudes decrease according to the partial number h, like in a sawtooth signal.
Figure 4.6: Overview of the joint estimation method from Yeh (2008).
key. Like in (Ryynänen and Klapuri, 2004), the acoustic and musicological
models are combined into a network whose optimal path is found using the
token-passing algorithm from Young et al. (1989).
Other examples of iterative cancellation methods are those proposed by Wan
et al. (2005), Yin et al. (2005), and Cao et al. (2007).
10 The mean time is an indication of the center of gravity of signal energy (Cohen, 1995). It
can be defined in the frequency domain as the weighted sum of group delays.
11 The colliding partials are estimated by linear interpolation of non-colliding neighbors.
The harmonic tracking method from Marolt (2004a,b) was integrated into a
system called SONIC (see Fig. 4.7) to transcribe piano music. The combination
of the auditory model outputs and the partial tracking neural network outputs
is fed into a set of time-delay neural networks (TDNN), each one corresponding
to a musical pitch. The system also includes an onset detection stage, which is
implemented with a fully connected neural network, and a module to detect
repeated note activations (consecutive notes with the same pitch). The
information about the pitch estimate is complemented with the output of the
repeated note module, yielding the pitch, length and loudness of each note.
The system is constrained to piano transcription, as the training samples are piano
sounds.
Figure 4.8: HMM smoothed estimation from Poliner and Ellis (2007a) for an
excerpt of Für Elise (Beethoven). The posteriorgram (pitch probabilities as a
function of time) and the HMM smoothed estimation plotted over the ground-
truth labels (light gray) are shown.
Reis et al. (2008c) use genetic algorithms12 for polyphonic piano transcription.
Basically, a genetic algorithm consists of a set of candidate solutions
(individuals, or chromosomes) which evolve through inheritance, selection,
mutation and crossover until a termination criterion is met. At each generation, the
quality (fitness) of each chromosome is evaluated, and the best individuals are
chosen to keep evolving. Finally, the best chromosome is selected as the solution.
In the method proposed by Reis et al. (2008c), each chromosome corresponds
to a sequence of note events, where each note has pitch, onset, duration and
intensity. The initialization of the population is based on the observed STFT
peaks. The fitness function for an individual is obtained by comparing the
original STFT with the STFT of synthesized versions of the chromosomes
given an instrument. The method is therefore constrained by a priori knowledge of the
instrument to be synthesized. The system was extended in Reis et al. (2008b)
by combining the genetic algorithm with a memetic algorithm (gene fragment
competition) to improve the quality of the solutions during the evolutionary
process.
12 Genetic algorithms are evolutionary methods based on Darwinian natural selection, proposed
by Holland (1992).
notes of that instrument playing the same pitch produce a very similar sound, as happens
with piano sounds. As an opposite example, a sax cannot be considered to have a fixed spectral
profile, as real sax sounds usually contain varying dynamics and expressive alterations, like
breathing noise, that do not sound the same way as other notes with the same pitch.
Figure 4.9: NMF example from Smaragdis and Brown (2003). The original
score and the obtained values for H and W using 4 components are shown.
Raczynski et al. (2007), Virtanen (2007), Vincent et al. (2007) and Bertin et al.
(2007).
The independent component analysis (ICA), introduced by Comon (1994),
is closely related to NMF. ICA can express a signal model as x = Wh,
where x and h are n-dimensional real vectors and W is a non-singular mixing matrix.
Citing Virtanen (2006), ICA attempts to separate sources by identifying latent
signals that are maximally independent.
As pointed out by Schmidt (2008), the differences between ICA and NMF
are the different constraints placed on the factorizing matrices. In ICA, rows
of W are maximally statistically independent, whereas in NMF all elements
of W and H are non-negative. Both ICA and NMF have been investigated by
Plumbley et al. (2002) and Abdallah and Plumbley (2003a, 2004) for polyphonic
transcription. In the evaluation done by Virtanen (2007) for spectrogram
factorization, the NMF algorithms yielded better separation results than ICA.
These methods have been successfully used for drum transcription (see
(FitzGerald, 2004) and (Virtanen, 2006)), as most percussive sounds have a
fixed spectral profile and they can be modeled using a single component.
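To illustrate the factorization idea in code (a sketch only, not the algorithms cited above; it uses the Lee and Seung multiplicative updates for the Euclidean cost and a naive constant initialization), a basic NMF in C++ could look like this:

#include <vector>

using Matrix = std::vector<std::vector<double>>; // row-major, entries >= 0

// Basic NMF V ~ W*H with multiplicative updates (Euclidean cost).
// V: F x T magnitude spectrogram, R: number of components.
void nmf(const Matrix& V, Matrix& W, Matrix& H, int R, int iterations) {
    const int F = static_cast<int>(V.size());
    const int T = static_cast<int>(V[0].size());
    W.assign(F, std::vector<double>(R, 0.5));    // naive constant initialization
    H.assign(R, std::vector<double>(T, 0.5));
    const double eps = 1e-9;
    for (int it = 0; it < iterations; ++it) {
        // H <- H .* (W^T V) ./ (W^T W H)
        for (int r = 0; r < R; ++r)
            for (int t = 0; t < T; ++t) {
                double num = 0, den = 0;
                for (int f = 0; f < F; ++f) {
                    double wh = 0;
                    for (int r2 = 0; r2 < R; ++r2) wh += W[f][r2] * H[r2][t];
                    num += W[f][r] * V[f][t];
                    den += W[f][r] * wh;
                }
                H[r][t] *= num / (den + eps);
            }
        // W <- W .* (V H^T) ./ (W H H^T)
        for (int f = 0; f < F; ++f)
            for (int r = 0; r < R; ++r) {
                double num = 0, den = 0;
                for (int t = 0; t < T; ++t) {
                    double wh = 0;
                    for (int r2 = 0; r2 < R; ++r2) wh += W[f][r2] * H[r2][t];
                    num += V[f][t] * H[r][t];
                    den += wh * H[r][t];
                }
                W[f][r] *= num / (den + eps);
            }
    }
}

In a transcription setting, each column of W would ideally capture the spectral pattern of one pitch or percussive sound, and each row of H its time-varying gain.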
Figure 4.10: Modified MP algorithm from Leveau et al. (2008) for the extraction
of harmonic atoms.
that are selected from a dictionary. At the first iteration of the algorithm,
the atom which gives the largest inner product with the analyzed signal is
chosen. Then, the contribution of this function is subtracted from the signal
and the process is repeated on the residue. MP minimizes the residual energy
by choosing at each iteration the atom most correlated with the residual. As a
result, the signal is represented as a weighted sum of atoms from the dictionary
plus a residual.
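The core greedy loop can be sketched in a few lines of C++ (illustrative only; the dictionary here is an arbitrary set of unit-norm vectors, not the harmonic atoms discussed below):

#include <cmath>
#include <utility>
#include <vector>

// Plain matching pursuit: at each iteration pick the dictionary atom with the
// largest inner product with the residual, subtract its contribution, and
// store the (atom index, weight) pair.
std::vector<std::pair<int, double>>
matchingPursuit(std::vector<double> residual,
                const std::vector<std::vector<double>>& dictionary, // unit-norm atoms
                int maxAtoms) {
    std::vector<std::pair<int, double>> decomposition;
    const size_t N = residual.size();
    for (int it = 0; it < maxAtoms; ++it) {
        int bestAtom = -1;
        double bestDot = 0.0;
        for (size_t a = 0; a < dictionary.size(); ++a) {
            double dot = 0.0;
            for (size_t n = 0; n < N; ++n) dot += residual[n] * dictionary[a][n];
            if (std::fabs(dot) > std::fabs(bestDot)) { bestDot = dot; bestAtom = (int)a; }
        }
        if (bestAtom < 0) break;
        for (size_t n = 0; n < N; ++n)                 // subtract the contribution
            residual[n] -= bestDot * dictionary[bestAtom][n];
        decomposition.push_back({bestAtom, bestDot});
    }
    return decomposition;
}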
The method proposed by Cañadas-Quesada et al. (2008) is based on the
harmonic matching pursuit (HMP) from Gribonval and Bacry (2003). HMP
is an extension of MP with a dictionary composed of harmonic atoms.
Within this context, a Gabor atom16 can be identified with a partial, and a
harmonic atom is a linear combination of Gabor atoms (i.e., a spectral pattern).
The algorithm from Cañadas-Quesada et al. (2008) extends HMP to avoid
inaccurate decompositions when there are overlapped partials, by maximizing the
smoothness of the spectral envelope of each harmonic atom. The smoothness
maximization algorithm is similar to the one proposed by Klapuri (2003a).
The performance of this method when dealing with harmonically related
simultaneous notes is further described by Ruiz-Reyes et al. (2009).
Leveau et al. (2008) propose a modified MP algorithm which can be applied
to the whole signal, instead of working on a frame-by-frame basis. The harmonic atom
extraction method is shown in Fig. 4.10. Molecules are defined as groups
of several atoms of the same instrument in successive time windows.
16 Gabor atoms are time-frequency atomic signal decompositions proposed by Gabor (1946,
1947). They are obtained by dilating, translating and modulating a mother generating
function.
describes the short-time power spectrum of a musical excerpt as a sum of power spectra with
time-varying weights, using a Gaussian noise for modeling the error.
The melody and bass lines are the most predominant harmonic structures in the high and low frequency regions,
respectively. First, the STFT is apportioned through a multirate filter bank, and
a set of candidate frequency components is extracted. Then, two band-pass
filters are used to separate the spectral components of the bass and the melody.
For each set of filtered frequency components, the method forms a probability
density function (PDF) of the f0. The observed PDF is considered as being
generated by a weighted mixture of harmonic-structure tone models. The model
parameters are estimated using the EM algorithm. To take into account the continuity
of the f0 estimate, the most dominant and stable f0 trajectory is selected
by tracking peak trajectories in the temporal transition of the fundamental
frequency PDFs. To do this, a salience detector selects salient promising peaks
in the PDFs, and agents driven by those peaks track their trajectories. The
system works in real time.
Kameoka et al. (2007) propose a method called harmonic temporal structured
clustering (HTC). This approach decomposes the power spectrum time
series into sequential spectral streams (clusters) corresponding to single sources.
This way, the pitch, intensity, onset, duration, and timbre features of each
source are jointly estimated. The input of the system is the observed
signal, characterized by its power spectrogram with log-frequency. The source
model (see Fig. 4.12) assumes smooth power envelopes with decaying partial
amplitudes. Using this model, the goodness of a partitioned cluster is calculated
using the Kullback-Leibler (KL) divergence. The model parameters are
estimated using the expectation-constrained maximization (ECM) algorithm
from Meng and Rubin (1993), which is computationally simpler than the EM
algorithm. In the evaluation done by Kameoka et al. (2007), the HTC system
outperformed the PreFEst results.

Figure 4.12: HTC spectral model of a single source from Kameoka et al. (2007).
The method from Li and Wang (2007) is similar to (Ryynänen and Klapuri,
2005) in the sense that the preliminary pitch estimate and the musical pitch
probability transition are integrated into a HMM. However, for pitch estimation,
Li and Wang (2007) use statistical tone models that characterize the spectral
shapes of the instruments. Kernel density estimation is used to build the
instrument models. The method is intended for single instrument transcription.
The method proposed by Groble (2008) extracts feature vectors from the
spectrum data. These vectors are scored against models computed as scaled
averages of audio samples of piano and guitar from a training set. The training
data consists of sample frames at different note attack levels and distances from
the note onset. The predicted pitches are determined using simple distance
metrics from the observed feature vector to the dataset feature vectors.
Few supervised learning methods have been used as the core methodology for
polyphonic music transcription. Probably, this is because they rely on the data
given in the learning stage, and in polyphonic real music the space of observable
data is huge. However, supervised learning methods have been successfully
applied to specific instrument transcription (usually piano) to reduce
the search space. The same issue occurs in database matching methods, as they
also depend on the ground-truth data.
In contrast to supervised approaches, unsupervised learning methods do
not need a priori information about the sources. However, they are suitable
for fixed time-frequency profiles (like piano or drums), whereas modeling harmonic
sounds with varying harmonic components remains a challenge (Abdallah
and Plumbley, 2004).
In general, music transcription methods based on Bayesian models are
mathematically complex and tend to have high computational costs, but
they provide an elegant way of modeling the acoustic signal. Statistical spectral
model methods are also complex, but they are computationally efficient.
Blackboard systems are general architectures and need to rely on other
techniques, like a set of rules (Martin, 1996) or supervised learning (Bello, 2000),
to estimate the fundamental frequencies. However, the blackboard integration
concept provides a promising framework for multiple f0 estimation.
Figure 4.14: General architecture of most onset detection systems: preprocessing of the audio signal, computation of an onset detection function o(t), and peak picking with thresholding (θ).
Energy-based detection
Transient-based detection
Although not every transient in a signal corresponds to an onset, almost all
musical sounds begin with a transient stage characterized by a non-stationary
part and an abrupt amplitude change.
The method proposed by Röbel (2005) performs the STFT to classify the
spectral peaks into transient peaks, which are potentially part of an attack, and
non transient peaks. This classification is based on the centroid of the time
domain energy of the signal segment related to the analyzed peak. A transient
statistical model determines whether the spectral peaks identified as transients
are produced by background noise or by an onset. The exact onset positions
are determined by estimating the starting time of the transient. The value of
the detection function is normalized, dividing the transient energy by the total
signal energy in the target frame, and a constant threshold is finally applied.
The hybrid approach from Duxbury et al. (2002) aims to detect hard onsets,
considering transient energy variations in the upper frequencies, and soft onsets,
using an FFT-based distance measure at the low frequencies. To do this, the signal
is first split into 5 bands using a constant-Q transform. A transient energy
measure is then used to find transient changes in the upper bands, whereas
the lowest band is analyzed using the standard Euclidean distance between
two consecutive FFT vectors. The detection function is based on the difference
between the signal for each band and a smoothed version of itself. The onsets
are detected using an automatic threshold based on a mixture of Gaussians or,
alternatively, on the derivative of the onset histogram. Finally, onsets across
bands are combined through a weighted scheme to yield the final estimate.
Supervised learning
The system proposed by Lacoste and Eck (2007) uses both the STFT and the
constant-Q transform in the preprocessing stage. The linear and logarithmic
frequency bins are combined with the phase plane to get the input features for
one or several feed-forward neural networks, which classify frames into onset
or non-onset. In the multiple network architecture, the tempo trace is also
estimated and used to condition the probability of each onset. This tempo
trace is computed using the cross-correlation of the onset trace with the onset
trace autocorrelation within a temporal window. A confidence measure that
weights the relative influence of the tempo trace is provided to the network.
Marolt et al. (2002) use a bank of 22 auditory filters to feed a fully connected
network of integrate-and-fire neurons. This network outputs a series of impulses
produced by energy oscillations, indicating the presence of onsets in the input
signal. Due to noise and beating, not all the impulses correspond to onsets.
To decide which impulses are real onsets, a multilayer perceptron trained with
synthesized and real piano recordings is used to yield the final estimates.
Support vector machines have also been used for onset detection by Kapanci
and Pfeffer (2004) and Davy and Godsill (2002) to detect abrupt spectral
changes.
Unsupervised learning
Unsupervised learning techniques like NMF and ICA have also been applied to
onset detection.
Wang et al. (2008) generate the non-negative matrices with the magnitude
spectra of the input data. The basis matrices are the temporal and frequency
patterns. The temporal patterns are used to obtain three alternative detection
functions: a first-order difference function, a psychoacoustically motivated
relative difference function, and a constant-balanced relative difference function.
These ODFs are similarly computed by inspecting the differences of the temporal
patterns.
5 Onset detection using a harmonic filter bank
A novel approach for onset detection is presented in this chapter. The audio
signal is analyzed through a 1/12 octave (one semitone) band-pass filter bank
simulated in the frequency domain, and the temporal derivatives of the filtered
values are used to detect spectral variations related to note onsets.
As previously described, many onset detection methods apply a preprocess-
ing stage by decomposing the signal into multiple frequency bands. In the
perceptually motivated onset detector proposed by Klapuri (1999), a set of
critical band filters is used. Scheirer (1998) uses a six band filter bank, each
one covering roughly one octave range, and Duxbury et al. (2002) performs a
sub-band decomposition of the signal.
The motivation of the proposed approach is based on the characteristics
of most harmonic pitched sounds. The first 5 harmonics of a tuned sound
coincide1 with the frequencies of other pitches in the equal temperament (see
Fig. 2.16). Another characteristic of these sounds is that usually most of their
energy is concentrated in the first harmonics. A one semitone filter bank is
composed of a set of triangular filters whose center frequencies coincide with
the musical pitches (see Fig. 5.1).
In the sustain and release stages of a sound, there can be slight variations
in the intensity (tremolo) and the frequency (vibrato) of the harmonics. For
instance, a harmonic peak at frequency bin k in a given frame can be
shifted to position k + 1 in the following frame. In this scenario, direct
spectra comparison, like the spectral flux (see Eq. 2.14), may yield false positives,
as intensity differences are detected. Using this musically motivated filter bank,
the value of the band whose center is close to k will be similar in both frames,
avoiding a false detection.
[Figure 5.1: One semitone filter bank; each band i yields an energy value b_i, for i = 1, ..., B.]
5.1 Methodology
To detect the beginnings of the notes in a musical signal, the method
analyzes the spectral information through a one semitone filter bank, computing
the band differences in time to obtain a detection function. Peaks in this
function are extracted, and those whose values are over a threshold are
considered onsets.
5.1.1 Preprocessing
From a digital audio signal, the STFT is computed, providing its magnitude
spectrogram. A Hanning window with 92.9 ms length is used, with a 46.4 ms
hop size. With these values, the temporal resolution achieved is ∆t = 46.4 ms,
and the spectral resolution is ∆f = 10.77 Hz.
2 http://grfia.dlsi.ua.es/cm/worklines/pertusa/onset/pertusa_onset.tgz
Using a 1/12 octave filter bank, the filter corresponding to the pitch G♯0
has a center frequency of 51.91 Hz, and the fundamental frequency of the next
pitch, A0, is 55.00 Hz; therefore, this spectral resolution is not enough to build
the lower filters. Zero padding was used to get more points in the spectrum.
Using a zero padding factor z = 4, three additional windows with all samples set
to zero were appended at the end of each frame before computing the STFT. With
this technique, a frequency resolution of ∆f = 10.77/4 = 2.69 Hz is eventually
obtained.
At each frame, the spectrum is apportioned among a one semitone filter bank
to produce the corresponding filtered values. The filter bank extends from 52
Hz (pitch G♯0) to the Nyquist frequency in order to cover all the harmonic range. When
fs = 22,050 Hz, B = 94 filters are used3, whose center frequencies correspond to
the fundamental frequencies of the 94 notes in that range. The filtered output
at each frame is a vector b with B elements (b ∈ ℝ^B):

b = \{b_1, b_2, \ldots, b_i, \ldots, b_B\} \qquad (5.1)
Each value b_i is obtained from the frequency response H_i of the corresponding
filter i applied to the spectrum. Eq. 5.2 is used4 to compute the filtered
values:
b_i = \sqrt{\sum_{k=0}^{K-1} \left(|X[k]| \cdot |H_i[k]|\right)^2} \qquad (5.2)
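A sketch of this preprocessing step in C++ is shown below (illustrative only; the triangular filter shape and the 51.91 Hz first band center follow the text, while the sampling rate, FFT size and the magnitude spectrum itself are assumed to be given):

#include <cmath>
#include <vector>

// Band energies b_i of a one semitone (1/12 octave) filter bank (Eq. 5.2),
// using triangular filters centered at the pitch fundamental frequencies.
std::vector<double> semitoneBands(const std::vector<double>& magSpectrum,
                                  double fs, int fftSize,
                                  int B = 94, double firstCenterHz = 51.91) {
    const double binHz = fs / fftSize;                        // spectral resolution
    std::vector<double> b(B, 0.0);
    for (int i = 0; i < B; ++i) {
        const double fc = firstCenterHz * std::pow(2.0,  i        / 12.0);
        const double fl = firstCenterHz * std::pow(2.0, (i - 1.0) / 12.0);
        const double fu = firstCenterHz * std::pow(2.0, (i + 1.0) / 12.0);
        double sum = 0.0;
        for (size_t k = 0; k < magSpectrum.size(); ++k) {
            const double f = k * binHz;
            if (f <= fl || f >= fu) continue;
            // Triangular frequency response |H_i[k]|, peaking at the band center.
            const double H = (f <= fc) ? (f - fl) / (fc - fl) : (fu - f) / (fu - fc);
            const double v = magSpectrum[k] * H;
            sum += v * v;
        }
        b[i] = std::sqrt(sum);                                // Eq. 5.2
    }
    return b;
}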
Like in other onset detection methods, such as (Bilmes, 1993), (Goto and Muraoka,
1995, 1996), and (Scheirer, 1998), a first order derivative function is used to
pick potential onset candidates. In the proposed approach, the derivative c_i[t] is
computed for each filter i:
c_i[t] = \frac{d}{dt}\, b_i[t] \qquad (5.3)
Figure 5.2: Example of the onset detection function o[t] for a piano melody,
RWC-MDB-C-2001 No. 27 from Goto (2003), RWC database.
The values for each filter must be combined to yield the onsets. In order to
detect only the beginnings of the events, the positive first order derivatives of
all the bands are summed at each time, whereas negative derivatives, which can
be associated with offsets, are discarded:
a[t] = \sum_{i=1}^{B} \max\{0, c_i[t]\} \qquad (5.4)
To normalize the onset detection function, the overall energy s[t] is also
computed (note that a[t] < s[t]):
s[t] = \sum_{i=1}^{B} b_i[t] \qquad (5.5)
The sum of the positive derivatives a[t] is divided by the sum of the filtered
values s[t] to compute a relative difference. Therefore, the onset detection
function o[t] ∈ [0, 1] is:
o[t] = \frac{a[t]}{s[t]} = \frac{\sum_{i=1}^{B} \max\{0, c_i[t]\}}{\sum_{i=1}^{B} b_i[t]} \qquad (5.6)
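A compact C++ sketch of Eqs. 5.3 to 5.6 plus a simple peak picking step is given below (illustrative only; the derivative is approximated by a first-order difference between consecutive frames, and the threshold θ is left as a free parameter):

#include <algorithm>
#include <vector>

// Onset detection function o[t] (Eq. 5.6) from the per-frame band energies
// bands[t][i], followed by peak picking over a constant threshold theta.
std::vector<int> detectOnsets(const std::vector<std::vector<double>>& bands,
                              double theta) {
    const int T = static_cast<int>(bands.size());
    std::vector<double> o(T, 0.0);
    for (int t = 1; t < T; ++t) {
        double a = 0.0, s = 0.0;
        for (size_t i = 0; i < bands[t].size(); ++i) {
            const double c = bands[t][i] - bands[t - 1][i];   // Eq. 5.3 (discrete)
            a += std::max(0.0, c);                            // Eq. 5.4
            s += bands[t][i];                                 // Eq. 5.5
        }
        o[t] = (s > 0.0) ? a / s : 0.0;                       // Eq. 5.6
    }
    std::vector<int> onsets;                  // frames where o[t] peaks above theta
    for (int t = 1; t + 1 < T; ++t)
        if (o[t] > theta && o[t] >= o[t - 1] && o[t] >= o[t + 1])
            onsets.push_back(t);
    return onsets;
}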
Fig. 5.2 shows an example of the onset detection function o[t] for a piano
excerpt, where all the peaks over the threshold θ were correctly detected onsets.
The previous methodology yields good results for instruments that have a sharp
attack, like a piano or a guitar. But for instruments with a very smooth attack,
like violins, more frames should be considered. For these sounds, Eq. 5.3 can be
replaced by:
\tilde{c}_i[t] = \sum_{j=1}^{C} j \cdot \left(b_i[t+j] - b_i[t-j]\right) \qquad (5.7)
\tilde{a}[t] = \sum_{i=1}^{B} \max\{0, \tilde{c}_i[t]\} \qquad (5.8)
With these equations, Eq. 5.5 must be replaced by Eq. 5.9 to normalize õ[t]
into the range [0, 1]:
\tilde{s}[t] = \sum_{i=1}^{B} \sum_{j=1}^{C} j \cdot b_i[t+j] \qquad (5.9)
\tilde{o}[t] = \frac{\tilde{a}[t]}{\tilde{s}[t]} = \frac{\sum_{i=1}^{B} \max\{0, \tilde{c}_i[t]\}}{\sum_{i=1}^{B} \sum_{j=1}^{C} j \cdot b_i[t+j]} \qquad (5.10)
Figure 5.3: Onset detection function for a polyphonic violin song (RWC-MDB-C-2001
No. 36 from Goto (2003), RWC database). (a) o[t]; (b) õ[t], with C = 1;
(c) õ[t], with C = 2. With C = 2, all the onsets were successfully detected
except one, which is marked with a circle.
Figure 5.4: Onsets from RWC-MDB-C-2001 No. 27 from Goto (2003), RWC
database, labeled with the Speech Filing System (SFS).
Figure 5.5: Onset detection (o[t]) precision and recall curves as a function of the
threshold θ, using a constant value for the silence threshold µ = 70.
Table 5.1: Onset detection results using the proposed database (ODB). The
table shows the number of correctly detected onsets (OK), false positives (FP),
false negatives (FN), merged onsets (M), doubled onsets (D), precision (P),
recall (R), and F-measure (F-m).
The detailed results using o[t] with these thresholds can be seen in Tab. 5.1.
The overall F-measure achieved was 84.48%.
In order to get a perceptual evaluation of the results, once the onset detection was
performed, new audio files10 were generated with CSound by adding a click sound
to the original waveform at the positions where the onsets were detected.
In order to compare the method with other approaches, two publicly available
onset detection algorithms were evaluated using the ODB database. The
experiments were done comparing the onset times obtained by BeatRoot11 and
aubio12 with the ground-truth onsets of the ODB database using the evaluation
methodology previously described.
BeatRoot, introduced by Dixon (2006), is a software package for beat
tracking, tempo estimation and onset detection. To evaluate the method, the
onset times were obtained using the BeatRoot-0.5.6 default parameters with the
following command:
System OK FP FN M D Pr % Re % F-m %
Pertusa et al. (2005) 1873 406 282 3 3 82.19 86.91 84.48
Brossier (2005) - aubio 1828 608 327 79 80 75.04 84.83 79.63
Dixon (2006) - BeatRoot 1526 778 629 21 21 66.23 70.81 68.45
Table 5.2: Comparison with other methods using the ODB database and o[t].
In the method from Dixon (2006), different onset detection functions based
on spectral flux, phase deviation, and complex domain13 can be selected.
The onset detection function values are normalized and a simple peak picking
algorithm is used to get the onset times.
Aubio is the implementation of the algorithm proposed by Brossier (2005),
submitted to the MIREX 2005 contest and previously described in Sec. 4.4.
Like in BeatRoot, the default parameters were used for the evaluation of this
method:
The results (Tab. 5.2) show that the proposed method outperforms these
two approaches using the ODB data set.
13 Based on the estimation of the expected amplitude and phase of the current bin according to the values observed in the previous frames.
Reference OK FP FN M D Pr % Re % F-m %
RWC-C02 47 35 60 0 0 57.32 43.93 49.74
RWC-C03 25 38 31 0 0 39.68 44.64 42.02
RWC-C36 45 5 0 0 0 90.00 100.00 94.74
RWC-C38 101 45 93 5 5 69.18 52.06 59.41
Realorgan3 11 13 4 0 0 45.83 73.33 56.41
14 http://alg.ncsa.uiuc.edu/do/tools/d2k
15 Music to knowledge, http://www.music-ir.org/evaluation/m2k/
16 International Music Information Retrieval Systems Evaluation Laboratory.
Participant OK FP FN M D Pr % Re % F-m %
Röbel (2009) 10 hd 7015 1231 2340 161 133 85.00 79.19 79.60
Röbel (2009) 7 hd 7560 2736 1795 188 257 81.32 83.30 79.00
Röbel (2009) 19 hdc 7339 2367 2016 185 212 80.56 81.88 78.31
Pertusa and Iñesta (2009) 6861 2232 2494 196 10 79.99 77.50 76.79
Röbel (2009) 16 nhd 6426 846 2929 148 183 86.39 73.62 76.48
Röbel (2009) 12 nhd 6440 901 2915 145 198 85.96 73.15 76.10
Tan et al. (2009) 1 6882 2224 2473 157 308 75.67 76.97 74.43
Tan et al. (2009) 2 6588 1976 2767 152 266 78.28 74.58 73.38
Tan et al. (2009) 3 5961 1703 3394 146 285 79.61 68.97 68.63
Tan et al. (2009) 5 7816 5502 1539 84 1540 62.88 83.69 68.23
Tan et al. (2009) 4 5953 1843 3402 135 345 78.98 68.91 67.94
Tzanetakis (2009) 5053 2836 4302 162 46 67.01 59.91 59.54
Table 5.4: Overall MIREX 2009 onset detection results ordered by F-measure.
The precision, recall and F-measure are averaged. The highest F-measure was
obtained using θ = 0.25.
Table 5.6: Detailed MIREX 2009 onset detection results for the proposed
method with the best θ for each class. The precision, recall and F-measure
are averaged. The best F-measure among the evaluated methods is also shown.
Figure 5.7: MIREX 2009 onset detection F-measure with respect to the threshold θ
for the different sound classes using the proposed method.
Bars and bells have percussive onsets and they are typically pitched,
although most of these sounds are inharmonic. Therefore, their energy may
not be concentrated in the central frequencies of the one semitone bands. In
the proposed method, when this happens and the harmonics slightly oscillate
in frequency, they can easily reach adjacent bands, causing some false positives.
Anyway, it is difficult to derive conclusions for this class of sounds, as only 4
files were used for the evaluation and the MIREX data sets are not publicly
available.
Interestingly, the proposed approach also yielded good results with unpitched
sounds, and it obtained the highest F-measure in solo-drum excerpts among all
the evaluated methods.
The best threshold value for poly-pitched and complex sounds was around
θ = 0.20, which coincides with the threshold experimentally obtained with the
ODB database. Using this threshold, the overall F-measure is only 1% lower
(see Fig. 5.7) than with the best threshold θ = 0.25 for the whole MIREX data
set, therefore the differences are not significant.
5.4 Conclusions
A novel and efficient approach for onset detection has been described in this
chapter. In the preprocessing stage, the spectrogram is computed and
apportioned through a one semitone filter bank. The onset detection function
is the normalized sum of the temporal derivatives of each band, and the peaks
in the detection function over a constant threshold are identified as onsets.
A simple variation has been proposed, considering adjacent frames in order
to improve the accuracy for non-percussive pitched onsets. In most situations,
õ[t] yields lower results than o[t], therefore it is only suitable for a few specific
sounds.
The method has been evaluated and compared with other works in the
MIREX (2009) audio onset detection contest. Although the system is mainly
designed for tuned pitched sounds, the results are competitive for most timbral
categories, except for speech or inharmonic pitched sounds.
As the abrupt harmonic variations produced at the beginning of the notes are
emphasized and those produced in the sustain stage are minimized, the system
performs reasonably well against smooth vibratos lower than one semitone.
Therefore, o[t] is suitable for percussive harmonic onsets, but it is also robust
to frequency variations in the sustained sounds.
When a portamento occurs, the system usually detects a new onset when
the f0 increases or decreases more than one semitone, resulting in some false
positives. However, this is not a drawback if the method is used for multiple
pitch estimation.
6 Multiple pitch estimation using supervised learning methods
6.1 Preprocessing
Supervised learning techniques require a set of input features aligned with
the desired outputs. In the proposed approach, a frame by frame analysis
is performed, building input-output pairs at each frame. The input data are
spectral features, whereas the outputs consist of the ground-truth pitches. The
details of the input and output data and their construction are described in this
section.
of context units in the input layer connected with the hidden layer units.
Input data
The training data set consists of musical audio files at fs = 22,050 Hz
synthesized from MIDI sequences. The STFT of each musical piece is
computed, providing the magnitude spectrogram using a 93 ms Hanning window
with a 46.4 ms hop size. With these parameters, the time resolution of
the spectral analysis is ∆t = 46.4 ms, and the highest possible frequency is
fs/2 = 11,025 Hz, which is high enough to cover the range of useful pitches.
Like in the onset detection method, zero padding has been used to build the
lower filters.
In the same way as described in Chapter 5 for onset detection, the spectral
values at each frame are apportioned into B = 94 bands using a one semitone
filter bank ranging from 50 Hz (G♯0) to fs/2, almost eight octaves, yielding a
vector of filtered values b[t] at each frame.
These values are converted into decibels and set as attenuations from the
maximum amplitude, which is 96 dB4 with respect to quantization noise. In
order to remove noise and low intensity components at each frame, a threshold
ξ is applied for each band in such a way that, if bi [t] < ξ, then bi [t] = ξ. This
threshold was empirically established at ξ = −45 dB. This way, the input data
is within the range b[t] ∈ [ξ, 0]B .
Information about adjacent frames is also considered to feed the classifiers.
For each frame at time t, the input is a set of spectral features {b[t + j]} for
j ∈ [−m, +n], where m and n are the number of spectral frames considered before
and after the frame t, respectively.
Output data
For each MIDI file, a binary digital piano-roll (BDP) is obtained to get the
active pitches (desired output) at each frame. A BDP is a matrix where each
row corresponds to a frame and each column corresponds to a MIDI pitch (see
Fig. 6.1). Therefore, at each frame t, n + m + 1 input vectors b[t + j] for
j ∈ [−m, +n] and a vector of pitches ν[t] ∈ {0, 1}B are shown to the supervised
method during the training stage.
4 Given the audio bit depth of the generated audio files (16 bits), the maximum amplitude is 96 dB above the quantization noise.
Figure 6.1: Binary digital piano-roll coding in each row the active pitches at
each time frame when the spectrogram is computed.
\bar{b}_i[t] = \frac{1}{\xi/2}\left(\xi + b_i[t]\right) - 1 \qquad (6.1)
This way, the input data b_i[t] ∈ [ξ, 0] are mapped into b̄_i[t] ∈ [−1, +1] for
the network input. Each of these values, which corresponds to one spectral
component, is provided to a neuron in the input layer. The adjacent frames
provide the short-context information. For each frame considered, B new input
units are added to the network, so the total number of input neurons is B(n +
m + 1).
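A small C++ helper illustrating this input preparation (a sketch, not code from the thesis): it clamps each band value to ξ = −45 dB and linearly maps the range [ξ, 0] onto [−1, +1], as described in the text.

#include <algorithm>
#include <vector>

// Map band values (attenuations in dB, within [xi, 0]) onto [-1, +1] for the
// network input, after clamping low-intensity components to xi.
std::vector<double> normalizeBands(std::vector<double> b, double xi = -45.0) {
    for (double& v : b) {
        v = std::max(v, xi);                    // if b_i < xi, set b_i = xi
        v = 2.0 * (v - xi) / (0.0 - xi) - 1.0;  // linear map [xi, 0] -> [-1, +1]
    }
    return b;
}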
The network output layer is composed of B = 94 neurons, one for each
possible pitch. The output is coded in such a way that an activation value of
yk [t] = 1 for a particular unit k means that the k-th pitch is active at that
frame, whereas yk [t] = 0 means that the pitch is not active.
The TDNN has been implemented with bias5 and without momentum6 . The
selected transfer function f (x) is a standard sigmoid (see Fig. 6.3):
f(x) = \frac{2}{1 + e^{-x}} - 1 \qquad (6.2)
5 A bias neuron lies in one layer and is connected to all the neurons in the next layer, but to none of the previous layer.
6 Momentum, loosely based on the physical notion that objects in motion tend to stay in motion unless acted upon by outside forces, allows the network to learn more quickly when there exist
plateaus in the error surface (Duda et al., 2000).
Figure 6.2: TDNN architecture and data supplied during training. The arrows
represent full connection between layers.
Figure 6.3: The sigmoid transfer function f(x), with output values in [−1, +1].
After performing the transfer function, the output values for the neurons are
within the range yk [t] ∈ [−1, +1]. A pitch is detected when yk [t] > α. Therefore,
the activation threshold α controls the sensitivity of the network (the lower is
α, the more likely a pitch is activated).
A_p[t] = \sum_{i=0}^{k} \nu_p^{(i)}[t] \qquad (6.3)
where ν_p^{(i)}[t] ∈ {0, 1} indicates the presence of pitch p in the
prototype ν^{(i)}[t]. Then, a low level threshold ζ is established as a fraction of
k, and only the pitches which satisfy A_p[t] ≥ ζ in the neighborhood are
considered to be active at the frame t. This way, the method can infer new
prototypes that are not present in the training stage.
Instead of using the number of pitch occurrences as an activation criterion,
additional experiments have been done using weighted distances, summing the
multiplicative inverse of the Euclidean distance d_i[t] of each neighbor i to
increase the importance of the pitches that are close to the test sample:
A'_p[t] = \sum_{i=0}^{k} \nu_p^{(i)}[t] \, \frac{1}{d_i[t] + 1} \qquad (6.4)
A third activation function has been proposed, taking into account the
normalized distances:
A''_p[t] = \sum_{i=0}^{k} \nu_p^{(i)}[t] \, \frac{1}{k-1}\left(1 - \frac{d_i[t]}{\sum_{\forall i} d_i[t]}\right) \qquad (6.5)
In all these cases, if the activation function obtains a value greater than ζ,
then the pitch p is added to the prototype yielded at the target frame t.
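The three activation criteria can be sketched as follows in C++ (illustrative only; the neighbor prototypes and distances are assumed to come from a kNN search that is not shown, and the normalized-distance variant follows the reconstruction of Eq. 6.5 above):

#include <vector>

// One neighbor returned by the kNN search: its pitch prototype (one flag per
// pitch in the vocabulary) and its Euclidean distance to the test frame.
struct Neighbor {
    std::vector<int> pitches;   // nu_p^(i)[t], values in {0, 1}
    double distance;            // d_i[t]
};

// Pitches activated at the target frame according to the chosen criterion:
// 0 -> A_p (Eq. 6.3), 1 -> A'_p (Eq. 6.4), 2 -> A''_p (Eq. 6.5).
std::vector<int> activePitches(const std::vector<Neighbor>& nn,
                               double zeta, int criterion) {
    const size_t B = nn.empty() ? 0 : nn[0].pitches.size();
    const double k = static_cast<double>(nn.size());
    double sumD = 0.0;
    for (const Neighbor& n : nn) sumD += n.distance;
    if (sumD <= 0.0) sumD = 1.0;                       // avoid division by zero

    std::vector<int> active(B, 0);
    for (size_t p = 0; p < B; ++p) {
        double A = 0.0;
        for (const Neighbor& n : nn) {
            if (!n.pitches[p]) continue;
            if (criterion == 0)      A += 1.0;                                  // A_p
            else if (criterion == 1) A += 1.0 / (n.distance + 1.0);             // A'_p
            else                     A += (1.0 - n.distance / sumD) / (k - 1.0);// A''_p
        }
        active[p] = (A >= zeta) ? 1 : 0;
    }
    return active;
}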
6.3 Evaluation
A data set of MIDI sequences was utilized for the evaluation of the proposed
methods, obtaining input/output pairs from the MIDI files and the synthesized
audio. Then, 4-fold cross-validation experiments were performed, making four
subexperiments by dividing the data set into four parts (3/4 for training and 1/4
for test). The presented results were obtained by averaging the subexperiments
carried out on each data subset. The accuracy of the method is evaluated at
the frame-by-frame and note levels.
The frame (or event) level accuracy is the standard measure for multiple
pitch estimation described in Eq. 3.6. A novel, relaxed metric has been proposed
to evaluate the system at the note level. Notes are defined as series of consecutive
event detections along time. A false positive note is detected when an isolated
series of consecutive false positive events is found. A false negative note is
defined as a sequence of isolated false negative events, and any other sequence
of consecutive event detections is considered a successfully detected note.
Eq. 3.6 is also used for note level accuracy, considering false positive, false
negative and correctly detected notes.
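The following C++ sketch is one interpretation of these note-level definitions (it is not code from the thesis): it groups consecutive frame-level detections of one pitch into notes and counts correctly detected, false positive and false negative notes.

#include <vector>

// Note-level counts for one pitch: consecutive active frames form a "note".
// A run containing at least one correctly detected event is a correct note;
// a run made only of detections is a false positive note; a run made only of
// reference events is a false negative note.
struct NoteCounts { int ok = 0, fp = 0, fn = 0; };

NoteCounts countNotes(const std::vector<bool>& ref, const std::vector<bool>& det) {
    NoteCounts c;
    const size_t T = ref.size();          // ref and det are assumed equally long
    size_t t = 0;
    while (t < T) {
        if (!ref[t] && !det[t]) { ++t; continue; }
        bool anyCorrect = false, anyDet = false;
        while (t < T && (ref[t] || det[t])) {        // scan one run of activity
            anyCorrect = anyCorrect || (ref[t] && det[t]);
            anyDet = anyDet || det[t];
            ++t;
        }
        if (anyCorrect) ++c.ok;
        else if (anyDet) ++c.fp;
        else ++c.fn;
    }
    return c;
}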
Sinusoidal waveshape
This is the simplest periodic wave. Almost all the spectral energy is concentrated
in the f0 component.
Sawtooth waveshape
This sound contains all the harmonics, with amplitudes proportional to 1/h,
where h is the harmonic number. Only the first H = 10 harmonics were used
to generate this sound.
Clarinet waveshape
The clarinet sound is generated using a physical model of a clarinet with the
wgclar CSound opcode, which produces a good imitative synthesis.
[Figure 6.8: Pitch detection accuracy as a function of pitch, from C1 to C8.]
The results obtained for the clarinet and the Hammond suggest that the methodology can be applied to
other instruments characterized by a nearly stable amplitude envelope.
The errors have been analyzed considering note length, pitch, and number
of training samples. Errors produced by notes shorter than 100 ms represent
31% of the total amount of errors. With a time resolution of ∆t = 46 ms,
these notes extend along one or two frames. Since most of the false negatives
occur at the beginning and end of the notes, these very short notes, which are
not usual in real music, are sometimes missed.
As shown in Fig. 6.8, most pitch errors correspond to very high (higher
than C7) and very low (lower than C3) pitches, which are very infrequent in real
music, whereas the method has a very high success rate in the central range of
pitches. This effect is partially related to the amount of musical pitches in
the training set, which is composed of musical data. The most frequent musical
pitches are those in the central range. There is a clear correlation between the
recognition success for a given pitch and the amount of events in the training set
for that pitch. In Fig. 6.9, each dot represents a single pitch. Abscissas represent
the amount of data for that pitch in the training set, whereas ordinates represent
the recognition accuracy. An exponential curve has been fitted to the data,
showing the clear nonlinear correlation between the amount of training data
and the performance.
Another reason for these errors is that the lowest pitches are harder to
detect due to the higher frequency precision required, and the highest pitches have
fewer harmonics below the Nyquist frequency. Moreover, the harmonics of the highest pitches can
Figure 6.9: TDNN correlation between recognition rates for each pitch and the
amount of events in the training set for that pitch.
also produce aliasing when they are synthesized. Anyway, most of the wrong
estimates correspond to very unusual notes that were artificially introduced in
the data set to spread out the pitch range, and which are not common in real
music.
A graphical example of the detection is shown in Fig. 6.10. This musical
excerpt was neither in the training set nor in the recognition set. It was
synthesized using the clarinet timbre11 , and with a fast tempo (120 bpm). In
this example, the event detection accuracy was 0.94, and most of the errors were
produced in the note onsets or offsets. Only 3 very short false positive notes
were detected.
Figure 6.10: Temporal evolution of the note detection for a given melody using the clarinet timbre. Top: the original score;
center: the melody as displayed in a sequencer piano-roll; down: the piano-roll obtained from the network output compared
with the original piano-roll. Notation: ‘o’: successfully detected events, ‘+’: false positives, and ‘-’: false negatives.
Table 6.2: Frame level cross-detection results using TDNN. Rows correspond
to training timbres and columns to test timbres.
Table 6.3: Note level cross-detection results using TDNN. Rows correspond to
training timbres and columns to test timbres.
The accuracy ranges from 0.089 to 0.65 for pitch recognition of sounds
which are different from those used to train the TDNN, showing the network
specialization. The cross-detection value could be an indication of the similarity
between two timbres, but this assumption needs further in-depth study.
Figure 6.11: Event detection accuracy using Ap for the sinusoidal timbre with
respect to k and ζ.
have provided the best results for event and note detection (see Tabs. 6.4 and
6.5). When k becomes large, the accuracy decreases, and good values for k
are relatively small (from 20 to 50). The behavior is similar for all the tested
timbres. In most cases, A_p obtains a significantly higher accuracy than when
using only one nearest neighbor.
No significant differences were found comparing the best results for A'_p and
A_p. However, when using A'_p, the number of neighbors does not affect the
results much (see Fig. 6.12). The highest accuracy was obtained with ζ ∈ {k/200, k/300}.
The best results for most timbres were obtained using A''_p with k = 20.
Tabs. 6.4 and 6.5 show the success rate for events and notes using k = 20,
which is the best k value among those tested for most timbres (except for the
sinusoidal waveshape, where k = 50 yielded a slightly higher accuracy). It can
be seen that A''_p obtains the highest accuracy for most timbres. Anyway, the
best results are significantly worse than those obtained using the TDNN.
6.6 Conclusions
In this chapter, different supervised learning approaches for multiple pitch
estimation have been presented. The input/output pairs have been generated
by sequencing a set of MIDI files and synthesizing them using CSound. The
magnitude STFT apportioned through one semitone filter-bank is used as input
Figure 6.12: Event detection accuracy using A'_p for the sinusoidal timbre with
respect to k and ζ.
Figure 6.13: Event detection accuracy using A''_p for the sinusoidal timbre with
respect to k and ζ.
data, whereas the outputs are the ground-truth MIDI pitches. Two different
supervised learning methods (TDNN and kNN) have been used and compared
for this task using simple stationary sounds and taking into account adjacent
spectral frames.
The TDNN performed far better than the kNN, probably due to the huge
space of possible pitch combinations. The results suggest that the neural
network can learn a pattern for a given timbre, and it can find it in complex
mixtures, even in the presence of beating or harmonic overlap. The success
rate was similar on average for the different timbres tested, independently of the
complexity of the pattern, which is one of the points in favour of this method.
The performance using the nearest neighbors is clearly worse than that of the TDNN approach. Different alternatives were proposed to generalize, in some way, the prototypes matched by the kNN technique, in order to obtain new classes (pitch combinations) not seen in the training stage. However, these modifications did not significantly improve the accuracy. An interesting conclusion from this comparison is that kNN techniques are not a good choice for classification when there exist many different prototype labels, as in this particular task.
With respect to the TDNN method, errors are concentrated in the very low and very high frequencies, probably due to the sparse presence of these pitches in the training set. This suggests that the accuracy could be improved by increasing the size and variety of the training set. In the temporal dimension, most of the errors are produced at the note boundaries, which are not very relevant from a perceptual point of view. This is probably caused by the window length, which can cover transitions between different pitch combinations. When the test waveshape was different from that used to train the net, the recognition rate decreased significantly, showing the high specialization of the network.
The main conclusions are that a TDNN approach can accurately estimate the pitches in simple waveforms, and that the compact input obtained with a one-semitone filter bank is representative of the spectral information needed for harmonic pitch estimation. Future work includes testing the feasibility of this approach for real mixtures of sounds with varying temporal envelopes, but this requires a large labeled data set for training, and it is difficult to obtain musical audio pieces perfectly synchronized with the ground-truth pitches. Nevertheless, this is a promising method that should be investigated in depth with real data.
It also seems reasonable to provide the algorithm with a first timbre recognition stage, at least at the instrument family level. This way, different weight sets could be loaded into the net according to the decision taken by the timbre recognition algorithm before starting the pitch estimation.
7 Multiple fundamental frequency estimation using signal processing methods
and struck string instruments such as piano, guitar and pizzicato violin.
The methods described in this chapter are implemented in C++, and they can be compiled and executed from the command line in Linux and Mac OS X. Two standard C++ libraries have been used: one for loading the audio files (libsndfile2) and one for computing the Fourier transforms (FFTW3, from Frigo and Johnson (2005)). The rest of the code, including the generation of MIDI files, has been implemented by the author.
7.1 Iterative cancellation method

7.1.1 Preprocessing
In the preprocessing stage, the magnitude spectrogram is obtained by performing the STFT with a 93 ms Hanning-windowed frame and a 46 ms hop size. This window size may seem long for typical signal processing algorithms, but the pitch margin for chord identification is wide (Klapuri, 2003a), and this is also the frame length used in many previous methods for multiple f0 estimation, such as (Klapuri, 2006b; Ryynänen and Klapuri, 2005). Using these spectral parameters, the temporal resolution achieved is 46 ms. Zero padding has been used, multiplying the original size of the window by a factor z to complete it with zeroes before computing the STFT. With this technique, the frequency of the lower pitches can be more precisely estimated.
Then, at each frame, a sinusoidal likeness measure (SLM) is calculated to identify the spectral peaks that correspond to sinusoids, making it possible to discard noise peaks.
2 http://www.mega-nerd.com/libsndfile/
[Block diagram of the iterative cancellation method: Waveform → Candidate selection → Iterative cancellation → Postprocessing → MIDI pitches.]
Figure 7.2: Example magnitude spectrum (top) and SLM (bottom) for two
sounds in an octave relation (92.5 and 370 Hz) using W = 50 Hz. The
fundamental frequencies are indicated with arrows.
|H|^2_\Omega = \sum_{k,\,|\Omega - \omega_k| < W} |H(\Omega - \omega_k)|^2    (7.2)

|X|^2_\Omega = \sum_{k,\,|\Omega - \omega_k| < W} |X(\omega_k)|^2    (7.3)

v_\Omega = \frac{|\Gamma(\Omega)|}{|H|_\Omega\, |X|_\Omega}    (7.4)
An efficient implementation of v_\Omega has been chosen, using the method proposed by Virtanen (2000). The cross-correlation in the frequency domain is performed through a multiplication of the time-domain signals, and \Gamma(\omega) is calculated using the DFT of x[n] windowed twice with the Hanning function. The calculation of |X|^2_\Omega can be implemented with an IIR filter which has only two non-zero coefficients: one delay takes a cumulative sum of the signal and the other subtracts the values at the end of the window.
After this process, a sinusoidal likeness function (see Fig. 7.2) is obtained at each frame. The harmonic peak selection is done as follows: if there is a peak in the SLM function whose value is v_\Omega > \tau, \tau being a constant threshold, then the original spectral component at the same frequency \Omega, with its original amplitude, is added to the harmonics list. The spectral components that do not satisfy the previous condition are discarded. Therefore, the mid-level representation of the proposed method consists of a sparse vector containing only certain values of the original spectrum (those ideally corresponding to partials). This sparse representation reduces the computational cost with respect to the analysis of all spectral peaks.
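A minimal sketch of this harmonic peak selection is given below; it assumes that the SLM values and the magnitude spectrum are sampled on the same frequency grid, and the function and variable names are illustrative rather than taken from the actual implementation.

// Keep only the original spectral values at SLM peaks above the threshold tau;
// the result is the sparse mid-level representation described in the text.
#include <cstddef>
#include <vector>

std::vector<double> selectHarmonicPeaks(const std::vector<double>& spectrum,
                                        const std::vector<double>& slm,
                                        double tau) {
    std::vector<double> sparse(spectrum.size(), 0.0);
    for (std::size_t k = 1; k + 1 < slm.size(); ++k) {
        const bool isPeak = slm[k] > slm[k - 1] && slm[k] >= slm[k + 1];
        if (isPeak && slm[k] > tau)
            sparse[k] = spectrum[k];   // original amplitude at the peak frequency
    }
    return sparse;
}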
p = {1, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01}    (7.5)
7.1.5 Postprocessing
Those candidates with a low absolute or relative intensity are removed. First, the pitch candidates with an intensity l_n < \gamma are discarded. The maximum note intensity L = \max_{\forall n}\{l_n\} at the target frame is then calculated to remove the candidates with l_n < \eta L, as the sources in the mixture should not have very large energy differences.7
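The pruning rule above can be sketched as follows; the Candidate structure and the function name are illustrative assumptions, not the thesis data structures.

// Remove candidates with intensity below the absolute threshold gamma or below
// a fraction eta of the loudest candidate in the frame.
#include <algorithm>
#include <vector>

struct Candidate { int pitch; double intensity; };

void pruneCandidates(std::vector<Candidate>& cands, double gamma, double eta) {
    if (cands.empty()) return;
    const double L = std::max_element(cands.begin(), cands.end(),
        [](const Candidate& a, const Candidate& b) {
            return a.intensity < b.intensity;
        })->intensity;
    cands.erase(std::remove_if(cands.begin(), cands.end(),
        [&](const Candidate& c) { return c.intensity < gamma || c.intensity < eta * L; }),
        cands.end());
}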
Finally, the frequencies of the selected candidates are converted to MIDI pitches with Eq. 2.21. Using this inter-onset based scheme, there are certain ambiguous situations that do not arise in a frame by frame analysis. If a pitch is detected in both the current and the previous inter-onset interval, then there are two possibilities: there exists a single note spanning both onsets, or there is a new note with the same pitch.
To make a simple differentiation between new notes and detections of pitches that were already sounding in the previous frames, the estimation is done at frames t_o + 1 and t_o − 1, where t_o is the onset frame. If a pitch detected at frame t_o + 1 is not detected at t_o − 1, then a new note is yielded. Otherwise, the note is considered to be a continuation of the previous estimate.
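The rule for telling new notes from continuations can be summarized in a small sketch; the names and the set-based representation of the detected pitches are illustrative assumptions.

// A pitch detected just after an onset (frame to+1) starts a new note unless
// it was already detected just before the onset (frame to-1).
#include <set>

enum class NoteEvent { NewNote, Continuation };

NoteEvent classifyPitch(int midiPitch,
                        const std::set<int>& pitchesBeforeOnset,   // frame to - 1
                        const std::set<int>& pitchesAfterOnset) {  // frame to + 1
    if (pitchesAfterOnset.count(midiPitch) != 0 &&
        pitchesBeforeOnset.count(midiPitch) == 0)
        return NoteEvent::NewNote;       // not sounding before the onset
    return NoteEvent::Continuation;      // pitch carried over from the previous interval
}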
7.2 Joint estimation method I
Spectral smoothness has also been used in different ways in the literature (Klapuri, 2003a; Yeh et al., 2005; Cañadas-Quesada et al., 2008; Zhou et al., 2009). The proposed novel smoothness measure is based on the convolution of the hypothetical harmonic pattern with a Gaussian window.
Given a combination, the HPS of each candidate is calculated considering the
harmonic interactions with the partials of all the candidates in the combination.
The overlapped partials are first identified, and their amplitudes are estimated
by linear interpolation using the non-overlapped harmonic amplitudes.
In contrast with the previous iterative cancellation method, which assumes a constant harmonic pattern, the proposed joint approach can estimate hypothetical harmonic patterns from the spectral data, evaluating them according to the properties of harmonic sounds. This makes the approach suitable for most real harmonic sounds, whereas the iterative method assumes a constant pattern based on percussive string instruments.
7.2.1 Preprocessing
In the preprocessing stage, the STFT is computed using a 93 ms Hanning-windowed frame with a 9.28 ms hop size. The frame overlap ratio may seem high from a practical point of view, but it was required to compare the method with other works in the MIREX (2007) evaluation contest (see Sec. 7.4.3). As in the iterative cancellation method, zero padding has been used to get a more precise estimation of the lower frequencies.
SLM has not been used in the joint estimation approaches. Experimentally, including SLM in the iterative cancellation algorithm did not improve the results (see Sec. 7.4.1 for details), so it was removed in the joint estimation methods.
10 As the candidates are spectral peaks, timbres with missing fundamental are not
N = \sum_{n=1}^{P} \binom{F}{n} = \sum_{n=1}^{P} \frac{F!}{n!(F-n)!}    (7.6)
This means that when the maximum polyphony is P = 6 and there are
F = 10 selected candidates, N = 847 combinations are generated. Therefore,
N combinations are evaluated at each frame, and the adequate selection of F
and P is critical for the computational efficiency of the algorithm.
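The example value N = 847 can be verified with a short program; the helper function below is illustrative only.

// Evaluate Eq. 7.6 for F = 10 candidates and maximum polyphony P = 6.
#include <cstdio>

unsigned long long binomial(int n, int k) {
    unsigned long long r = 1;
    for (int i = 1; i <= k; ++i)
        r = r * static_cast<unsigned long long>(n - k + i) / i;
    return r;
}

int main() {
    const int F = 10, P = 6;
    unsigned long long N = 0;
    for (int n = 1; n <= P; ++n) N += binomial(F, n);
    std::printf("N = %llu\n", N);   // prints N = 847
    return 0;
}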
[Figure 7.4 — panels: Spectral peaks; Partial identification; Linear subtraction. Amplitude A versus frequency f for two candidates f1 and f2, yielding HPS(f1) and HPS(f2).]
The amplitudes of the non-overlapped partials are directly assigned to the HPS. However, the contribution of each source to an overlapped partial amplitude must be estimated. This can be done using the amplitudes of non-overlapped neighbor partials (Klapuri, 2003a; Yeh et al., 2005; Every and Szymanski, 2006), assuming smooth spectral envelopes, or considering that the amplitude envelopes of different partials are correlated in time (Woodruff et al., 2008).
In the proposed method, similarly to (Maher, 1990) and (Yeh et al., 2005),
the amplitudes of overlapped partials are estimated by linear interpolation of
the neighboring non-overlapped partials (see Fig. 7.4).
If there are more than two consecutive overlapped partials, then the
interpolation is done the same way with the non-overlapped values. For instance,
if harmonics 2 and 3 are overlapped, then the amplitudes of harmonics 1 and 4
are used to estimate them by linear interpolation.
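The interpolation rule can be sketched as follows, assuming that the overlapped harmonics of a candidate have already been identified; the function and variable names are illustrative.

// Estimate the amplitude of each overlapped harmonic from the nearest
// non-overlapped harmonics on each side, by linear interpolation.
#include <vector>

void interpolateOverlapped(std::vector<double>& hps,
                           const std::vector<bool>& overlapped) {
    const int H = static_cast<int>(hps.size());
    for (int h = 0; h < H; ++h) {
        if (!overlapped[h]) continue;
        int lo = h - 1, hi = h + 1;
        while (lo >= 0 && overlapped[lo]) --lo;   // previous non-overlapped harmonic
        while (hi < H && overlapped[hi]) ++hi;    // next non-overlapped harmonic
        if (lo >= 0 && hi < H) {
            const double t = static_cast<double>(h - lo) / static_cast<double>(hi - lo);
            hps[h] = (1.0 - t) * hps[lo] + t * hps[hi];
        } else if (lo >= 0) {
            hps[h] = hps[lo];                     // no right neighbor: hold the left value
        } else if (hi < H) {
            hps[h] = hps[hi];                     // no left neighbor: hold the right value
        }
    }
}

With this rule, overlapped harmonics 2 and 3 of a candidate would be estimated from harmonics 1 and 4, as in the example above.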
As in other works, the method also assumes that a smooth spectral pattern is more probable than an irregular one. To compute the smoothness σ of a candidate, the HPS is first normalized by dividing the amplitudes by the maximum harmonic value in the HPS, obtaining p̄. Then, p̄ is low-pass filtered using a truncated normalized Gaussian window N_{0,1}, which is convolved with the HPS to obtain the smoothed version p̃:

\tilde{p}_c = N_{0,1} * \bar{p}_c    (7.9)

Only three components were chosen for the Gaussian window, N = {0.21, 0.58, 0.21}, due to the reduced size of the HPS.11
11 Usually, only the first harmonics contain most of the energy of a harmonic source, therefore
Figure 7.5: Spectral smoothness measure example. The normalized HPS vector p̄ and the smoothed version p̃ of two candidates (c1, c2) are shown. Sharpness values are s(c1) = 0.13 and s(c2) = 1.23.
s(c) = \sum_{h=1}^{H} |\tilde{p}_{c,h} - \bar{p}_{c,h}|    (7.10)

\bar{s}(c) = \frac{s(c)}{1 - N_{0,1}(\bar{x})}    (7.11)

And finally, the smoothness σ(c) ∈ [0, 1] of an HPS is calculated as:

\sigma(c) = 1 - \frac{\bar{s}(c)}{H_c}    (7.12)
where H_c is the index of the last harmonic found for the candidate. This parameter was introduced to prevent high-frequency candidates, which have fewer partials than those at low frequencies, from obtaining a higher smoothness. This way, the smoothness is considered to be more reliable when there are more partials to estimate it from.
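Under the definitions of Eqs. 7.9-7.12, the smoothness of a candidate can be sketched as below; the truncated three-point Gaussian window is the one given in the text, while the function name and the handling of the HPS boundaries are illustrative assumptions.

// Smoothness of one candidate: normalize the HPS, low-pass it with the
// truncated Gaussian {0.21, 0.58, 0.21}, measure the sharpness, and map it
// to a smoothness value that accounts for the number of harmonics found.
#include <algorithm>
#include <cmath>
#include <vector>

double smoothness(const std::vector<double>& hps) {
    const int H = static_cast<int>(hps.size());   // Hc: harmonics found for the candidate
    if (H == 0) return 0.0;
    const double maxA = *std::max_element(hps.begin(), hps.end());
    if (maxA <= 0.0) return 0.0;

    std::vector<double> p(H), pt(H);
    for (int h = 0; h < H; ++h) p[h] = hps[h] / maxA;          // normalized HPS (p bar)

    const double g[3] = {0.21, 0.58, 0.21};                    // truncated Gaussian window
    for (int h = 0; h < H; ++h) {
        double acc = 0.0;
        for (int j = -1; j <= 1; ++j)
            if (h + j >= 0 && h + j < H) acc += g[j + 1] * p[h + j];
        pt[h] = acc;                                           // smoothed HPS (p tilde), Eq. 7.9
    }

    double s = 0.0;
    for (int h = 0; h < H; ++h) s += std::fabs(pt[h] - p[h]);  // sharpness, Eq. 7.10
    const double sNorm = s / (1.0 - g[1]);                     // Eq. 7.11
    return 1.0 - sNorm / H;                                    // Eq. 7.12
}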
Once the smoothness and the intensity of each candidate have been
calculated, the salience S(Ci ) of a combination Ci with C candidates is:
S(C_i(t)) = \sum_{c=1}^{C} \left[ l(c) \cdot \sigma^{\kappa}(c) \right]^2    (7.13)
7.2.6 Postprocessing
After selecting the best combination at each individual frame, a last stage is
applied to remove some local errors taking into account the temporal dimension.
If a pitch was not detected in a target frame but it was found in the previous
and next frames, it is considered to be active in the current frame too, avoiding
some temporal discontinuities. Notes shorter than a minimum duration d are
also removed.
Finally, the sequences of consecutive detected fundamental frequencies are converted into MIDI pitches. The maximum candidate intensity over the entire song is used as a reference to obtain the MIDI velocities, linearly mapping the candidate intensities from the range [0, \max_{\forall C,c}\{l(c)\}] into the MIDI range [0, 127].
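The velocity mapping can be written as a simple linear scaling; the function name below is illustrative.

// Linearly map a candidate intensity in [0, maxIntensity] to a MIDI velocity in [0, 127].
#include <algorithm>

int intensityToVelocity(double intensity, double maxIntensity) {
    if (maxIntensity <= 0.0) return 0;
    const int v = static_cast<int>(127.0 * intensity / maxIntensity + 0.5);
    return std::max(0, std::min(127, v));
}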
7.3 Joint estimation method II
Estimating the pitches present in a single frame of a complex mixture is a difficult task, even for expert musicians. As discussed in Sec. 3.1, context is very important in music to disambiguate certain situations. The joint estimation method II is an extension of the previous method that considers information from adjacent frames, similarly to the supervised learning method described in Chapter 6, producing a detection that is smoothed across time.
\tilde{S}(C'_i(t)) = \sum_{j=t-K}^{t+K} S(C'_i(j))    (7.14)
This way, the saliences of the combinations with the same pitches as C'_i in the K adjacent frames are summed to obtain the salience at the target frame, as shown in Fig. 7.6. The combination with maximum salience is finally selected to yield the pitches at the target frame t.
C'(t) = \arg\max_{i} \{\tilde{S}(C'_i(t))\}    (7.15)
This new approach increases the robustness of the system on the data set used for evaluation, and it makes it possible to remove the minimum amplitude ε required for a peak to be a candidate, which was added in the previous approach to avoid local false positives. If the selected combination at the target frame does not contain any pitch (either because there is no candidate or because none of them can be identified as a pitch), then a rest is yielded without evaluating the combinations in the K adjacent frames.
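A simplified sketch of the merged salience of Eqs. 7.14-7.15 is given below. It assumes that each frame stores the saliences of its combinations keyed by their pitch sets; all names and container choices are illustrative and not taken from the thesis code.

// Select the combination with maximum merged salience at frame t, where the
// merged salience sums the saliences of the same pitch set in the K previous
// and K following frames.
#include <algorithm>
#include <map>
#include <set>
#include <vector>

using Combination    = std::set<int>;                    // MIDI pitches of a combination
using FrameSaliences = std::map<Combination, double>;    // combinations found at one frame

Combination bestCombination(const std::vector<FrameSaliences>& frames, int t, int K) {
    const int first = std::max(0, t - K);
    const int last  = std::min(static_cast<int>(frames.size()) - 1, t + K);
    Combination best;
    double bestSalience = -1.0;
    for (const auto& entry : frames[t]) {
        const Combination& combo = entry.first;
        double merged = 0.0;                              // Eq. 7.14
        for (int j = first; j <= last; ++j) {
            const auto it = frames[j].find(combo);        // same pitches in frame j
            if (it != frames[j].end()) merged += it->second;
        }
        if (merged > bestSalience) { bestSalience = merged; best = combo; }
    }
    return best;                                          // Eq. 7.15
}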
This technique smooths the detection in the temporal dimension. For a visual example, let us consider the smoothed intensity of a given candidate c' as:

\tilde{l}(c'(t)) = \sum_{j=t-K}^{t+K} l(c'(j))    (7.16)
[Figure 7.6: example of merged combination saliences; e.g., S = 2000 for {C3, G4} and S = 1800 for {C3} at the target frame, and merged values S̃ = 3200 for {C3, G4}, S̃ = 1000 for {G3, E3} and S̃ = 340 for {E3, G4}.]
Figure 7.7: Top: Example of detected piano-roll for an oboe melody. Bottom:
Three-dimensional temporal representation of ˜l(c0 (t)) for the candidates of the
winner combination at each frame. In this example, all the pitches were correctly
detected. High temporal smoothness usually indicates good estimates.
When the temporal evolution of the smoothed intensities \tilde{l}(c'(t)) of the winner combination candidates is plotted in a three-dimensional representation (see Figs. 7.7 and 7.8), it can be seen that the correct estimates usually show smooth temporal curves. An abrupt change (a sudden note onset or offset, represented by a vertical cut in the smoothed intensities 3D plot) means that the harmonic components of a given candidate were suddenly assigned to another candidate in the next frame. Therefore, vertical lines in the plot usually indicate errors in the mapping of harmonic components to candidates.
Figure 7.9: Partial selection in the joint estimation method II. The selected
peak is the one with the greatest weighted value.
value is selected as a partial. The advantage of this scheme is that low-amplitude peaks are penalized and, besides the harmonic spectral location, intensity is also considered to identify the most prominent spectral peaks as partials.
w(v_i, v_j) = \frac{D(v_i, v_j)}{S(v_j) + 1}    (7.17)

where S(v_j) is the salience of the combination at vertex v_j and D(v_i, v_j) is a similarity measure for two combinations v_i and v_j, corresponding to the sum of the absolute differences between the intensities of all the candidates in both combinations:

D(v_i, v_j) = \sum_{\forall c \in v_i \cap v_j} |\tilde{l}(v_{i,c}) - \tilde{l}(v_{j,c})| + \sum_{\forall c \in v_i - v_j} \tilde{l}(v_{i,c}) + \sum_{\forall c \in v_j - v_i} \tilde{l}(v_{j,c})    (7.18)
[Figure: fragment of the weighted directed acyclic graph (wDAG). Vertices are pitch combinations at consecutive frames, such as {C3, C5}, {C3, C4} and {C2, G4}, and the edges carry the transition weights w(v_i, v_j).]
Using this scheme, the transition weight between two combinations considers
the salience of the target combination and the differences between the candidate
intensities.
Finally, the shortest path12 across the wDAG is found using the Dijkstra
(1959) algorithm13 . The vertices that belong to the shortest path are the winner
combinations yielded at each time frame.
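Since the graph is a weighted DAG whose vertices are ordered in time, the shortest path can also be obtained with a simple forward relaxation instead of a general-purpose Dijkstra implementation; the sketch below illustrates this under the assumption that vertices are numbered in temporal (topological) order. Its names are not those of the thesis code, which uses the Boost library.

// Shortest path in a wDAG by relaxing the outgoing edges of each vertex in
// topological order. adj[u] lists (target vertex, edge weight) pairs.
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

std::vector<int> shortestPathDAG(
        const std::vector<std::vector<std::pair<int, double>>>& adj,
        int source, int sink) {
    const int n = static_cast<int>(adj.size());
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> dist(n, inf);
    std::vector<int> prev(n, -1);
    dist[source] = 0.0;
    for (int u = 0; u < n; ++u) {
        if (dist[u] == inf) continue;
        for (std::size_t e = 0; e < adj[u].size(); ++e) {
            const int v = adj[u][e].first;
            const double w = adj[u][e].second;
            if (dist[u] + w < dist[v]) { dist[v] = dist[u] + w; prev[v] = u; }
        }
    }
    std::vector<int> path;                      // vertices of the winning path,
    for (int v = sink; v != -1; v = prev[v])    // one combination per frame
        path.insert(path.begin(), v);
    return path;
}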
Three alternative analysis architectures can be considered for this method:

1. Frame by frame analysis. All the frames are analyzed to yield the estimates. This is the basic scheme of the joint estimation methods previously described.
12 The path that minimizes the sum of the weights from the starting node to the final state.
13 The Boost C++ library, available at http://www.boost.org, was used for this task.
2. To detect onsets and analyze only one frame between two onsets to yield the pitches in the inter-onset interval. This scheme, used in the iterative estimation method, increases the efficiency at an accuracy cost. The method relies on the onset detection results, therefore a wrong estimate in the onset detection stage can affect the results.

3. To detect onsets and merge the combinations of those frames that are between two consecutive onsets, yielding the pitches for the inter-onset interval. This technique can obtain more reliable results when the onsets are correctly estimated, as happens with piano sounds. However, merging the combinations of the frames between two onsets reduces the number of detected notes, as only combinations that are present in most of the IOI frames are considered. As in the previous scheme, the detection is very sensitive to false negative onsets.
Dixon (2006).
15 Like it occurs in Fig. 1.1.
7.4 Evaluation
To perform a first evaluation and set up the parameters, initial experiments were
done using a data set of random mixtures. Then, the three proposed approaches
were evaluated and compared with other works for real music transcription in
the MIREX (2007) and MIREX (2008) multiple f0 estimation and tracking
contests.
7.4.1 Parametrization
The parameters of the three proposed methods and their impact on the results are analyzed in this section. The intention in the parametrization stage is not to obtain the parameter values that maximize the accuracy for the test set used, as the success rate depends on these particular data. However, this stage can help to obtain a reasonably good set of parameter values and to evaluate the impact of each parameter on the accuracy and the computational cost. Therefore, the selected parameters are not always those that achieve the highest accuracy on the test set, but those that obtain a close-to-best accuracy while keeping a low computational cost.
For the parametrization stage, a database of random pitch combinations has been used. This database was generated using mixtures of musical instrument samples with fundamental frequencies ranging between 40 and 2100 Hz. The samples are the same as those used in the evaluation of the Klapuri (2006b) method. The data set consists of 4000 mixtures with polyphony16 1, 2, 4, and 6. The 2842 audio samples from 32 musical instruments used to generate the mixtures are from the McGill University master samples collection17, the University of Iowa18, IRCAM studio online19, and recordings of an acoustic guitar. In order to respect the copyright restrictions, only the first 185 ms of each mixture20 were used for evaluation.
It is important to note that the data set only contains isolated pitch combinations, therefore the parameters that have a temporal dimension (such as the minimum note duration) could not be evaluated using this database. The test set is intended for the evaluation of multiple f0 estimation at single frames, therefore the f0 tracking of the joint estimation method II could not be evaluated with these data either.
To evaluate the parameters of the iterative cancellation method and the joint estimation method I, a single frame located 43 ms after the beginning of the mixture has been selected. For the joint estimation method II, which requires more frames for merging combinations, all the frames (5) have been used to select the best combination in the mixture.

16 There are 1000 mixtures for each polyphony.
17 http://www.music.mcgill.ca/resources/mums/html/index.htm
18 http://theremin.music.uiowa.edu/MIS.html
19 http://forumnet.ircam.fr/402.html?&L=1
20 Thanks to A. Klapuri for providing this reduced data set for evaluation.
The accuracy metric (Eq. 3.6) has been chosen as the success rate criterion for parametrization. A candidate identification error rate was also defined for adjusting the parameters related to the candidate selection stage. This error rate is defined as the number of actual pitches that are not present in the candidate set divided by the number of actual pitches.
The overall results for the three methods using the random mixtures data set and the selected parameters are described in Sec. 7.4.2, and the results of the comparison with other multiple f0 estimation approaches are detailed in Secs. 7.4.3 and 7.4.4.
The parameters chosen for the iterative cancellation method are shown in
Tab. 7.1.
In the SLM analysis (see page 121), the threshold τ was set to a very low value, as it is preferable to keep a noise peak than to discard a partial in the preprocessing stage. Different bandwidth values for the SLM were tested to find the optimal bandwidth W. However, the use of SLM did not improve the accuracy with respect to the systematic selection of all the spectral peaks (see Fig. 7.11). This can be partially explained because the test set contains only harmonic components, making it unnecessary to discard spurious peaks. Besides this reason, the SLM method assumes that there are no two sinusoidal components closer than W. In some cases, this assumption does not hold in real polyphonic signals, where typical values of W ∈ [10, 50] Hz exceed some partial frequency differences. Therefore, the SLM stage was removed to obtain the results described in Sec. 7.4.2, and the spectral peaks have been
systematically selected from the magnitude spectrum instead, as SLM improved neither the accuracy nor the efficiency with the tested values.

Figure 7.11: SLM accuracy with respect to the bandwidth W using τ = 0.1, and comparison with simple spectral peak picking. The other parameters used for the evaluation are those described in Tab. 7.1.
Experimentally, the use of all the spectral peaks yielded exactly the same results as the selection of only those spectral peaks with a magnitude over a low fixed threshold µ = 0.1. This thresholding, which did not alter the results on the test set, can reduce the computation time of the overall system by half.21 For this reason, this threshold was adopted and subsequently included in the joint estimation methods.
The overall results without SLM can be seen in Fig. 7.12. In this figure, the chosen parameter values are at the central intersection, and they correspond to those described in Tab. 7.1. From these initial values, the parameters have been changed individually to compare their impact on the accuracy.
The zero padding factor z is useful to accurately identify the frequency of the lower pitches. As shown in Fig. 7.12, the overall accuracy increases notably when zero padding is used22 (z ≠ 2^0). The computational cost derived from the FFT computation of longer windows must also be taken into account. As the overall computational cost of this method is very low, a value z = 8, which slightly improves the accuracy, was chosen.
The range of valid fundamental frequencies comprises the f0 range of the
data set used for the evaluation, therefore it is the same for all the evaluated
methods.
The closest pitch distance value matches the spectral resolution obtained
with zero padding. This way, using a margin of fd = 3 Hz, only spectral
peaks at ±1 bin from the ideal pitch frequency are considered as f0 candidates. This parameter increases the accuracy by about 1%. However, as can be seen in Fig. 7.12, the value selected for this parameter (fd = 3) is probably too restrictive, and a wider margin (fd = 5) yields better results. It must be considered that the iterative cancellation method was developed before the random mixtures database was available, therefore its parameters are not optimally tuned for this data set. However, experimentally, the accuracy deviation shows that the chosen values do not differ much from those that approach the highest accuracy on this data set.

21 Experimentally, the running time was reduced from 146.05 to 73.4 seconds.
22 Due to the FFT constraints, only power of 2 values for z have been tested.

[Figure 7.12: accuracy of the iterative cancellation method when individually varying the parameters around the chosen values; axis endpoints shown: z ∈ [2^0, 2^3], fd ∈ [1, 8] Hz, γ ∈ [0, 9], η ∈ [0, 0.25].]
The parameters used for the joint estimation method I are shown in Tab. 7.2, and their impact on the accuracy when they are varied can be seen in Fig. 7.14.
A peak picking threshold µ = 0.1 was chosen, as in the iterative cancellation method. This parameter increased the efficiency with a very low accuracy cost.24
As in the iterative cancellation method, the zero padding factor has proven to be very relevant for increasing the accuracy (see Fig. 7.14). A trade-off value z = 4, instead of the z = 8 used in the iterative cancellation method, was chosen to avoid significantly increasing the computational cost (see Fig. 7.15), which is higher in this method.
The minimum f0 bin amplitude ε = 2 slightly increases the accuracy and
decreases the candidate selection error (see Fig. 7.13). A higher accuracy was
obtained with ε = 5, but note that this parameter must have a lower value
for the analysis of real musical signals25 , so this more conservative value was
selected instead.
23 It is assumed that, with lower intensities, notes are usually masked by the amplitude of
the other pitches in the mixture, and they can hardly be perceived.
24 Experimentally, in this method the accuracy decreased from 0.553 to 0.548.
25 Unlike in real music, in the test set all the signals had similar and high amplitudes.
Figure 7.13: Joint estimation method I candidate error rate when adjusting the parameters that have some influence on the candidate selection stage.
Figure 7.14: Joint estimation method I accuracy adjusting the free parameters.
[Figure 7.15: computational cost of the joint estimation method I when varying z, F and H.]
The bandwidth for searching partials, fr, does not seem to have a great impact on the accuracy, but it is important in the candidate selection stage (see Fig. 7.13). An appropriate balance between a high accuracy and a low candidate selection error rate was obtained using fr = 11 Hz.
The computational cost increases exponentially with the number of candi-
dates F . Therefore, a good choice of F is critical for the efficiency of the method.
Experimentally, F = 10 yielded a good trade-off between the accuracy, the
number of correctly selected candidates and the computational cost (see Figs.
7.13, 7.14 and 7.15).
As previously mentioned, the first partials usually contain most of the energy
of the harmonic sounds. Experimentally, using H = 10 suffices, and higher
values cause low pitches to cancel other higher frequency components. In
addition, note that the computational cost linearly increases with respect to
H.
The smoothness weight that maximizes the accuracy was experimentally found to be κ = 2. It is important to note that without considering spectral smoothing (κ = 0), the accuracy decreases significantly (see Fig. 7.14). The postprocessing parameters γ and η were set to the same values as in the iterative cancellation approach.
The parameters selected for the joint estimation method II are shown in Tab. 7.3. Most of them are the same as in method I, except for H, η and κ, which yielded better results with slightly different values (see Fig. 7.16). In the case of H and κ, the values that maximized the accuracy were selected.
As in the joint estimation method I, the parameter η = 0.15 has been set to a conservative value in order to prevent the system from performing worse on real musical signals, which usually do not have very similar intensities for the different sounds.
The postprocessing parameters cannot be directly evaluated with this data set, as they have a temporal dimension and each mixture is composed of a single combination of pitches. However, a value K = 2 (considering 2 previous frames, 2 subsequent frames and the target frame) has proven to be adequate for the analysis of real musical signals outside the data set. This value provides a notable temporal smoothness without significantly altering the temporal resolution required for short notes.
[Figure 7.16: accuracy of the joint estimation method II when varying H, κ and η.]
Figure 7.17: Candidate identification error rate with respect to the polyphony
(1, 2, 4 and 6 simultaneous pitches) of the ground truth mixtures.
Figure 7.19: Pitch detection results for the iterative cancellation method with
respect to the ground-truth mixtures polyphony.
Figure 7.20: Pitch detection results for the joint estimation method I with
respect to the ground-truth mixtures polyphony.
Figure 7.21: Pitch detection results for the joint estimation method II with
respect to the ground-truth mixtures polyphony.
Figure 7.22: Comparison of the global pitch detection results for the three
methods.
Figure 7.28: Precision, recall and accuracy of the iterative cancellation method as a function of the MIDI pitch number.
Figure 7.29: Precision, recall and accuracy of the joint estimation method I as a function of the MIDI pitch number.
Figure 7.30: Precision, recall and accuracy of the joint estimation method II as a function of the MIDI pitch number.
Table 7.4: MIREX (2007) note tracking runtimes. Participant, running time
(in seconds), and machine where the evaluation was performed are shown.
The joint estimation method I was submitted for evaluation in MIREX (2007)
frame by frame and note tracking contests with the parameters specified in
Tab. 7.2.
The results for the frame by frame analysis are shown in Tab. 7.6, and the corresponding runtimes in Tab. 7.7. The accuracy of this method was close28 to the highest accuracy among the evaluated methods, being the one with the highest precision and the lowest Etot error. The precision, recall and accuracy were slightly better than those obtained with the random mixtures database. This is probably because the random mixtures database spans a pitch range wider than the MIREX data set, and very low and very high frequency pitches are the hardest to detect (see Fig. 7.29).
27 Evaluation for onset, offset and pitch were also done in MIREX (2007), but results are
not reported in this work, as the iterative estimation system does not consider offsets.
28 About 0.025 lower.
id Participant Method Avg. F-m Prec Rec Avg. Overlap
RK Ryynänen and Klapuri (2005) Iterative cancellation + HMM tracking 0.614 0.578 0.678 0.699
EV4 Vincent et al. (2007) Unsupervised learning (NMF) 0.527 0.447 0.692 0.636
PE2 Poliner and Ellis (2007a) Supervised learning (SVM) 0.485 0.533 0.485 0.740
EV3 Vincent et al. (2007) Unsupervised learning (NMF) 0.453 0.412 0.554 0.622
PI2 Pertusa and Iñesta (2008a) Joint estimation method I 0.408 0.371 0.474 0.665
KE4 Kameoka et al. (2007) Statistical spectral models 0.268 0.263 0.301 0.557
KE3 Kameoka et al. (2007) Statistical spectral models 0.246 0.216 0.323 0.610
PI3 Lidy et al. (2007) Iterative cancellation 0.219 0.203 0.296 0.628
VE2 Emiya et al. (2007, 2008b) Joint estimation + Bayesian models 0.202 0.338 0.171 0.486
AC4 Cont (2007) Unsupervised learning (NMF) 0.093 0.070 0.172 0.536
AC3 Cont (2007) Unsupervised learning (NMF) 0.087 0.067 0.137 0.523
Table 7.5: MIREX (2007) note tracking results based on onset and pitch. Average F-measure, precision, recall, and average
overlap are shown for each participant.
id Participant Method Acc Prec Rec Etot Esubs Emiss Ef a
RK Ryynänen and Klapuri (2005) Iterative cancellation + HMM tracking 0.605 0.690 0.709 0.474 0.158 0.133 0.183
CY Yeh (2008) Joint estimation 0.589 0.765 0.655 0.460 0.108 0.238 0.115
ZR Zhou et al. (2009) Salience function (RTFI) 0.582 0.710 0.661 0.498 0.141 0.197 0.160
PI1 Pertusa and Iñesta (2008a) Joint estimation method I 0.580 0.827 0.608 0.445 0.094 0.298 0.053
EV2 Vincent et al. (2007) Unsupervised learning (NMF) 0.543 0.687 0.625 0.538 0.135 0.240 0.163
CC1 Cao et al. (2007) Iterative cancellation 0.510 0.567 0.671 0.685 0.200 0.128 0.356
SR Raczynski et al. (2007) Unsupervised learning (NNMA) 0.484 0.614 0.595 0.670 0.185 0.219 0.265
EV1 Vincent et al. (2007) Unsupervised learning (NMF) 0.466 0.659 0.513 0.594 0.171 0.371 0.107
PE1 Poliner and Ellis (2007a) Supervised learning (SVM) 0.444 0.734 0.505 0.639 0.120 0.375 0.144
PL Leveau (2007) Matching pursuit 0.394 0.689 0.417 0.639 0.151 0.432 0.055
CC2 Cao et al. (2007) Iterative cancellation 0.359 0.359 0.767 1.678 0.232 0.001 1.445
KE2 Kameoka et al. (2007) Statistical spectral models (HTC) 0.336 0.348 0.546 1.188 0.401 0.052 0.734
KE1 Kameoka et al. (2007) Statistical spectral models (HTC) 0.327 0.335 0.618 1.427 0.339 0.046 1.042
AC2 Cont (2007) Unsupervised learning (NMF) 0.311 0.373 0.431 0.990 0.348 0.221 0.421
AC1 Cont (2007) Unsupervised learning (NMF) 0.277 0.298 0.530 1.444 0.332 0.138 0.974
VE Emiya et al. (2007, 2008b) Joint estimation + Bayesian models 0.145 0.530 0.157 0.957 0.070 0.767 0.120
Table 7.6: MIREX (2007) frame by frame evaluation results. Accuracy, precision, recall, and the error metrics proposed by
Poliner and Ellis (2007a) are shown for each participant.
Table 7.7: MIREX (2007) frame by frame runtimes. The first column shows
the participant, the second is the runtime and the third column is the machine
where the evaluation was performed. ALE Nodes was the fastest machine.
The method was also evaluated in the note tracking contest. Although it was not designed for this task, as the analysis is performed without information from neighboring frames, simply converting consecutive pitch detections into notes, the results were not bad, as shown in Tab. 7.5.
The joint estimation method I was also very efficient with respect to the other state-of-the-art methods presented (see Tab. 7.7), especially considering that it is a joint estimation approach.
The joint estimation method II was submitted to MIREX (2008) for frame by
frame and note tracking evaluation. The method was presented for both tasks
in two setups: with and without f0 tracking.
The difference between using f0 tracking or not is the postprocessing stage
(see Tab. 7.3). In the first setup, notes shorter than a minimum duration are
just removed, and when there are short rests between two consecutive notes
of the same pitch, the notes are merged. Using f0 tracking, the methodology
described in Sec. 7.3.3 is performed instead, increasing the temporal coherence
of the estimate with the wDAG.
Experimentally, the joint estimation method II was very efficient compared
to the other approaches presented, as shown in Tabs. 7.8 and 7.9.
The results for the frame by frame task can be seen in Tab. 7.10. The
accuracy for the joint estimation method II without f0 tracking is satisfactory,
and the method obtained the highest precision and the lowest Etot error among all the analyzed approaches.

Table 7.8: MIREX (2008) frame by frame runtimes. Participants and runtimes are shown. All the methods except MG were evaluated using the same machine.

Table 7.9: MIREX (2008) note tracking runtimes. Participants and runtimes are shown. All the methods except ZR were evaluated using the same machine.
The inclusion of f0 tracking did not improve the results for frame by
frame estimation, but in the note tracking task (see Tab. 7.11), the results
outperformed those obtained without tracking.
id Participant Method Acc Prec Rec Etot Esubs Emiss Ef a
YRC2 Yeh et al. (2008) Joint estimation + f0 tracking 0.665 0.741 0.780 0.426 0.108 0.127 0.190
YRC1 Yeh et al. (2008) Joint estimation 0.619 0.698 0.741 0.477 0.129 0.129 0.218
PI2 Pertusa and Iñesta (2008b) Joint estimation II 0.618 0.832 0.647 0.406 0.096 0.257 0.053
RK Ryynänen and Klapuri (2005) Iterative cancellation + HMM tracking 0.613 0.698 0.719 0.464 0.151 0.130 0.183
PI1 Pertusa and Iñesta (2008b) Joint estimation II + tracking 0.596 0.824 0.625 0.429 0.101 0.275 0.053
VBB Vincent et al. (2007) Unsupervised learning (NMF) 0.540 0.714 0.615 0.544 0.118 0.267 0.159
DRD Durrieu et al. (2008) Iterative cancellation 0.495 0.541 0.660 0.731 0.245 0.096 0.391
CL2 Cao and Li (2008) Iterative cancellation 0.487 0.671 0.560 0.598 0.148 0.292 0.158
EOS Egashira et al. (2008) Statistical spectral models (HTC) 0.467 0.591 0.546 0.649 0.210 0.244 0.194
EBD2 Emiya et al. (2008a) Joint estimation + Bayesian models 0.452 0.713 0.493 0.599 0.146 0.362 0.092
EBD1 Emiya et al. (2008a) Joint estimation + Bayesian models 0.447 0.674 0.498 0.629 0.161 0.341 0.127
MG Groble (2008) Database matching 0.427 0.481 0.570 0.816 0.298 0.133 0.385
CL1 Cao and Li (2008) Iterative cancellation 0.358 0.358 0.763 1.680 0.236 0.001 1.443
RFF1 Reis et al. (2008a) Supervised learning (genetic) 0.211 0.506 0.226 0.854 0.183 0.601 0.071
RFF2 Reis et al. (2008a) Supervised learning (genetic) 0.183 0.509 0.191 0.857 0.155 0.656 0.047
Table 7.10: MIREX (2008) frame by frame evaluation results. Accuracy, precision, recall, and the error metrics proposed by
Poliner and Ellis (2007a) are shown for each method.
id Participant Method Avg. F-m Prec Rec Avg. Overlap
YRC Yeh et al. (2008) Joint estimation + f0 tracking 0.355 0.307 0.442 0.890
RK Ryynänen and Klapuri (2005) Iterative cancellation + HMM tracking 0.337 0.312 0.382 0.884
ZR3 Zhou and Reiss (2008) Salience function (RTFI) 0.278 0.256 0.314 0.874
ZR2 Zhou and Reiss (2008) Salience function (RTFI) 0.263 0.236 0.306 0.874
ZR1 Zhou and Reiss (2008) Salience function (RTFI) 0.261 0.233 0.303 0.875
PI1 Pertusa and Iñesta (2008b) Joint estimation II + tracking 0.247 0.201 0.333 0.862
EOS Egashira et al. (2008) Statistical spectral models (HTC) 0.236 0.228 0.255 0.856
VBB Vincent et al. (2007) Unsupervised learning (NMF) 0.197 0.162 0.268 0.829
PI2 Pertusa and Iñesta (2008b) Joint estimation II 0.192 0.145 0.301 0.854
EBD1 Emiya et al. (2008a) Joint estimation + Bayesian models 0.176 0.165 0.200 0.865
EBD2 Emiya et al. (2008a) Joint estimation + Bayesian models 0.158 0.153 0.178 0.845
RFF2 Reis et al. (2008a) Supervised learning (genetic) 0.032 0.037 0.030 0.645
RFF1 Reis et al. (2008a) Supervised learning (genetic) 0.028 0.034 0.025 0.683
Table 7.11: MIREX (2008) note tracking results based on onset, offset, and pitch. Average F-measure, precision, recall, and
average overlap are shown for each method.
Figure 7.31: Fig. from Bay et al. (2009), showing Esubs , Emiss and Ef a for
all MIREX 2007 and MIREX 2008 multiple fundamental frequency estimation
methods ordered by Etot . PI2-08 is the joint estimation method II without
tracking, PI1-08 is the same method with tracking, and PI-07 is the joint
estimation method I.
Figure 7.32: Fig. from Bay et al. (2009). Precision, recall and overall accuracy for all MIREX 2007 and MIREX 2008 multiple
fundamental frequency estimation methods ordered by accuracy. PI2-08 is the joint estimation method II without tracking,
PI1-08 is the same method with tracking, and PI-07 is the joint estimation method I.
Figure 7.33: Fig. from Bay et al. (2009). Precision, recall, average F-measure
and average overlap based on note onset for MIREX 2007 and MIREX 2008 note
tracking subtask. PI2-08 is the joint estimation method II without tracking,
PI1-08 is the same method with tracking, PI1-07 is the joint estimation method
I and PI2-07 is the iterative cancellation method.
that most of the reported f0 were correct, but multiple f0 estimation algorithms
tend to under-report and miss many active f0 in the ground-truth.
While the proposed joint estimation methods I and II achieved the lowest Etot score, they produce very few false alarms compared to miss errors. On the other hand, the methods from Ryynänen and Klapuri (2005) and Yeh et al. (2008) have a better balance between precision and recall, as well as a good balance among the three error types, and as a result they obtained the highest accuracies in MIREX (2007) and MIREX (2008), respectively.
Citing Bay et al. (2009): "Inspecting the methods used and their performances, we can not make generalized claims as to what type of approach works best. In fact, statistical significance testing showed that the top three methods29 were not significantly different."
29 (Yeh et al., 2008; Pertusa and Iñesta, 2008b; Ryynänen and Klapuri, 2005).
7.5 Conclusions
In this chapter, three different signal processing methods have been proposed for multiple f0 estimation. Unlike the supervised learning approaches previously described, these signal processing schemes can be used to transcribe real music without any a priori knowledge of the sources.
The first method is based on iterative cancellation, and it is a simple
approach which is mainly intended for the transcription of piano sounds at a low
computational cost. For this reason, only one frame in an inter-onset interval
is analyzed, and the interaction between harmonic sources is not considered.
A fixed spectral pattern is used to subtract the harmonic components of each
candidate.
The joint estimation method I introduces a more complex methodology.
The spectral patterns are inferred from the analysis of different hypotheses
taking into account the interactions with the other sounds. The combination
of harmonic patterns that maximizes a criterion based on the sum of harmonic
amplitudes and spectral envelope smoothness is chosen at each frame.
The third method extends the previous joint estimation method by considering adjacent frames, adding temporal smoothing. This method can be complemented with an f0 tracking stage, using a weighted directed acyclic graph, to increase the temporal coherence of the detection.
The proposed methods have been evaluated and compared to other works.
The iterative cancellation approach, mainly intended for piano transcription, is
very efficient and it has been successfully used for genre classification and other
MIR tasks (Lidy et al., 2007) with computational cost restrictions.
The joint estimation methods obtained a high accuracy and the lowest Etot
among all the multiple f0 algorithms submitted in MIREX (2007) and MIREX
(2008). Although all possible combinations of candidates are evaluated at each
frame, the proposed approaches have a very low computational cost, showing
that it is possible to make an efficient joint estimation method.
The f0 tracking stage added to the joint estimation method II is probably too simple, and it should be replaced by a more reliable method in future work. For instance, the transition weights could be learned from a labeled data set, or a more complex f0 tracking method such as the high-order HMM scheme from Chang et al. (2008) could be used instead. Besides intensity, the centroid of an HPS should also show temporal coherence when it belongs to the same source, therefore this parameter could also be considered for tracking.
Using stochastic models, a probability can be assigned to each pitch in order to remove those that are less probable given their context. For example, in a melodic line it is very unlikely that a non-diatonic note appears two octaves higher or lower than its neighbours. Musical probabilities can be taken into account, as in (Ryynänen and Klapuri, 2005), to remove very improbable notes.
The adaptation to polyphonic music of the stochastic approach from Pérez-
Sancho (2009) is also planned as future work, in order to use it in the multiple
f0 estimation methods to obtain a musically coherent detection.
The evaluation and further research of the alternative architectures proposed for the joint estimation method II (see Sec. 7.3.4) are also left for future work.
8 Conclusions and future work
This work has addressed the automatic music transcription problem using
different strategies. Efficient novel methods have been proposed for onset
detection and multiple f0 estimation, using supervised learning and signal
processing techniques. The main contributions of this work can be summarized
in the following points:
• An extensive review of the state-of-the-art methods for onset detection and multiple f0 estimation. The latter methods have been classified into salience functions, iterative cancellation, joint estimation, supervised learning, unsupervised learning, matching pursuit, Bayesian models, statistical spectral models, blackboard systems, and database matching methods. An analysis of the strengths and limitations of each category has also been carried out.
8.2 Publications
Some contents of this thesis have been published in journals and conference
proceedings. Here is a list of publications in chronological order.
• Lidy, T., Rauber, A., Pertusa, A., and Iñesta, J. M. (2007). Improving
genre classification by combination of audio and symbolic descriptors using
a transcription system. In Proc. of the 8th International Conference
on Music Information Retrieval (ISMIR), pages 61-66, Vienna, Austria.
[Chapter 7]
• Lidy, T., Rauber, A., Pertusa, A., Ponce de León, P. J., and Iñesta, J.
M. (2008). Audio music classification using a combination of spectral,
timbral, rhythmic, temporal and symbolic features. In MIREX (2008),
audio genre classification contest, Philadelphia, PA. [Chapter 7]
• Lidy, T., Grecu, A., Rauber, A., Pertusa, A., Ponce de León, P. J., and
Iñesta, J. M. (2009). A multi-feature multi-classifier ensemble approach
for audio music classification. In MIREX (2009), audio genre classification
contest, Kobe, Japan. [Chapter 7]
A Summary in Spanish required by the regulations of the University of Alicante
Resumen
Agradecimientos
Antes de nada, me gustaría agradecer a todos los miembros del grupo de música
por ordenador de la Universidad de Alicante por proporcionar una excelente
atmósfera de trabajo. Especialmente, al coordinador del grupo y supervisor de
este trabajo, José Manuel Iñesta. Su incansable espíritu científico proporciona
un marco de trabajo excelente para inspirar las nuevas ideas que nos hacen
crecer y avanzar continuamente. Este trabajo no hubiera sido posible sin su
consejo y ayuda.
Escribir una tesis no es una tarea fácil sin la ayuda de mucha gente.
Primero, me gustaría agradecer a toda la plantilla de nuestro Grupo de
Reconocimiento de Formas e Inteligencia Artificial (GRFIA) y, en general,
a todo el Departamento de Lenguajes y Sistemas Informáticos (DLSI) de la
Universidad de Alicante. Mis estancias de investigación con el Audio Research
Group, Tampere University of Technology (Tampere), Music Technology Group
(MTG), Universitat Pompeu Fabra (Barcelona) y Department of Software
Technology and Interactive Systems, Vienna University of Technology (Viena),
también han contribuido notablemente a la realización de este trabajo. He
crecido mucho, como cientı́fico y como persona, aprendiendo de los integrantes
de estos centros de investigación.
También me gustarı́a agradecer a la gente que ha contribuido directamente a
este trabajo. A Francisco Moreno, por retrasar algunas de mis responsabilidades
docentes durante la escritura de este documento y por proporcionar el código
de los algoritmos de k vecinos más cercanos. He aprendido la mayoría de
las técnicas que conozco para transcripción musical de Anssi Klapuri. Estaré
eternamente agradecido por los grandes momentos que pasé en Tampere y por su
generosa acogida. Anssi ha contribuido directamente a esta tesis proporcionando
las bases para el código de similitud sinusoidal y la base de datos de acordes
aleatorios que han posibilitado la evaluación y la mejora de los algoritmos
1 - Introducción6
La transcripción musical automática consiste en extraer las notas que están
sonando (la partitura) a partir de una señal de audio digital. En el caso de la
transcripción polifónica, se parte de señales de audio que pueden contener varias
notas sonando simultáneamente.
Una partitura es una guía para interpretar información musical, y por tanto
puede representarse de distintas maneras. La representación más extendida
es la notación moderna usada en música tonal occidental. Para extraer una
representación comprensible en dicha notación, además de las notas, sus tiempos
de inicio y sus duraciones, es necesario indicar el tempo, la tonalidad y la
métrica.
La aplicación más obvia de la extracción de la partitura es ayudar a un
músico a escribir la notación musical a partir del sonido, lo cual es una tarea
complicada cuando se hace a mano. Además de esta aplicación, la transcripción
automática también es útil para otras tareas de recuperación de información
musical, como detección de plagios, identificación de autor, clasificación de
género, y asistencia a la composición cambiando la instrumentación o las notas
para generar nuevas piezas musicales a partir de una ya existente. En general,
estos algoritmos también pueden proporcionar información sobre las notas para aplicar métodos que trabajan sobre música simbólica.

1 Código TIN2006-14932-C02
2 Código CSD2007-00018
3 Código TIC2000-1703-CO3-02
4 Código TIC2003-08496-C04
6 Introduction.
La transcripción musical automática es una tarea de recuperación de
información musical en la que están implicadas varias disciplinas, tales como
el procesamiento de señales, el aprendizaje automático, la informática, la
psicoacústica, la percepción musical y la teoría musical.
Esta diversidad de factores provoca que haya muchas formas de abordar
el problema. La mayorı́a de trabajos previos han utilizado diversos enfoques
dentro del campo del procesamiento de la señal, aplicando metodologías para
el análisis en el dominio de la frecuencia. En la literatura podemos encontrar
múltiples algoritmos de separación de señales, sistemas que emplean algoritmos
de aprendizaje y clasificación para detectar las notas, enfoques que consideran
modelos psicoacústicos de la percepción del sonido, o sistemas que aplican
modelos musicológicos como medida de coherencia de la detección.
La parte principal de un sistema de transcripción musical es el sistema
de detección de frecuencias fundamentales, que determina el número de notas
que están sonando en cada instante, sus alturas y sus tiempos de activación.
Además del sistema de detección de frecuencias fundamentales, para obtener
la transcripción completa de una pieza musical es necesario estimar el tempo
a través de la detección de pulsos musicales, y obtener el tipo de compás y la
tonalidad.
La transcripción polifónica es una tarea compleja que, hasta el momento,
no ha sido resuelta de manera eficaz para todos los tipos de sonidos armónicos.
Los mejores sistemas de detección de frecuencias fundamentales obtienen unos
porcentajes de acierto del 60%, aproximadamente. Se trata, principalmente,
de un problema de descomposición de señales en una mezcla, lo cual implica
conocimientos avanzados sobre procesamiento de señales digitales, aunque
debido a la naturaleza del problema también intervienen factores perceptuales,
psicoacústicos y musicológicos.
El proceso de transcripción puede separarse en dos tareas: convertir una
señal de audio en una representación de pianola, y convertir la pianola estimada
en notación musical.
Muchos autores sólo consideran la transcripción automática como una
conversión de audio a pianola, mientras que la conversión de pianola a notación
musical se suele ver como un problema distinto. La principal razón de esto es
que los procesos involucrados en la extracción de una pianola incluyen detección
de alturas y segmentación temporal de las notas, lo cual es una tarea ya de por
sí muy compleja. La conversión de pianola a partitura implica estimar el tempo,
cuantizar el ritmo o detectar la tonalidad. Esta fase está más relacionada con
la generación de una notación legible para los músicos.
Las principales contribuciones de este trabajo son un conjunto de nuevos
métodos eficientes propuestos para la estimación de frecuencias fundamentales.
Estos algoritmos se han evaluado y comparado con otros métodos, dando buenos
resultados con un coste computacional muy bajo.
La detección de los comienzos de eventos musicales en señales de audio, o
detección de onsets, también se ha abordado en este trabajo, desarrollando un
método simple y eficiente para esta tarea. La información sobre onsets puede
usarse para estimar el tempo o para refinar un sistema de detección de alturas.
Los métodos propuestos se han aplicado a otras tareas de recuperación de
información musical, tales como clasificación de género, de modo, o identificación
de la autorı́a de una obra. Para ello, se ha combinado caracterı́sticas de audio
con características simbólicas extraídas mediante la transcripción de señales
musicales, usando métodos de aprendizaje automático para obtener el género,
modo o autor.
2 - Conocimientos previos8
Este capı́tulo introduce los conceptos necesarios para la adecuada comprensión
del trabajo. Se describen los conceptos y términos relacionados con métodos de
procesamiento de la señal, teoría musical y aprendizaje automático.
Primero, se hace una breve introducción de las distintas técnicas para el
análisis de señales de audio basadas en la transformada de Fourier, incluyendo
diferentes representaciones de tiempo-frecuencia.
A continuación, se analizan las propiedades de las señales musicales, y se
clasifican los instrumentos con respecto a su mecanismo de generación del sonido
y a sus caracterı́sticas espectrales.
También se abordan los conceptos necesarios sobre teoría musical, describiendo las estructuras temporales y armónicas de la música occidental y su
representación usando notación escrita y computacional.
Finalmente, se describen las técnicas basadas en aprendizaje automático que
se han usado en este trabajo (redes neuronales y k vecinos más cercanos).
3 - Music transcription
This chapter briefly describes some perceptual characteristics related to the process followed by a musician when transcribing music. Then, the theoretical limitations of automatic transcription are analyzed from the standpoint of discrete signal analysis and processing.
4 - State of the art
5 - Onset detection using a harmonic filter bank
This chapter proposes a new onset detection method. The audio signal is analyzed using a one-semitone band-pass filter bank, and the temporal derivatives of the filtered values are used to detect the spectral variations related to the beginning of musical events.
The method relies on the characteristics of harmonic sounds. The first five harmonics of a tuned sound coincide with the frequencies of other notes in the equal-tempered tuning used in Western music. Another important property of these sounds is that most of their energy is usually concentrated in the first harmonics.
The one-semitone filter bank consists of a set of triangular filters whose center frequencies coincide with the musical pitches. During the sustain and release stages of a note there may be slight variations in the intensity and frequency of the harmonics, and in this scenario a direct spectral comparison can produce false positives.
In contrast, the proposed filter bank minimizes the effect of the subtle spectral variations produced during the sustain and release stages of a note, whereas during the attack the filtered amplitudes increase significantly, since most of the energy of the partials is concentrated at the center frequencies of these bands. This way, the system is especially sensitive to frequency variations larger than one semitone, and the harmonic properties of the sounds are therefore taken into account.
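
To make the scheme concrete, the following minimal Python sketch (an illustration only, not the thesis implementation, which was written in C++) builds a bank of one-semitone triangular filters on top of an STFT magnitude spectrogram and sums the positive temporal derivatives of the band amplitudes to obtain an onset detection function. The window size, hop size, frequency range and normalisation are illustrative assumptions.

    import numpy as np

    def semitone_centers(fmin=27.5, fmax=4186.0):
        # Equal-tempered pitch frequencies (roughly A0..C8) used as band centers
        n = int(round(12 * np.log2(fmax / fmin))) + 1
        return fmin * 2.0 ** (np.arange(n) / 12.0)

    def onset_detection_function(x, sr, win=2048, hop=512):
        # STFT magnitude spectrogram with a Hanning window
        window = np.hanning(win)
        n_frames = 1 + (len(x) - win) // hop
        spec = np.array([np.abs(np.fft.rfft(window * x[i * hop:i * hop + win]))
                         for i in range(n_frames)])
        bin_freqs = np.fft.rfftfreq(win, 1.0 / sr)

        # One-semitone triangular filters: each band peaks at a pitch frequency
        # and falls to zero at the neighbouring quarter-tone boundaries
        centers = semitone_centers()
        bank = np.zeros((len(centers), len(bin_freqs)))
        for b, fc in enumerate(centers):
            lo, hi = fc / 2 ** (1 / 24), fc * 2 ** (1 / 24)
            left = (bin_freqs >= lo) & (bin_freqs <= fc)
            right = (bin_freqs > fc) & (bin_freqs <= hi)
            bank[b, left] = (bin_freqs[left] - lo) / (fc - lo)
            bank[b, right] = (hi - bin_freqs[right]) / (hi - fc)
        bands = spec @ bank.T                    # filtered band amplitudes per frame

        # Sum of positive first-order derivatives across bands: sensitive to
        # attacks, robust to small variations of sustained notes
        diff = np.diff(bands, axis=0)
        odf = np.maximum(diff, 0.0).sum(axis=1)
        return odf / (odf.max() + 1e-12)         # normalised detection function

Peak picking on the resulting detection function (for instance, local maxima above a threshold) would then give the estimated onset times.
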
The method has been evaluated and compared with other works, yielding good results given its simplicity, and with a high efficiency. The algorithm, implemented in C++, and the labelled database used for its evaluation have been made publicly available for future research.
The proposed fundamental frequency estimation methods are based on signal processing techniques, thus avoiding the need for a training set.
The first of these methods is an iterative cancellation algorithm, mainly aimed at the transcription of plucked string sounds. The main goal of this system is to obtain a basic estimate of the fundamental frequencies present in real signals while keeping the computational cost low. This method has been successfully integrated into a more complex system for musical genre classification.
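
Purely as an illustration of the iterative cancellation idea, the sketch below repeatedly picks the most salient pitch candidate, attenuates a fixed (assumed) harmonic pattern in the residual spectrum and stops when the salience drops. The harmonic envelope, the attenuation factor and the stopping rule are hypothetical choices rather than the settings used in this work.

    import numpy as np

    def iterative_cancellation(mag, bin_freqs, candidates,
                               n_harm=8, decay=0.8, max_poly=6, rel_thresh=0.1):
        # mag: magnitude spectrum of one frame; bin_freqs: bin frequencies in Hz;
        # candidates: candidate fundamental frequencies in Hz (all values assumed)
        residual = np.asarray(mag, dtype=float).copy()
        pattern = decay ** np.arange(n_harm)          # fixed, assumed harmonic envelope
        detected, first_salience = [], None
        for _ in range(max_poly):
            saliences = []
            for f0 in candidates:
                bins = [int(np.argmin(np.abs(bin_freqs - h * f0)))
                        for h in range(1, n_harm + 1) if h * f0 <= bin_freqs[-1]]
                saliences.append(float(np.dot(residual[bins], pattern[:len(bins)])))
            best = int(np.argmax(saliences))
            if first_salience is None:
                first_salience = saliences[best]
            if saliences[best] < rel_thresh * first_salience:
                break                                 # remaining candidates are too weak
            f0 = candidates[best]
            detected.append(f0)
            # Cancel the harmonic pattern of the winning pitch from the residual
            for h in range(1, n_harm + 1):
                if h * f0 > bin_freqs[-1]:
                    break
                k = int(np.argmin(np.abs(bin_freqs - h * f0)))
                residual[k] *= 0.1                    # attenuate the partial bins already used
        return detected
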
Besides the iterative cancellation system, two new joint estimation methods have been proposed which are able to take the interactions among harmonics into account. Methods of this kind usually have a high computational cost because many possible pitch combinations must be evaluated; the proposed methods, however, are very efficient. They have been evaluated and compared with other works in MIREX 2007 and MIREX 2008, obtaining excellent results with very low runtimes.
The first of these methods performs a frame-by-frame analysis of the signal, obtaining a set of pitches at each instant. To do so, the possible candidates are first identified from the spectral peaks, and all the combinations of candidates are generated so that a joint estimation algorithm can search for the best combination, taking into account the interactions among harmonics.
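
A minimal sketch of this combination generation step, assuming the candidate frequencies have already been extracted from the spectral peaks and that the polyphony is bounded by a small constant (max_poly and the example frequencies are hypothetical):

    from itertools import combinations

    def candidate_combinations(candidates, max_poly=6):
        # Enumerate every subset of candidates (up to max_poly simultaneous pitches)
        # to be scored later by the joint estimation stage
        combos = []
        for n in range(1, min(max_poly, len(candidates)) + 1):
            combos.extend(combinations(candidates, n))
        return combos

    # Example: candidate_combinations([262.0, 330.0, 392.0], max_poly=2) returns
    # [(262.0,), (330.0,), (392.0,), (262.0, 330.0), (262.0, 392.0), (330.0, 392.0)]
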
To evaluate a combination, a hypothetical partial sequence (HPS) is built for each candidate. The score of a candidate is computed from the sum of the amplitudes of the harmonics in its HPS and a measure of the smoothness of its spectral envelope. The score of a combination is computed as the sum of the squared scores of its candidates, and the highest-scoring combination in the current frame is selected.
The method assumes that the spectral envelopes of the analyzed sounds tend to vary smoothly as a function of frequency. The spectral smoothness principle had already been used (although in a different way) in previous works. The new spectral smoothness measure is based on the convolution of the hypothetical partial sequence with a Gaussian window.
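
The sketch below shows one plausible way of computing such scores from a candidate's hypothetical partial sequence. The Gaussian width, the normalisation and the way loudness and smoothness are combined are assumptions made for illustration; they do not reproduce the exact formulation of this work.

    import numpy as np

    def gaussian_window(radius=2, sigma=1.0):
        t = np.arange(-radius, radius + 1)
        g = np.exp(-0.5 * (t / sigma) ** 2)
        return g / g.sum()

    def candidate_score(hps, sigma=1.0):
        # hps: amplitudes of the hypothetical partial sequence of one candidate
        # (assumed to contain at least as many partials as the Gaussian window)
        hps = np.asarray(hps, dtype=float)
        loudness = hps.sum()                          # sum of harmonic amplitudes
        # Smoothness: compare the HPS with its Gaussian-smoothed version;
        # smooth spectral envelopes deviate little from the smoothed sequence
        smoothed = np.convolve(hps, gaussian_window(sigma=sigma), mode='same')
        smoothness = 1.0 - np.abs(hps - smoothed).sum() / (loudness + 1e-12)
        return loudness * max(smoothness, 0.0)

    def combination_score(hps_list):
        # Combination score as the sum of squared candidate scores, which favours
        # a few strong, smooth candidates over many weak ones
        return sum(candidate_score(h) ** 2 for h in hps_list)

In a frame-by-frame joint estimation setting, combination_score would be evaluated for every generated combination and the highest-scoring one would be kept for the current frame.
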
Given a combination, the HPS of each candidate is computed taking into account the interactions among the harmonics of all the candidates in the combination. To do so, the overlapped partials are first identified, and their amplitudes are estimated by linear interpolation using the amplitudes of the non-overlapped harmonics.
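
A possible sketch of that interpolation step, assuming the overlapped partials have already been identified (the boolean mask passed to the function is a hypothetical input):

    import numpy as np

    def interpolate_overlapped(partial_amps, overlapped):
        # partial_amps: observed amplitudes of a candidate's partials (harmonics 1..H)
        # overlapped:   boolean mask, True where a partial collides with another candidate
        amps = np.asarray(partial_amps, dtype=float).copy()
        mask = np.asarray(overlapped, dtype=bool)
        idx = np.arange(len(amps))
        if (~mask).sum() >= 2:
            # Linear interpolation of each overlapped partial from the surrounding
            # non-overlapped harmonic amplitudes
            amps[mask] = np.interp(idx[mask], idx[~mask], amps[~mask])
        return amps

    # Example: interpolate_overlapped([1.0, 0.9, 2.5, 0.4], [False, False, True, False])
    # replaces the third (overlapped) partial by 0.65, halfway between 0.9 and 0.4
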
Unlike the previously described iterative cancellation method, which assumes a constant harmonic pattern, the joint estimation method can infer hypothetical harmonic patterns from the spectral data,
• An exhaustive review of the state of the art in onset detection and in fundamental frequency detection in polyphonic signals. The existing methods have been classified into salience functions, iterative cancellation, joint estimation, supervised learning, unsupervised learning, matching pursuit, Bayesian models, statistical spectral models, blackboard systems, and database matching methods. The strengths and weaknesses of each of these categories have been analyzed.
with the corresponding labelled pitches. As described in this work, aligning a database of this kind is not an easy task; nevertheless, it is a research line that should be explored.
Fundamental frequency estimation systems that only consider individual frames will probably not be able to improve significantly on the current results. The amount of data present in the period spanned by a single frame is not enough to detect the pitches, even for an expert musician. Context plays a very important role in music. For instance, it is very hard to identify the pitches when we listen to two unsynchronized songs playing simultaneously, even if they are not very complex.
Therefore, the pitch estimation task should somehow be complemented with temporal information. The coherence of the detections over time has been considered in one of the proposed methods, but it could be extended using a reliable fundamental frequency tracking system. However, this task is difficult from a computational point of view.
Future research on the alternative architectures proposed for the joint estimation methods is also a promising line of work. The methodology of this system makes it possible, for instance, to jointly analyze the combinations of all the frames lying between two consecutive onsets, obtaining the pitches for that whole interval. Perceptually, the results obtained with this scheme were better than those obtained by analyzing adjacent sets of frames in isolation. However, the lower temporal resolution and the errors of the onset detection method, together with the problem of detecting offsets within the interval between two onsets, limit the accuracy obtained with a classical evaluation metric. As future work, it is planned to evaluate these architectures using a perceptual metric.
Multimodal information is also a promising line of work. Including musical models that consider tonality, tempo or meter to infer note probabilities could complement the pitch estimates. These lines of work are planned within the DRIMS project, in collaboration with the Music Technology Group of the Universitat Pompeu Fabra in Barcelona.
Interactive music transcription is also planned as future work within the MIPRCV project (Consolider Ingenio 2010). The goal is to develop a computer-assisted music transcription method: using a visual interface, the automatically transcribed segments can be accepted or corrected by an expert musician, and these validated segments can then be used as information for the estimation method.
Bibliography
Ahmed, N., Natarajan, T., and Rao, K. (1974). Discrete cosine transform. IEEE
Trans. on Computers, 23:90–93. (Cited on page 16).
Bello, J. P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., and Sandler,
M. B. (2005). A tutorial on onset detection in music signals. IEEE Trans. on
Speech and Audio Processing, 13(5):1035–1047. (Cited on pages 52, 53, 77,
and 78).
Bello, J. P., Duxbury, C., Davies, M., and Sandler, M. (2004). On the use of
phase and energy for musical onset detection in the complex domain. IEEE
Signal Processing Letters, 11(6):553–556. (Cited on page 79).
Bertin, N., Badeau, R., and Richard, G. (2007). Blind signal decompositions
for automatic transcription of polyphonic music: NMF and K-SVD on the
benchmark. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing
(ICASSP), volume I, pages 65–68, Honolulu, HI. (Cited on page 70).
Brossier, P. (2005). Fast onset detection using aubio. In MIREX (2005), onset
detection contest. (Cited on page 92).
Cao, C., Li, M., Liu, J., and Yan, Y. (2007). Multiple f0 estimation in polyphonic
music. In MIREX (2007), multiple f0 estimation and tracking contest. (Cited
on pages 65 and 160).
Cemgil, A. T., Kappen, B., and Barber, D. (2003). Generative model based
polyphonic music transcription. In IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics, pages 181–184. (Cited on pages 1 and 72).
Cemgil, A. T., Kappen, H. J., and Barber, D. (2006). A generative model for
music transcription. IEEE Trans. on Audio, Speech and Language Processing,
14(2):679–694. (Cited on page 72).
Chang, W. C., Su, A. W. Y., Yeh, C., Roebel, A., and Rodet, X. (2008).
Multiple-F0 tracking based on a high-order HMM model. In Proc. of the 11th
Int. Conference on Digital Audio Effects (DAFx), Espoo, Finland. (Cited on
pages 66, 168, and 173).
Cont, A. (2008). Modeling Musical Anticipation: From the time of music to the
music of time. PhD thesis, University of Paris VI and University of California
in San Diego. (Cited on page 44).
Davy, M., Godsill, S. J., and Idier, J. (2006). Bayesian analysis of polyphonic
western tonal music. Journal of the Acoustical Society of America, 119:2498–
2517. (Cited on page 72).
Dixon, S. (2006). Onset detection revisited. In Proc. of the Int. Conf. on Digital
Audio Effects (DAFx), pages 133–137, Montreal, Canada. (Cited on pages
77, 78, 79, 91, 92, and 140).
Duda, R., Lyon, R., and Slaney, M. (1990). Correlograms and the separation
of sounds. In Proc. IEEE Asilomar Conference on Signals, Systems and
Computers. (Cited on page 60).
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. John
Wiley and Sons. (Cited on pages xi, 38, 39, 40, 41, and 104).
Durrieu, J. L., Richard, G., and David, B. (2008). Singer melody extraction
in polyphonic signals using source separation methods. In Proc of the
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 169–172, Las Vegas, NV. (Cited on page 163).
Duxbury, C., Sandler, M., and Davies, M. (2002). A hybrid approach to musical
note onset detection. In Proc. Digital Audio Effects Conference (DAFx), pages
33–38, Hamburg, Germany. (Cited on pages 78, 80, and 83).
Emiya, V., Badeau, R., and David, B. (2007). Multipitch estimation and
tracking of inharmonic sounds in colored noise. In MIREX (2007), multiple
f0 estimation and tracking contest. (Cited on pages 159 and 160).
Emiya, V., Badeau, R., and David, B. (2008b). Automatic transcription of piano
music based on HMM tracking of jointly-estimated pitches. In Proc. European
Signal Processing Conference (EUSIPCO), Rhodes, Greece. (Cited on pages
66, 123, 124, 128, 129, 159, and 160).
Goto, M. and Muraoka, Y. (1995). A real-time beat tracking system for audio
signals. In Proc. of International Computer Music Conference (ICMC), pages
171–174. (Cited on page 85).
Harris, F. J. (1978). On the use of windows for harmonic analysis with the
discrete Fourier transform. Proceedings of the IEEE, 66(1):51–83. (Cited on
page 11).
Honing, H. (2001). From time to time: The representation of timing and tempo.
Computer Music Journal, 25(3):50–61. (Cited on page 31).
Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A
guide to theory, algorithm, and system development. Prentice Hall. (Cited
on pages 16 and 22).
Juslin, P. N., Karlsson, J., Lindström, E., Friberg, A., and Schoonderwaldt,
E. (2006). Play it again with feeling: Computer feedback in musical
communication of emotions. Journal of Experimental Psychology: Applied,
12(2):79–95. (Cited on page 33).
Klapuri, A., Eronen, A. J., and Astola, J. T. (2006). Analysis of the meter
of acoustic musical signals. IEEE Trans. on Audio, Speech and Language
Processing, 14(1):342–355. (Cited on page 32).
Klapuri, A., Virtanen, T., and Holm, J.-M. (2000). Robust multipitch
estimation for the analysis and manipulation of polyphonic musical signals. In
Proc. COST-G6 Conference on Digital Audio Effects (DAFx), pages 233–236.
(Cited on page 44).
Itoh, K., Miyazaki, K., and Nakada, T. (2003). Ear advantage and
consonance of dichotic pitch intervals in absolute-pitch possessors. Brain and
cognition, 53(3):464–471. (Cited on page 31).
Lacoste, A. and Eck, D. (2005). Onset detection with artificial neural networks.
In MIREX (2005), onset detection contest. (Cited on page 78).
Lee, W.-C. and Kuo, C.-C. J. (2006). Musical onset detection based on adaptive
linear prediction. IEEE International Conference on Multimedia and Expo,
0:957–960. (Cited on page 78).
Lee, W.-C., Shiu, Y., and Kuo, C.-C. J. (2007). Musical onset detection
with linear prediction and joint features. In MIREX (2007), onset detection
contest. (Cited on page 79).
Leveau, P., Vincent, E., Richard, G., and Daudet, L. (2008). Instrument-specific
harmonic atoms for mid-level music representation. IEEE Trans. on Audio,
Speech, and Language Processing, 16(1):116 – 128. (Cited on pages xi and 71).
Lidy, T., Grecu, A., Rauber, A., Pertusa, A., Ponce de León, P. J., and Iñesta,
J. M. (2009). A multi-feature multi-classifier ensemble approach for audio
music classification. In MIREX (2009), audio genre classification contest,
Kobe, Japan. (Cited on page 4).
Lidy, T., Rauber, A., Pertusa, A., and Iñesta, J. M. (2007). Improving
genre classification by combination of audio and symbolic descriptors using a
transcription system. In Proc. of the 8th International Conference on Music
Information Retrieval (ISMIR), pages 61–66, Vienna, Austria. (Cited on
pages 4, 52, 119, 159, and 168).
Lidy, T., Rauber, A., Pertusa, A., Ponce de León, P. J., and Iñesta, J. M. (2008).
Audio Music Classification Using A Combination Of Spectral, Timbral,
Rhythmic, Temporal And Symbolic Features. In MIREX (2008), audio genre
classification contest, Philadelphia, PA. (Cited on page 4).
Lloyd, L. S. (1970). Music and Sound. Ayer Publishing. (Cited on page 48).
Marolt, M., Kavcic, A., and Privosnik, M. (2002). Neural networks for note onset
detection in piano music. In Proc. International Computer Music Conference
(ICMC), Gothenburg, Sweden. (Cited on page 81).
Paiement, J. C., Grandvalet, Y., and Bengio, S. (2008). Predictive models for
music. Connection Science, 21:253–272. (Cited on page 44).
Patterson, R. D., Nimmo-Smith, I., Weber, D. L., and Milroy, R. (1982). The
deterioration of hearing with age: Frequency selectivity, the critical ratio,
the audiogram, and speech threshold. Journal of the Acoustical Society of
America, 72:1788–1803. (Cited on page 17).
Peeters, G. (2004). A large set of audio features for sound description (similarity
and classification) in the CUIDADO project. Technical report, IRCAM, Paris,
France. (Cited on pages 18, 19, and 20).
Pertusa, A. and Iñesta, J. M. (2009). Note onset detection using one semitone
filter-bank for MIREX 2009. In MIREX (2009), onset detection contest.
(Cited on pages 84, 95, and 96).
Pertusa, A., Klapuri, A., and Iñesta, J. M. (2005). Recognition of note onsets
in digital music using semitone bands. Lecture Notes in Computer Science,
3773:869–879. (Cited on pages 84 and 92).
Plumbley, M. D., Abdallah, S., Bello, J. P., Davies, M., Monti, G., and Sandler,
M. (2002). Automatic music transcription and audio source separation.
Cybernetics and Systems, 33(6):603–627. (Cited on pages 69, 70, and 74).
Rodet, X., Escribe, J., and Durignon, S. (2004). Improving score to audio
alignment: Percussion alignment and precise onset estimation. In Proc. of
Int. Computer Music Conference (ICMC), pages 450–453. (Cited on page
53).
Sano, H. and Jenkins, B. K. (1989). A neural network model for pitch perception.
Computer Music Journal, 13(3):41–48. (Cited on page 102).
Schwefel, H. P. (1995). Evolution and Optimum Seeking. Wiley & Sons, New
York. (Cited on page 66).
Serra, X. (1997). Musical sound modeling with sinusoids plus noise. In Roads,
C., Pope, S. T., Picialli, A., and De Poli, G., editors, Musical signal processing,
pages 91–122. Swets and Zeitlinger. (Cited on pages 23 and 59).
Stevens, S., Volkman, J., and Newman, E. (1937). A scale for the measurement
of the psychological magnitude of pitch. Journal of the Acoustical Society of
America, 8(3):185–190. (Cited on page 16).
Tan, H. L., Zhu, Y., and Chaisorn, L. (2009). An energy-based and pitch-based
approach to audio onset detection. In MIREX (2009), onset detection contest.
(Cited on pages 52, 82, 94, 95, 96, 97, and 99).
Tzanetakis, G., Essl, G., and Cook, P. (2001). Audio analysis using the discrete
wavelet transform. In Proc. Conf. in Acoustics and Music Theory Applications
(WSES). (Cited on page 13).
Vidal, E., Casacuberta, F., Rodrı́guez, L., Civera, J., and Martı́nez, C. D.
(2006). Computer-assisted translation using speech recognition. IEEE Trans.
on Audio, Speech and Language Processing, 14(3):941–951. (Cited on page
173).
Vincent, E., Bertin, N., and Badeau, R. (2007). Two nonnegative matrix
factorization methods for polyphonic pitch transcription. In MIREX (2007),
multiple f0 estimation and tracking contest. (Cited on pages 70, 159, 160,
163, and 164).
Vincent, E. and Rodet, X. (2004). Music transcription with ISA and HMM. In
Proc. 5th International Conference on Independent Component Analysis and
Blind Signal Separation, pages 1197–1204. (Cited on page 72).
Virtanen, T. (2000). Audio signal modeling with sinusoids plus noise. MSc
Thesis, Tampere University of Technology. (Cited on pages 121 and 122).
Walmsley, P., Godsill, S., and Rayner, P. (1999). Polyphonic pitch tracking
using joint bayesian estimation of multiple frame parameters. In Proc.
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
(WASPAA), pages 119–122, New Paltz, NY. (Cited on page 72).
Wan, J., Wu, Y., and Dai, H. (2005). A harmonic enhancement based multipitch
estimation algorithm. In IEEE International Symposium on Communications
and Information Technology (ISCIT) 2005, volume 1, pages 772 – 776. (Cited
on page 65).
Wang, W., Luo, Y., Chambers, J. A., and Sanei, S. (2008). Note onset detection
via nonnegative factorization of magnitude spectrum. EURASIP Journal on
Advances in Signal Processing. (Cited on page 81).
Wood, A. (2008). The physics of music. Davies Press. (Cited on page 47).
Woodruff, J., Li, Y., and Wang, D. (2008). Resolving overlapping harmonics
for monaural musical sound separation using pitch and common amplitude
modulation. In Proc. of the International Symposium on Music Information
Retrieval (ISMIR), pages 538–543, Philadelphia, PA. (Cited on page 130).
Yeh, C., Röbel, A., and Rodet, X. (2005). Multiple fundamental frequency
estimation of polyphonic music signals. In IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), volume III, pages 225–228,
Philadelphia, PA. (Cited on pages 65, 126, 127, 129, and 130).
Yeh, C., Roebel, A., and Chang, W. C. (2008). Multiple F0 estimation for
MIREX 08. In MIREX (2008), multiple f0 estimation and tracking contest.
(Cited on pages 163, 164, and 167).
Yeh, C., Roebel, A., and Rodet, X. (2006). Multiple f0 tracking in solo
recordings of monodic instruments. In Proc. of the 120th AES Convention,
Paris, France. (Cited on pages 48 and 66).
Yin, J., Sim, T., Wang, Y., and Shenoy, A. (2005). Music transcription using
an instrument model. In Proc. of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), volume III, pages 217–
220. (Cited on page 65).
Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., and Woodland, P.
(2000). The HTK book (for HTK version 3.1). Cambridge University. (Cited
on page 87).
Zhou, R., Mattavelli, M., and Zoia, G. (2008). Music Onset Detection Based
on Resonator Time Frequency Image. IEEE Transactions On Audio, Speech
And Language Processing, 16(8):1685–1695. (Cited on page 80).
Zhou, R., Reiss, J. D., Mattavelli, M., and Zoia, G. (2009). A computationally
efficient method for polyphonic pitch estimation. EURASIP Journal on
Advances in Signal Processing, (28). (Cited on pages 63, 127, and 160).
Zhu, Y. and Kankanhalli, M. (2006). Precise pitch profile feature extraction from
musical audio for key detection. IEEE Trans. on Multimedia, 8(3):575–584.
(Cited on pages 94 and 99).
Zwicker, E. (1961). Subdivision of the audible frequency range into critical bands
(Frequenzgruppen). Journal of the Acoustical Society of America, 33(2):248.
(Cited on page 16).