Automatic Drum Transcription and Source Separation
Derry FitzGerald,
Conservatory of Music and Drama,
Dublin Institute of Technology.
2004
Abstract
While research has been carried out on automated polyphonic music transcription, to date
the problem of automated polyphonic percussion transcription has not received the same
degree of attention. A related problem is that of sound source separation, which attempts
to separate a mixture signal into its constituent sources. This thesis focuses on the task of
polyphonic percussion transcription and sound source separation of a limited set of drum
instruments, namely the drums found in the standard rock/pop drum kit.
As there was little previous research on polyphonic percussion transcription, a
broad review of music information retrieval methods, including previous polyphonic
percussion systems, was also carried out to determine if there were any methods which
were of potential use in the area of polyphonic drum transcription. Following on from
this a review was conducted of general source separation and redundancy reduction
techniques, such as Independent Component Analysis and Independent Subspace
Analysis, as these techniques have shown potential in separating mixtures of sources.
Upon completion of the review it was decided that a combination of the blind
separation approach, Independent Subspace Analysis (ISA), with the use of prior
knowledge as used in music information retrieval methods, was the best approach to
tackling the problem of polyphonic percussion transcription as well as that of sound
source separation.
A number of new algorithms which combine the use of prior knowledge with the
source separation abilities of techniques such as ISA are presented. These include sub-
band ISA, Prior Subspace Analysis (PSA), and an automatic modelling and grouping
technique which is used in conjunction with PSA to perform polyphonic percussion
transcription. These approaches are demonstrated to be effective in the task of polyphonic
percussion transcription, and PSA is also demonstrated to be capable of transcribing
drums in the presence of pitched instruments.
A sound source separation scheme is presented, which combines two previous
separation methods, ISA and the DUET algorithm, with the use of prior knowledge
obtained from the transcription algorithms to allow percussion instrument source
separation.
Acknowledgements
This thesis would not be what it is without the help, encouragement and friendship of
many people, so now is the time to give credit where credit is due.
I would like to thank Paul McGettrick for starting things off and for giving me the
freedom to choose the topic which became the focus of my research these past few years.
I would also like to thank him for all his encouragement and support.
I would like to thank my principal supervisor, Dr. Bob Lawlor, for all his
patience, guidance, support and encouragement, especially for all the times when I’m
sure it seemed like this research was going nowhere. I would like to thank Dr. Eugene
Coyle for his enthusiasm and support and for encouraging the final push to completion of
this thesis. I would also like to thank Dr. Dermot Furlong for all his advice and ability to
see ways out of some of the corners I painted myself into during the course of this
research.
Thanks also to Ben Rawlins and Charlie Cullen for all the sanity-preserving long lunches and daft discussions on popular culture, as well as for putting up with all my bad jokes.
Thanks to Dan Barry for proof-reading this thesis and for discussions on source
separation in general.
I would like to thank my parents and brothers and sister for all their
encouragement throughout the years.
Finally, I would like to thank my wife Mayte for putting up with all my ups and
downs and mood swings over the past few years. Your love and support continues to
amaze me. This thesis is dedicated to you and our baby son Kevin.
1. Introduction
The human auditory system is a remarkable information processing system. From just
two input channels we are able to identify and extract information on a large number of
sources. We are all familiar with our ability to pick out what someone is saying during a
conversation that takes place in a noisy environment such as a crowded bar or a rock
concert, as well as our ability to recognise several different sounds occurring at once.
These are everyday tasks that people perform without paying any attention to them, taking these abilities for granted without ever realising just how efficient the human auditory system is at performing this difficult task. This ability in humans has been studied as part
of psychoacoustics under the title of auditory scene analysis [Bregman 90]. Attempts to
replicate this ability using computers have been studied under the term Computational
Auditory Scene Analysis, such as the work carried out in [Ellis 96].
An interesting subset of the more general field of auditory scene analysis is that of
music transcription. Some form of music exists in all cultures throughout the world and
we are exposed to it constantly in our daily lives, on television and radio, in shops and
bars. We are all to a greater or lesser extent able to tap along to the rhythm of a piece of
music and can easily identify the melody in a song, but the task of identifying the
underlying harmonic structure requires specialised training and practice. In a manner
analogous to the way an experienced car mechanic can identify certain faults by listening
to the car engine running, the skilled musician is able to transcribe music just by listening
to it, identifying a series of notes, their respective pitches if pitched instruments are used,
and the associated instruments. Transcription of music can be defined as listening to a
piece of music and writing down musical notation that corresponds to the events or notes
that make up the piece of music in question. In effect an acoustic signal is analysed and
then represented with some form of symbolic notation.
However, getting a computer to mimic the abilities of a human listener is no
trivial task. Even tapping along to a piece of music, seemingly simple for humans, is difficult for computers, and indeed much research has gone into
creating systems that are capable of tapping along to a given piece of music [Scheirer 98],
[Smith 99]. The more difficult task of automated music transcription has also received considerable attention.
The availability of separated sources would obviously make the task of automatic transcription considerably easier by allowing the
transcription algorithm to focus on a single source at a time without interference from the
other sources. Conversely, in some cases, having a transcription of the contents of the
signal can be of use in sound source separation, by providing information to the
separation algorithm which can be used to guide and aid the separation. The past few
years have seen a growth in interest in the problem of sound source separation, with the
development of techniques such as Independent Component Analysis (ICA) [Comon 94],
Independent Subspace Analysis (ISA) [Casey 00] and the Degenerate Unmixing
Estimation Technique (DUET) algorithm [Yilmaz 02].
The potential uses of such sound source separation systems are numerous,
including their use in hearing aids, as an aid for a student studying the performance of a given performer within an ensemble piece, and, if the audio quality is sufficiently high, the sampling of separated sources for use in another piece. This is a
practice that has become widespread in popular music today, whereby a new song or
piece is built upon a section taken from another recording. The ability to separate sources
would allow increased flexibility and greater choice in the materials that could be chosen
as the basis for a new piece. Another potential application is the automatic conversion of
stereo recordings to 5.1 surround sound.
It should be noted that, for monophonic unpitched instruments such as drums,
obtaining a set of separated signals considerably simplifies the transcription problem,
reducing it to that of identifying each of the sources and detecting the onset time of each
event in the separated signal. On the other hand, carrying out sound source separation on
a mixture of piano and guitar where both instruments are playing chords still leaves the
notes played on each source to be transcribed. Therefore, the problem of transcribing
polyphonic percussive music can be seen to be more closely related to the problem of
sound source separation than that of transcribing polyphonic pitched music. As it was felt
that any scheme for the transcription of polyphonic percussive music would involve some
degree of sound source separation, it was decided to attempt sound source separation of
percussive instruments at the same time as attempting the transcription of these
instruments.
This thesis deals with the creation of systems for the transcription and sound source separation of percussive instruments. It was decided to limit the set of percussive
instruments to be transcribed to those drums found in the typical “standard” rock/pop
drum kit. The drums in question are the snare drum, the bass drum (also known as the
kick drum – these two names are used synonymously throughout this thesis), the tom-toms, the hi-hats and the cymbals. These drums were chosen as they represent the most
commonly occurring percussive instruments in popular music, and it was felt that a
system that could transcribe these drums would be a system that would work in a large
number of cases, as well as providing a good starting point for more general systems in
the future. It was also felt that to be able to transcribe a small set of drums robustly would
be better than to have a system that attempted to transcribe a larger number of percussive
instruments less accurately. Previous attempts at polyphonic percussion transcription
focused on these drums for much the same reasons but were not particularly successful in
transcribing these drums robustly [Goto 94], [Sillanpää 00].
Transcription, in the context of this thesis, is taken to be simply a list of sound sources and the times at which each occurrence of these sound sources occurs. It
was decided not to pursue fitting the transcription results to a metric grid, as establishing
such a grid for an audio signal is not yet a solved problem, and also it was felt that such
an attempt would distract from the true focus of the work, namely to transcribe
recordings of polyphonic percussion. When used in this thesis in the context of
percussion transcription, the word “polyphonic” is taken to mean the occurrence of two
or more sound sources simultaneously. In similar contexts “monophonic” can be taken to
mean one sound source occurring at a time.
It was also decided to focus on the case of percussion instrument transcription and
separation in the context of single channel mixtures. The reason for this was that it was
felt that a system that could work on single channel audio would be more readily
extended to stereo or multi-channel situations, rather than vice-versa. Also, the single
channel case is in some respects more difficult than cases where two or more channels
are available, where spatial cues such as pan could be leveraged to obtain
information for transcription. However, even in multi-channel recordings these cues may
not always be available, and so a system that does not need to use such cues will be
inherently more generally applicable. Systems designed for single channel audio reflect
such a situation.
As noted above, there has been a lack of research in the area of polyphonic
percussive music transcription. Even in cases where such attempts have been made, there
has been a lack of evaluation of the performance of many of the systems, making it
difficult to determine their effectiveness or otherwise. As a result of this lack of research,
it was decided to carry out a literature review which covered not just percussive music
transcription but any areas which it was felt might be of use in tackling the problem of
polyphonic percussion transcription. This review ranged over areas such as pitched
instrument transcription and rhythmic analysis to sound source separation algorithms
such as ICA and the DUET algorithm.
Having carried out such a review, it was decided that a system that combined the
separation abilities of algorithms such as Independent Subspace Analysis with the use of
prior knowledge, such as simple models of percussive instruments, represented the best
route for tackling the problem of polyphonic percussive music transcription.
The main contribution of this work is the development of a number of algorithms
that are capable of transcribing robustly the drums found in a standard rock/pop drum kit
through the use of a technique which we call ‘Prior Subspace Analysis’ (PSA) in
conjunction with an automatic modelling and grouping technique for these drum sounds.
The transcription results obtained using these techniques are then used to guide a sound
source separation system for these drums. The sound source separation scheme proposed
is a novel combination of two previously existing source separation methods,
Independent Subspace Analysis, and the binary time-frequency masking used for sound
source separation in the DUET algorithm. An extension of PSA is also demonstrated to
be effective in transcribing snare, kick drum and hi-hats or cymbals in the presence of
pitched instruments.
A secondary contribution of this work is the reformulation of ISA to incorporate a
new dimensional reduction method, called ‘Locally Linear Embedding’ [Saul 03]. This is
demonstrated to be better at recovering low amplitude sources than the original
formulation of ISA. A reformulation of ISA to achieve independence in both time and frequency simultaneously is also presented.
As this thesis deals with the drum sounds found in a “standard” rock/pop kit, it was felt it
would be appropriate to describe briefly the properties of these drums. These properties
are summarised in section 1.3.
The literature review has been divided into two sections. The first section details
systems that were designed explicitly for use with musical signals or with speech signals
and is contained in Chapter Two of this thesis. The systems described often make use of
knowledge obtained from studies of the human auditory system, such as the studies
carried out by Bregman [Bregman 90]. Topics covered include drum transcription, beat
tracking and rhythm recognition, sinusoidal modelling, musical instrument identification,
polyphonic music transcription and sound source separation systems. These topics were
analysed with a view to determining any techniques or methodologies which could be of
use in the problem of polyphonic percussion transcription and sound source separation.
The second section of the literature review, contained in Chapter Three, deals
with information theoretic or redundancy reduction based approaches to extracting
information from signals or data sets. With one exception, namely ISA, the systems
described were not explicitly designed for use with audio, though even ISA had similar
precedents in the field of image analysis. These systems represent general approaches to
extracting information from a data set and were felt to have potential use in extracting
information from audio signals. These techniques included Principal Component
Analysis (PCA), ICA, ISA and Sparse Coding. The advantages and disadvantages of
these various techniques are also discussed in detail. This chapter also contains two novel
contributions, firstly the reformulation of ISA to achieve independence in both time and
frequency simultaneously and secondly another reformulation of ISA to incorporate a
new dimensional reduction technique called “Locally Linear Embedding” which offers
some advantages over the use of PCA as a technique for dimensional reduction when
used as part of ISA.
Chapter Four contains the bulk of the original contributions in this thesis. It
details why the approach taken to the problem of polyphonic percussion transcription was chosen. This approach was the use of the ISA-type methodologies described in Chapter 3 in
conjunction with the incorporation of prior knowledge such as that used in many of the
systems described in Chapter Two. Firstly a simple sub-band version of ISA is
implemented which takes advantage of the fact that different types of drums have their
energies concentrated in different regions of the frequency spectrum. This system is
shown to be capable of transcribing drum loops containing snares, kick drums and hi-
hats.
A new technique for transcribing drums called Prior Subspace Analysis (PSA) is
then derived, which makes use of prior models of the spectra of drum sounds to eliminate
the need for the dimensional reduction step involved in ISA. This has the advantage of
making PSA much faster than ISA or sub-band ISA, but more importantly it is more
robust in determining drums which are typically at lower amplitudes such as hi-hats. PSA
is demonstrated to be robust under a wide range of conditions provided that the main
energy regions of the spectra of the sources to be transcribed do not coincide. PSA is also
shown to be capable of transcribing snare, kick drums and hi-hats or ride cymbals in the
presence of pitched instruments. To overcome the problem of drums which have
coinciding main energy regions an extended version of PSA which incorporates an
automatic modelling and grouping stage is implemented and is shown to be capable of
transcribing mixtures of snare, kick, tom-toms, hi-hats and cymbals in a wide range of
conditions.
Chapter Five then contains original contributions related to the sound source
separation of percussive instruments. In particular, it describes new techniques
specifically tailored to the sound source separation of the drum sounds. These include the
use of ISA with a new clustering algorithm which takes advantage of the transcription
results to obtain re-synthesis of the drum sources of high amplitude, and then the use of
binary time-frequency masking such as used in the DUET algorithm to recover the drum
sources of low amplitude such as the hi-hats. This results in a hybrid sound source
separation system which takes advantage of the best properties of two previously existing
sound source separation methods.
Chapter Six then contains conclusions on the work done and also highlights areas
for future research in the area of polyphonic percussion transcription.
Appendix 1 provides a detailed breakdown of the drum transcription results
obtained when testing drum transcription on excerpts from pop and rock songs, while
Appendix 2, which is to be found on the accompanying CD, contains audio examples
related to selected figures throughout this thesis.
The drums of interest in this thesis can be divided into two categories, membranes and
plates. Membranes include snare, bass drum and tom-toms, while plates include hi-hats
and cymbals. Sound is produced by striking the membrane or plate; typically this is done using a drum stick, though sometimes brushes are used. The exception to this is the
kick drum which is typically struck using a beater made of epoxy or rubber which is
mounted on a foot pedal. The striking of a given drum with a stick or beater can be
modeled as an impulse function, and so a broad range of frequencies will be present in
the impact. Therefore, all possible modes of vibration of the plate or membrane will be
excited simultaneously. The narrower the frequency band associated with a given mode
the longer the mode will sound for. The interested reader is referred to Fletcher and
Rossing [Fletcher 98] for a detailed mathematical account of the properties of ideal
membranes and plates.
The bass drum is typically of diameter 50-75 cm, and has two membranes, one on
each side of the drum. The membrane that is struck is termed the beating head, and the
other membrane is termed the resonating head. In a standard rock/pop drum kit a hole is
often cut in the resonating head. Typically the beating head will be tuned to a greater
tension than the resonating head.
The snare drum is a two-headed membrane drum. Typically it is in the region of
35 cm in diameter and 13-20 cm deep. Strands of wire or gut known as the snares are
stretched across the lower head. When the upper head is struck the lower head will
vibrate against the snares. At a large enough amplitude of the lower head the snares will
leave the lower head at some stage in the vibration cycle. The snares will then
subsequently return to strike the lower head, resulting in the characteristic sound of the
snare drum.
Tom-toms are membrane drums which range from 20 to 45 cm in diameter and from 20 to 50 cm in depth, and can have either one or two heads. More so than
the other membrane drums, tom-toms tend to have an identifiable pitch, particularly tom-
toms with single heads. If a tom-tom is struck sufficiently hard, the deflection of
the head may be large enough to result in a significant change in the tension of the head.
This change in tension momentarily raises the frequencies of all the modes of vibration
and so the apparent pitch is higher than it would otherwise be. As the vibrations of the
head die away the tension gradually returns to its original value, resulting in a perceived
pitch glide which is characteristic of tom-toms.
All the membrane drums discussed here are capable of being tuned by adjusting
the tension of the heads. This, in conjunction with the different sizes available for each
drum, means that there can be considerable variation in the timbre obtained within each
drum type. However, it can be noted that membrane drums have most of their spectral
energy contained in the lower regions of the frequency spectrum, typically below 500 Hz,
with the snare usually containing more high frequency energy than the other membrane
drums. Also, within the context of a given drum kit, the kick drum will have a lower
spectral centroid than that of the snare drum.
The remaining two drum types being dealt with in this thesis, cymbals and hi-
hats, are metallic plate drums. Cymbals are typically made of bronze and range from 20
cm to 74 cm in diameter. They are saucer shaped with a small spherical dome in the
center. Hi-hats are of similar shape, but their range in size is smaller. Hi-hats consist of
two such plates mounted on a stand attached to a pedal, the position of which determines
whether the two plates are pressed together in what is termed “closed”, or are free to
vibrate at a distance from each other in what is termed “open”. In the closed position the
plates are restricted in their vibration by contact with each other, resulting in a sound
which is typically shorter in duration and less energetic than that obtained in the open
position which has a sound which is closer to that of a cymbal.
In general, the plate drums tend to have their spectral energy spread out more
evenly across the frequency spectrum than the membrane drums, and so contain
significantly more high frequency content. It can also be observed that in most recordings
the hi-hats and cymbals are of lower amplitude than the membrane drums.
1.4 Conclusions
This chapter has outlined the background to the problem of automatic polyphonic drum
transcription and has highlighted the lack of research in this area in comparison to that of
polyphonic music transcription. The scope of the thesis was then set out. The drums of
interest were limited to those found in a “standard” rock/pop drum kit, namely snare,
kick, tom-toms, hi-hats and cymbals. Transcription in the context of this thesis was
defined as a list of sound sources, and the time at which each occurrence of each sound
source occurs. Further, the transcription and source separation algorithms were to deal
with the most difficult case, namely single channel mixtures.
The remainder of the thesis was then outlined, with Chapter Two detailing a
review of music information retrieval techniques and Chapter Three dealing with
information theoretic approaches. Chapter Four deals with Drum Transcription
Algorithms and contains the bulk of the novel contributions in this thesis. Following on
from this, Chapter Five describes novel source separation algorithms for drum sounds,
and Chapter Six contains conclusions on the contributions made and areas for future
work.
Finally, a brief overview of the properties of the drums of interest was included,
and showed that these drums could be divided into two categories, membrane drums, and
metal plate drums. Having outlined the background and scope of the research, as well as
the properties of the drums of interest, the main work in this thesis follows in the
succeeding chapters.
2. Music Information Retrieval Methods
This chapter deals with various attempts and methodologies for extracting information
from musical signals. This is still a developing field with many problems still open for
further exploration. It encompasses many areas such as transcription (both of pitched
instruments and percussion instruments), sound source separation, instrument
identification, note onset detection, beat tracking and rhythmic analysis, as well as other
tasks such as “query by humming”, where a song is identified from a user humming a
melody. Due to the lack of work focusing solely on the transcription of percussion
instruments it was decided to extend the literature review to cover the methodologies
used in other areas of music information retrieval. This was done to see if anything could
be garnered from these methodologies and approaches that could be applied to the
problem of drum transcription.
Many approaches that are mentioned in this chapter could be loosely termed
psychoacoustic approaches to the problem of music information retrieval. The field of
psychoacoustics attempts to explain how our hearing system functions, including
auditory scene analysis capabilities such as how our ears group harmonics generated by a
given instrument to create the perception of a single instrument playing a given note. The
seminal work in the area of auditory scene analysis is that of Bregman [Bregman 90],
where he outlines many of the grouping rules that our hearing system uses. For example,
our auditory system tends to group together simultaneously occurring components that
are harmonically related, components that have common modulations in frequency and/or
amplitude, components that have common onsets and offsets, or components that come
from the same direction. Simultaneously occurring components that have these
characteristics will generally be perceived as coming from the same source. Further
grouping rules, such as for sequential events can be found in [Bregman 90]. Many of the
systems described in this chapter make use of psychoacoustic knowledge to enhance
signal processing techniques to extract information from audio signals. This contrasts
with the techniques used in Chapter 3 which make use of information theoretic and
redundancy reduction principles to extract information from signals in general and which
are not specifically focused on extracting musical information.
The first section of this chapter deals with previous attempts at drum transcription
systems, with greater emphasis given to systems which attempt polyphonic drum
transcription. The second section outlines work on beat tracking and rhythm analysis
systems. Both of these could be used to generate predictions of when a given drum is
likely to occur, which would be useful in automatic drum transcription as a means of
resolving ambiguities and uncertainties. The third section deals with sinusoidal modelling
and extensions to sinusoidal modelling. Sinusoidal modelling models signals as a sum of
sinusoids plus a noise residual, and so is of potential use in the removal of the effects of
pitched instruments in drum transcription systems where the drums occur in the presence
of other instruments. Section four deals with musical instrument identification, both in
the general case and in identification of percussion instruments only. Section five looks
briefly at the methods employed in polyphonic music transcription with a view to seeing
if any of the approaches could be adapted to drum transcription, while section six looks at
sound source separation algorithms such as Computational Auditory Scene Analysis.
These methods offer the potential means of dealing with mixtures of drum sounds, a
problem which has plagued previous attempts at drum transcription, as will be shown in
section one. Further sound source separation schemes based on information theoretic
principles will be dealt with in Chapter 3. Finally the relative merits and uses of these
Music Information Retrieval techniques will be summed up in the conclusion.
As noted in chapter 1, there have been very few previous attempts at polyphonic drum
transcription systems. Of those systems which have been developed there has been, in
general, a lack of systematic evaluation of the results obtained, making it difficult to
evaluate the effectiveness or otherwise of these systems.
Early work on the automatic transcription of percussive music was carried out by
Schloss and Blimes [Schloss 85], [Blimes 93]. These systems were designed to deal with
monophonic inputs, i.e. where only one note occurs at any given moment in time.
Schloss’ system was able to differentiate between several different types of conga stroke.
This was done by using the relative energy of selected portions of the spectrum. Blimes’
system used a k-Nearest Neighbour classifier to distinguish between different types of
stroke of the same drum. These early attempts at percussion transcription have since
been superseded by the systems described below, which attempt to deal with the
transcription of polyphonic percussive music. In particular, the approaches used for onset
detection of events used in these studies have been displaced by multi-band onset
detection methods. Methods to automatically distinguish different musical instruments
have also improved considerably since these works.
The first attempt at a polyphonic drum transcription system was by Goto and
Muraoka in 1994 [Goto 94]. The paper (in Japanese) describes a polyphonic drum
transcription system based on a template matching system. The system proposed attempts
to transcribe snare, kick, toms, hi-hats and cymbals from drum loops. The templates are
obtained from examples of each drum type. Characteristic frequency points are identified
for each drum and these frequency points are then used to scale the template to match the
amplitude of the actual signal at these points. To overcome interference between some
drum types, in particular between metallic drums and skinned drums, the signal is filtered
into two bands, the low-pass band having a cutoff frequency of 1 kHz, and the high-pass
band having a cut-off of 5 kHz. Two examples of transcription are presented, but there
appears to be no test results based on a larger database of drum loops.
A system for transcription and isolation of drum sounds for audio mixtures was
implemented by Sillanpää et al [Sillanpää 00]. Again, the system presented makes use of
template matching for the identification of drum type, but, as with the Goto system, a few
examples were presented but no evaluation of performance on a larger database of
examples was carried out. The drum sounds which the system was designed for were
snare, kick, toms, hi-hats, and cymbals.
The onset times of events in the signal were calculated using the algorithm
described in [Klapuri 99] and this information was used in the calculation of a metrical
grid for the signal. The grid was calculated using inter-onset intervals and greatest
common divisors. Drum detection and recognition was then carried out at every onset
point in the metrical grid.
The onset detection algorithm used was based upon the multi-band tempo
tracking system described by Scheirer [Scheirer 98] which is described in more detail in
section 2.2, but with a number of significant changes. An overview of the onset detection
system is shown in Figure 2.1. As shown below, the signal is first passed through a
filterbank. The number of bands used in the filterbank was increased to 21 filters, as
opposed to the 6 used by Scheirer. The output of each filter was then full wave rectified
and decimated to ease computations. Amplitude envelopes were then calculated by
convolving the outputs from each of the filters with a 100ms half-Hanning window,
which preserved sudden changes but masked rapid modulations. The filtering,
rectification and half-Hanning window convolution model, in a basic manner, how the
ear and basilar membrane process incoming sound waves. Once the amplitude envelopes
have been obtained, the envelopes are then passed to an onset component detection
module, as shown in Figure 2.1. Onset component detection was carried out by calculating a first-order relative difference function, which is essentially the amount of change in amplitude in relation to the signal's level. This is also equivalent to differentiating the log of the amplitude. This function is given by:
W(t) = \frac{dA(t)/dt}{A(t)} = \frac{d}{dt}\log(A(t)) \qquad (2.1)

where W(t) is the first-order relative difference function, and A(t) is the amplitude of the signal at
time t. This had two advantages: it gave more accurate determination of the onset time
than the differential, and eliminated spurious onsets due to sounds that did not
monotonically increase. Onset components were then chosen to be those onsets that were
above a set threshold in the relative difference function.
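To make this concrete, the following sketch (in Python with NumPy) computes the amplitude envelope and the relative difference function of equation (2.1) for a single band and thresholds it to obtain onset components; the decimation step is omitted, and the window length and threshold value are illustrative assumptions rather than the exact parameters of [Klapuri 99].

    import numpy as np

    def onset_components(band_signal, fs, win_ms=100, threshold=1.0):
        # Full wave rectification of one filterbank band
        rectified = np.abs(band_signal)
        # Decaying half-Hanning window of win_ms duration (normalised to unit sum)
        win = int(fs * win_ms / 1000)
        half_hann = np.hanning(2 * win)[win:]
        half_hann /= half_hann.sum()
        envelope = np.convolve(rectified, half_hann, mode="same")
        # Eq. (2.1) in discrete form: W(t) = d/dt log(A(t))
        w = np.diff(np.log(envelope + 1e-10)) * fs
        # Onset components are points where W(t) exceeds a set threshold
        return np.where(w > threshold)[0], w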
The intensity of the onset components was then estimated by multiplying the
maximum of the first-order difference function (as opposed to the first-order relative
difference function) by the filter band's center frequency. Components that were closer
than 50ms to a more intense component were then dropped out.
Following on from this, the onset components from the separate bands were then
combined to give the onsets of the overall signal. This was carried out using a simplified
version of Moore’s model [Moore 97] for loudness to obtain loudness estimates for each
band. Onsets within 50ms of each other were summed to give estimates of overall onset
loudness. Overall onsets that were below a global threshold were eliminated, as were
onsets within 50ms of a louder onset candidate.
The onset detection algorithm was found to be a robust onset detection system for
most kinds of music. The exception was any kind of symphony orchestra performance.
This was thought to be due to the inability to follow individual instruments, and the
inability of the system to cope with strong amplitude modulations as found in symphonic
music.
Drum recognition was carried out using pattern matching using several models to
represent sub-classes within a given type of drum, for example snare drum or bass drum.
The drums were modeled by calculating the short time energy in each Bark scale critical
band. The Bark band z corresponding to the frequency f in kHz is estimated from:
z(f) = 13\arctan(0.76 f) + 3.5\arctan\left(\left(\frac{f}{7.5}\right)^2\right) \qquad (2.2)
The use of Bark critical bands was motivated by the fact that for stationary noise-like
signals the ear is not sensitive to variations of energy within each band. It was therefore
assumed that knowing the short-time energy at each Bark band was sufficient to model
the drum at a given time instant. The short time energy in each band was calculated on a
frame by frame basis, where the time resolution was on a logarithmic scale. The time
instants (in ms) for sampling on a logarithmic scale were calculated from:
t_k = 10\exp(0.36 k) - 10 \qquad (2.3)
The reference point for calculation of these times is the onset time of the event of
interest. This was done to give greater emphasis to the start of the sounds, and so provide
a degree of robustness against varying duration of sounds.
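The two mappings of equations (2.2) and (2.3) are straightforward to compute; a minimal sketch follows (the number of frames is an arbitrary choice here, and the short-time energy analysis itself is omitted).

    import numpy as np

    def bark_band(f_khz):
        # Eq. (2.2): Bark band z for a frequency f given in kHz
        return 13 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5) ** 2)

    def log_sample_times(num_frames):
        # Eq. (2.3): sampling instants in ms, measured from the onset time
        k = np.arange(num_frames)
        return 10 * np.exp(0.36 * k) - 10

    print(bark_band(1.0))       # band index at 1 kHz
    print(log_sample_times(8))  # 0.0, 4.3, 10.5, ... ms after the onset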
The models were created by obtaining the Bark frequency energies in decibels at
the time frames as described above for each training sample. These were combined with
the overall energy at each time frame to create a feature vector. Fuzzy k-means clustering
was then used to obtain four cluster centers per drum type. Four centers were used in an
attempt to overcome the variations in timbre found within each drum type.
The resulting models were matched to the features of the input signal. Weighted
least squares error fitting was carried out for each drum type with emphasis being placed
on bands where the model is a strong match. This was to aid robustness where
overlapping of sounds occurred. The fitting used was:
E_k = \sum_i \left\{ M_k(i)\left[ Y(i) - M_k(i) \right]^2 \right\} \qquad (2.4)
where Y(i) is the feature vector of the mixture signal, M_k(i) is the feature vector of drum type k, and i runs through the values of the feature vectors. The overall energy of each matched and scaled model is given by W_k. The goodness of fit for each drum type was
then given by:
G_k = W_k - 0.5 E_k \qquad (2.5)
The goodness of fit measures are then normalised and scaled to yield probabilities for
each drum type.
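A sketch of the model matching of equations (2.4) and (2.5) follows; the prior scaling of each model to the input, and the exact normalisation used to turn goodness of fit values into probabilities, are simplifying assumptions here.

    import numpy as np

    def goodness_of_fit(Y, models):
        # Y: feature vector of the mixture; models: dict of drum type -> scaled feature vector
        G = {}
        for name, M in models.items():
            E = np.sum(M * (Y - M) ** 2)  # eq. (2.4): error weighted by model strength
            W = np.sum(M)                 # overall energy of the matched, scaled model
            G[name] = W - 0.5 * E         # eq. (2.5)
        return G

    def to_probabilities(G):
        # Simplified normalisation of goodness of fit values to pseudo-probabilities
        g = np.array(list(G.values()), dtype=float)
        g -= g.min()
        p = g / g.sum() if g.sum() > 0 else np.full(len(g), 1.0 / len(g))
        return dict(zip(G.keys(), p))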
Temporal prediction was used as an aid to resolve potentially ambiguous
situations. It predicted the likelihood of a drum appearing in a given frame given its
appearances in earlier frames. The temporally predicted probability of a drum sound to
appear at frame k was calculated as follows. First the most prominent local periodicity,
and its probability of occurrence, for a sound s is found by calculating the product of
probabilities (obtained from the goodness of fit measures) of the sound s in every hth
surrounding frames:
K
[ ]
Ppred (s, k0 ) = max h ∏ {1 − w(k ) 1 − Pgf (s, k0 + hk ) } (2.6)
k = − K
where P_{pred}(s, k_0) is the probability of sound s occurring every h frames, P_{gf}(s, k_0) is the probability obtained from the goodness of fit measure, K = 3, and w(k) is a windowing function with w(k) = [3, 7, 10, 0, 10, 7, 3]/10. The overall effective probability P_{eff}(s, k_0) is
then given by equation (2.7), in which this prediction is weighted by a masking term U(s_0, k), where

U(s_0, k) = \sum_{s, s \neq s_0} \left[ P_{gf}(s, k) S_{s,s_0} \right] \qquad (2.8)
where S_{s,s_0} is a value representing the similarity of sounds s and s_0, and so the ability of sound s to mask sound s_0. This means that the prediction probability of s_0 is dependent on
the likelihood of the sound being masked.
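The periodicity search of equation (2.6) can be sketched as follows; the range of candidate periods h is an assumption, as the text does not specify it.

    import numpy as np

    W_WIN = np.array([3, 7, 10, 0, 10, 7, 3]) / 10.0  # w(k) for k = -3..3
    K = 3

    def temporal_prediction(P_gf, k0, max_period):
        # Eq. (2.6): P_gf is a 1-D array of goodness-of-fit probabilities for one sound
        best = 0.0
        for h in range(1, max_period + 1):
            prod = 1.0
            for k in range(-K, K + 1):
                idx = k0 + h * k
                if 0 <= idx < len(P_gf):
                    prod *= 1.0 - W_WIN[k + K] * (1.0 - P_gf[idx])
            best = max(best, prod)  # keep the most prominent local periodicity
        return best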
A priori probabilities were also used to help resolve ambiguity when the identity
of a drum was in doubt. This made use of the fact that certain drum sounds are more
likely than others, for example snare drums occur with greater frequency than tom-toms
in popular music, and that tom-toms often occur in a sequence of a few sounds, such as
during a drum fill. Context was also taken into account when using a priori knowledge.
This was done by giving higher probabilistic priority to sounds that had already been
detected in the signal over sources that had not yet been detected.
To decide the number of simultaneous sources at any given moment the matched
and scaled sources are then subtracted from the mixture spectrum in order of descending
probability one by one in an iterative manner. The criterion used to stop the iteration and keep the N selected models is given by:
L(N) = -\log\left( \frac{\sum_i M_N(i) Y(i)}{\sum_i Y(i) Y(i)} \right) + \alpha N \qquad (2.9)
where M_N(i) is the sum of the selected and scaled source models M_1, …, M_N,
Y(i) is the feature vector of the mixture signal and the value of α is obtained from training
on examples of mixtures of drum sounds.
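A sketch of this iterative subtraction follows; treating equation (2.9) as a quantity to be minimised, and stopping as soon as it no longer decreases, is an assumption here, as the text does not spell out the stopping rule.

    import numpy as np

    def select_sources(Y, scaled_models, alpha):
        # scaled_models: matched and scaled source models, ordered by descending probability
        selected, M_sum, best_L = [], np.zeros_like(Y), np.inf
        for M in scaled_models:
            trial = M_sum + M
            # Eq. (2.9): fit of the summed models against the mixture, plus penalty alpha*N
            L = -np.log(np.sum(trial * Y) / np.sum(Y * Y)) + alpha * (len(selected) + 1)
            if L >= best_L:
                break  # adding a further model no longer improves the criterion
            best_L, M_sum = L, trial
            selected.append(M)
        return selected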
Three types of test were carried out on the system. The first was straight
recognition of random drum mixtures. Here it was reported that recognition of single
sounds worked well with few errors, but that correct recognition of increasing numbers of
drums proved difficult, with confusion becoming commonplace for mixtures of three or
more drums. No percentage evaluation of performance was presented in [Sillanpää 00]
though a technical report on an earlier version of the system [Sillanpää 00a] did give
results for this type of test which give some indication of the performance of the system.
For a single drum in isolation correct detection was achieved 87% of the time. For
mixtures of two drums both drums were correctly identified 48% of the time, with at least
one of the drums always being correctly identified. Finally for mixtures of three drums,
all three were correctly identified only 8% of the time, with two of the drums detected
correctly 60% of the time. The system always managed to identify at least one of the
drums correctly. The system in [Sillanpää 00a] also had 32% detection of an extra drum
when only a single drum was present, though this problem appears to have been solved in
[Sillanpää 00].
The second test involved the transcription of drum loops. It was observed that the
use of temporal prediction was able to restore a large number of masked sounds that had
been undetected or misclassified in the initial analysis of the drum mixtures. One
example was shown. However, no overall performance evaluation of the testing database
was presented.
The third type of test performed was the transcription of drums from excerpts
from popular music. In this case the excerpts were preprocessed by analysing each
excerpt using a sinusoids plus noise spectral model. This model is described in detail in
section 2.3. The sinusoids were assumed to contain the harmonic elements in the signal
and were subtracted from the spectrum to leave the noise residual. This was then assumed
to contain the drum sounds in the signal. Although not strictly true this approximation
was found to remove enough of the influence of pitched instruments to allow detection of
the drum sounds to be attempted. Some energy from the drum sounds was also removed,
particularly from the toms, where approximately half of the energy can be considered
periodic. However, enough energy was retained for detection to proceed. Detection errors were found to be greater with real musical signals than with the rhythmic patterns, and again no systematic evaluation of results was presented.
The results discussed indicated the benefit of top down processing in rhythm and
drum sound identification, but the system still had problems identifying mixtures of drum
sounds. Interference between drum sounds caused identification problems, and the
difference in levels between different drum sounds was not taken into account, for
example, the fact that the snare drum is generally much louder than the hi-hats. The lack
of systematic evaluation makes it difficult to determine how effective in practice this
system was, and Sillanpää et al concluded that spectral pattern recognition on its own was
not sufficient for robust recognition of sound mixtures.
Attempts were also made at drum transcription using a cross-correlation approach
[Jørgensen 01]. Samples of snare, kick drum, tom-toms, hi-hats and cymbals were cross-
correlated with recordings of drum loops in an attempt to identify the drums present. The
system appeared to work reasonably well on snare and bass drum but did not work well
on other types of drum. A large number of false positives were reported with the system
and again no systematic evaluation was made of the overall performance.
A recent attempt at transcribing polyphonic drum signals made use of acoustic
models and N-grams [Paulus 03]. The acoustic models were used to model low-level
properties of polyphonic drum signals, while higher level knowledge was incorporated by
means of the N-grams, which modelled the likelihood of a given set of events occurring
in succession. The system described attempted to transcribe mixtures of seven classes of
drum type. These were snares, bass drums, tom-toms, hi-hats, cymbals, ride cymbals and
percussion instruments. In this case 'percussion instruments' is taken to mean all percussion sounds not contained in the other classes. Each drum type was allocated a
symbol, which together make up an alphabet Σ. These symbols can then be combined to
generate ‘words’. A ‘word’ is interpreted as representing a set of drum types that are
played simultaneously at a given moment in time, and a word which contains no symbols
is interpreted as silence. For a given number of symbols, n, the total number of possible words is 2^n. In this case, where n = 7, this results in a total vocabulary, V, of 128 words.
In order to analyse a given percussive music performance, the tatum, or smallest metrical unit or pulse length, of the performance was determined. Methods for
determining the tatum are discussed below in Section 2.2. Once the tatum has been
determined, a grid of tatum pulses is then aligned with the performance and each drum
event is associated with the nearest grid point. Events which are assigned to the same grid
point are taken as being simultaneous, and the percussive music performance can then be
described by a string of words, with one word per grid point. Grid points with no
associated drum event result in the generation of an empty word. Prior probabilities for
each word can then be estimated from a database of rhythm sequences by counting the
occurrences of a given word and dividing by the total number of words in the database.
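The word representation and the estimation of word priors can be sketched as follows; the bitmask encoding, the symbol names and the example sequence are all illustrative choices rather than details taken from [Paulus 03].

    from collections import Counter

    DRUMS = ["snare", "bass", "tom", "hihat", "cymbal", "ride", "perc"]  # alphabet, n = 7

    def to_word(drums_at_grid_point):
        # Encode the set of drums at one tatum grid point as a 7-bit word (0 = silence)
        word = 0
        for d in drums_at_grid_point:
            word |= 1 << DRUMS.index(d)
        return word

    def word_priors(sequences):
        # P(word) estimated by counting occurrences over a database of rhythm sequences
        counts = Counter(w for seq in sequences for w in seq)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    loop = [to_word(g) for g in [{"bass", "hihat"}, {"hihat"}, {"snare", "hihat"}, set()]]
    print(word_priors([loop]))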
An N-gram uses the N-1 previous words to generate a prediction of what the next
word will be. Using the notation in [Paulus 03], let w_1^K represent a string of words w_1, w_2, …, w_K. The probability of a word sequence can then be calculated as:

P(w_1^K) = \prod_{k=1}^{K} P(w_k \mid w_{k-N+1}^{k-1}) \qquad (2.10)

where P(w_k \mid w_{k-N+1}^{k-1}) is the probability of the current word given the N-1 previous words. This probability is estimated by counting occurrences of word sequences in the training data:

P(w_k \mid w_{k-N+1}^{k-1}) = \frac{C(w_{k-N+1}^{k})}{C(w_{k-N+1}^{k-1})} \qquad (2.11)

where C(\cdot) denotes the number of times the given word sequence occurs.
A periodic N-gram instead builds its context from words located whole periods earlier, i.e. at w_{k-(N-1)L}^{k-L} taken at intervals of L. In this case, the period L was set to be the bar length of the performance.
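A count-based estimate of the conditional probabilities of equation (2.11) can be sketched as below; the periodic variant differs only in which grid points supply the context.

    from collections import Counter

    def train_ngram(word_sequences, N):
        # Estimate P(w_k | previous N-1 words) from counts, eq. (2.11)
        ctx_counts, seq_counts = Counter(), Counter()
        for seq in word_sequences:
            for k in range(N - 1, len(seq)):
                ctx = tuple(seq[k - N + 1:k])
                ctx_counts[ctx] += 1
                seq_counts[ctx + (seq[k],)] += 1
        def prob(word, context):
            # A periodic N-gram would instead build the context from words
            # located L, 2L, ... grid points earlier
            ctx = tuple(context[-(N - 1):])
            return seq_counts[ctx + (word,)] / ctx_counts[ctx] if ctx_counts[ctx] else 0.0
        return prob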
The probability of a word can also be decomposed into the probabilities of the individual symbols it contains:

P(w_k \mid w_1^{k-1}) = \prod_{s_n \in w_k} P(s_n \mid w_1^{k-1}) \prod_{s_m \notin w_k} \left(1 - P(s_m \mid w_1^{k-1})\right) \qquad (2.12)
The transcription error rate was defined as:

e = \frac{\sum_i \left[ \aleph(w_i^R \setminus w_i^T) + \max\left(0, \aleph(w_i^T) - \aleph(w_i^R)\right) \right]}{\sum_i \aleph(w_i^R)} \qquad (2.13)

where i runs through all the test grid points, w_i^R is the set of symbols present in the actual word at point i in the grid, w_i^T is the set of symbols found in the word obtained from the transcription system and \aleph denotes the cardinality of the set. Upon addition of the word priors to the system this error rate dropped considerably to 49.5%. The further addition of
the various N-gram models all gave some degree of improvement, the lowest error rate of
45.7% occurring upon addition of a symbol N-gram with N equal to 10.
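On the reading of equation (2.13) given above, the error rate counts missed reference symbols plus surplus transcribed symbols at each grid point, as in the following sketch (the example sets are invented).

    def error_rate(reference, transcribed):
        # reference, transcribed: lists of symbol sets, one set per tatum grid point
        num = den = 0
        for ref, hyp in zip(reference, transcribed):
            num += len(ref - hyp)               # reference symbols that were not transcribed
            num += max(0, len(hyp) - len(ref))  # surplus transcribed symbols
            den += len(ref)
        return num / den if den else 0.0

    print(error_rate([{"snare", "hihat"}, {"bass"}],
                     [{"snare"}, {"bass", "tom"}]))  # 2 errors / 3 symbols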
The system described represents an attempt to overcome the problem of
dealing with mixtures of drums by explicitly modelling the various combinations of drum
types possible. However, given that there are large variations in timbre within any
drum type even before attempting to deal with mixtures of drum types, it is not surprising
that the error rate was high for the system using only the drum mixture models. The
addition of prior probabilities for each possible word did much to improve the situation,
and shows the utility of incorporating prior knowledge into transcription systems. The
use of N-grams, while reducing the error rate, did not dramatically improve the
transcription results.
As can be seen from the above there has been very little work done specifically on
the task of automatic drum transcription, and what little has been done shows, in most cases, a lack of systematic evaluation of the performance of the drum transcription
systems proposed. Only the work of Paulus and, to a lesser extent, Sillanpää presents any
results obtained from testing. As a result, the problem of automatic drum transcription
lags considerably behind that of the automatic transcription of pitched instruments where
considerable effort has been expended by numerous researchers down through the years.
Zils et al proposed a system for the automatic extraction of drum tracks from polyphonic
music signals [Zils 02]. The method was based on an analysis by synthesis technique.
Firstly, simple synthetic percussive sounds are generated. These consist of a low-pass filtered impulse response and a band-pass filtered impulse response, which are very simple approximations to the kick drum and the snare drum respectively.
A correlation function is then computed between the signal S(t), and the synthetic
percussive sound I(t):
Cor(\partial) = \sum_{t=1}^{N_I} S(t) I(t - \partial) \qquad (2.14)

where N_I is the number of samples in the synthetic percussive sound and Cor(\partial) is defined for \partial \in [1, N_s], where N_s is the number of samples in S(t). The correlation technique
was found to be very sensitive to amplitude and so a number of peak quality measures
were introduced to eliminate spurious peaks.
Firstly the proximity of a peak to the position of a peak in signal energy was used
to determine if the correlation peak corresponds with a percussive peak. Secondly the
amplitude of the peak in the correlation function was observed and low peaks discarded.
Finally, the relative local energy in the correlation function was measured from:

Q(Cor, t) = \frac{Cor(t)^2}{\frac{1}{width} \sum_{i = t - width/2}^{t + width/2} Cor(i)^2} \qquad (2.15)
where width is the number of samples chosen over which to evaluate the local energy.
These measures eliminate a number of incorrect peaks.
However, due to the simplicity of the initial model there may still be a number of
peaks that do not correspond to correct occurrences, or there may be a number of
undetected events. To overcome this a new percussive sound is generated based on the
results of the initial peak detection. A simplified approximation to the re-synthesis
(omitting the necessary centering and phase synchronisation of occurrences of each drum)
is given by:
newI(t) = \frac{1}{2}\left( I(t) + \frac{1}{npeaks} \sum_{i=1}^{npeaks} S(peakposition(i) + t) \right) \qquad (2.16)
where npeaks is the number of peaks detected. This results in a new percussive sound
which can then be correlated with S(t) to yield improved estimates of the occurrences of a
drum sound. The process is then repeated until the peaks do not change from iteration to
iteration, or until a fixed number of iterations have been completed.
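The analysis by synthesis loop of equations (2.14) to (2.16) can be sketched as follows; simple selection of the strongest correlation lags stands in for the peak quality measures described above, and the iteration count and peak count are arbitrary choices.

    import numpy as np

    def refine_drum_model(S, I, num_iters=5, num_peaks=20):
        peaks = []
        for _ in range(num_iters):
            cor = np.correlate(S, I, mode="valid")  # eq. (2.14)
            peaks = np.argsort(cor)[-num_peaks:]    # strongest correlation lags
            segments = [S[p:p + len(I)] for p in peaks]
            # Eq. (2.16): average the old model with the mean of the detected segments
            I = 0.5 * (I + np.mean(segments, axis=0))
        return I, np.sort(peaks)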
This process is carried out for both kick and snare drum. To avoid problems due
to simultaneous occurrences of the two drums, priority is given to the bass drum. As a result, analysis to determine the presence of a bass drum is carried out first and then further analysis is carried out to obtain snare drum occurrences that do not conflict with the previously detected bass drum occurrences. This means that the system is limited to
monophonic transcription of snare and kick drum in the presence of pitched instruments.
Another problem with the approach lies in the fact that it does not attempt to deal
with the simultaneous occurrence of drum sounds. By giving priority to the bass drum it
will miss any snares that overlap with the bass drum, resulting in incorrect transcription,
and as noted previously this effectively limits the system to monophonic transcription of
snare and kick drums. Despite this, the system is noteworthy for being one of the few
attempts to date to transcribe drums in the presence of other instruments.
A very recent attempt at transcribing drums in the presence of pitched instruments
was made by Virtanen [Virtanen 03]. However, as the system described makes use of an
information theoretic approach, it was felt that the system would best be described in the
context of other information theoretic approaches. Details of this system can be found in
Section 3.4 of chapter 3.
Beat tracking deals with identifying the regular pulse or beat of a piece of music, while
rhythm recognition takes this task a step further by attempting to identify the accented
pulses that result in the perceived rhythm of a piece of music. This review concentrates
mainly on systems that use audio as an input to the system as opposed to those that use
symbolic data such as MIDI. Examples of systems that carry out rhythm recognition on
symbolic data can be found in the work of Rosenthal [Rosenthal 92] and that of Cemgil
[Cemgil 00]. While being separate and distinct tasks from the problem of drum
transcription, beat tracking and rhythm recognition can be used as an aid to the process of
drum transcription. The information obtained from beat tracking and rhythm recognition
can potentially be used to resolve ambiguities and correct errors in the system by making
predictions about future events in the audio signal. Identification of the downbeat in a
piece of music can also be used to aid in the process of drum identification. For example
in pop music, a bass drum generally plays on the downbeat and a snare on the upbeat.
The use of beat tracking and rhythm recognition can make it easier to incorporate musical
knowledge, such as the examples given above, into the overall system.
The beat tracking system proposed by Scheirer [Scheirer 98] uses a bank of six
band-pass filters to process the incoming signal. For each band the derivative of the
amplitude envelope is obtained. The resultant derivatives are then each passed through a
bank of parallel comb filters. Each of these comb filters is tuned to resonate at a
particular frequency and will phase lock with an incoming signal of that frequency. The
phase locked filters are then tabulated for each of the band-pass filters and the results
summed across the entire frequency band to obtain a tempo estimate. The phase of the
rhythmic signal is obtained from the comb filters and is used to identify the 'downbeat' of
the rhythm.
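A much simplified sketch of this comb filter tempo analysis is given below; the feedback gain, tempo range and exhaustive search over candidate tempi are assumptions made for illustration, not Scheirer's actual parameters.

    import numpy as np

    def tempo_estimate(envelope_derivs, fs, bpm_range=(60, 180), alpha=0.8):
        # envelope_derivs: one envelope derivative signal per band-pass filter,
        # sampled at rate fs; each candidate tempo gets a resonant comb filter
        best_bpm, best_energy = None, -np.inf
        for bpm in range(bpm_range[0], bpm_range[1] + 1):
            delay = int(round(fs * 60.0 / bpm))  # samples per beat at this tempo
            energy = 0.0
            for d in envelope_derivs:
                y = np.zeros_like(d)
                for t in range(delay, len(d)):
                    y[t] = (1 - alpha) * d[t] + alpha * y[t - delay]  # comb resonator
                energy += np.sum(y ** 2)  # a phase-locked filter resonates strongly
            if energy > best_energy:
                best_bpm, best_energy = bpm, energy
        return best_bpm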
The system was tested on 60 samples of music from a wide variety of musical
styles, both with and without percussion. The correct beat was tracked in 41 cases, was
approximately right in 11 cases, and wrong in eight cases. The system was also able to
respond to tempo modulations in music. In comparison with human listeners the
algorithm was found to be closer in most circumstances to a set of previously marked
beats. The regularity of the beats was also found to be more accurate than that of human
listeners.
Smith [Smith 99] used wavelets to carry out analysis of rhythmic signals. The
rhythmic signal was viewed as an amplitude modulation of the auditory frequency
ranges. This amplitude modulation was made explicit by using the signal energy to
establish the modulation independently of the audio signal. Wavelet analysis was then
carried out on the signal energy, which was sampled at a rate of 400 Hz, making the rhythms at different time scales explicit. The analysis was used to reveal accents, tempo changes and rubato, and was then used to generate a rhythm that "tapped along" to a given input. The system was capable of generating rhythms that matched those of the input signal.
A real time system for beat tracking was implemented by Goto and Muraoka
[Goto 94a]. The system ran in real-time on a parallel computer and was capable of
tracking beats in pop songs with drums. The system made use of basic musical
knowledge, such as the fact that a bass drum can be expected on the downbeat (beats 1
and 3) and the snare drum on beats 2 and 4. The real time signal being input was
converted to the frequency domain where onset components were extracted by detecting
frequency components whose power had been increasing. The onset time was then found
by carrying out peak finding on the sum of the degree of onset for each frequency
detected. Multiple onset time finders, each with different sensitivities and frequency
ranges, were used. The onset times from each of these were passed to an associated pair
of agents.
The type of beat was detected by finding peaks along the frequency axis and
forming a histogram from them. The lowest peak on the histogram gives the
characteristic frequency of the bass drum, while the largest peak above the bass drum
peak is chosen to represent the snare drum.
Once the information from the frequency analysis is passed to the agent pairs,
each pair creates hypotheses for the predicted next beat time, the beat type, the current
inter-beat interval, and the reliability of these hypotheses. The agent pairs can alter the
parameters of their associated onset finders depending on the reliability of the estimates.
The adjustable parameters for each agent are sensitivity, frequency range and
histogramming strategy, which decides how the inter-onset intervals are used. These
parameters are adjusted if reliability remains low.
The next beat is predicted by adding the current inter-beat interval (IBI) to the
current beat time. The IBI is taken to be the most frequent interval between onsets. The
choice of IBI is weighted by the reliability of these intervals. All the hypotheses
generated by agents are grouped according to beat time and IBI. The group reliability is
taken as the sum of the reliabilities of the hypotheses in the group. The most reliable
hypothesis in the most reliable group is then chosen as being the correct hypothesis to
track the beat. The system was tested using the initial two minutes of 30 songs. The tempi
of the songs ranged from 78 bpm to 168 bpm, and the system tracked the correct beat in 27
out of the 30 songs.
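The core of this prediction step can be illustrated with a short sketch. The following Python fragment (an illustration, not Goto and Muraoka's implementation) estimates the IBI as the most frequent interval between onsets and predicts the next beat by adding it to the current beat time; the bin width, the interval cap and the single-agent simplification are assumptions made here for brevity.

```python
import numpy as np

def estimate_ibi(onset_times, bin_width=0.01, max_interval=2.0):
    """Estimate the inter-beat interval (IBI) as the most frequent
    interval between detected onsets (a simplified, single-agent sketch)."""
    onsets = np.asarray(onset_times)
    # Collect intervals between all pairs of onsets within max_interval.
    intervals = []
    for i in range(len(onsets)):
        for j in range(i + 1, len(onsets)):
            gap = onsets[j] - onsets[i]
            if gap <= max_interval:
                intervals.append(gap)
    # Histogram the intervals and take the most frequent one as the IBI.
    bins = np.arange(0, max_interval + bin_width, bin_width)
    counts, edges = np.histogram(intervals, bins=bins)
    return edges[np.argmax(counts)] + bin_width / 2

def predict_next_beat(current_beat_time, ibi):
    """The next beat is predicted by adding the current IBI to the
    current beat time."""
    return current_beat_time + ibi

# Example: onsets of a roughly steady 120 bpm pattern (0.5 s IBI).
onsets = [0.0, 0.5, 1.0, 1.5, 2.01, 2.5]
print(predict_next_beat(2.5, estimate_ibi(onsets)))
```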
Further improvements in the system were described in [Goto 95]. Snare drum
identification was carried out by looking for noise components that were widely
distributed along the frequency axis. Increased use of musical knowledge was
incorporated by the addition of a bank of bass and snare drum patterns commonly found
in music. These patterns were then matched to the drum pattern obtained from the input
signal and this was used to determine the beat type and the corresponding note value. The
IBI for each agent was then estimated using auto- and cross-correlation of detected onset
times in order to predict the next beat. Lastly, to inhibit double-time and half-time tempo errors, in each
agent pair one agent attempts to track beats at a relatively high tempo, and the other at a
low tempo. These two agents then try to inhibit each other. The system correctly tracked
the beat in 42 of the 44 musical excerpts with which it was tested.
Further improvements in [Goto 98] include the increased use of musical
knowledge. This enabled the system to be extended to drumless music. The system was
now capable of tracking beats at a number of different levels, consisting of the quarter
note level, the half note level, and the measure (or bar) level. The added musical
knowledge can be divided into three groups, corresponding to three kinds of musical element.
Musical knowledge suggests that onset times tend to coincide with beat times, and that a
frequent inter-onset interval is likely to be the inter-beat interval. Similarly, chord
changes are more likely to occur at beats than in between, are more likely to occur at half
note times than other beat times, and are more likely to occur at the beginning of
measures than at other half note times. Chord changes were recognised by changes in the
dominant frequency components and their overtones in the overall sound. The method
used to carry this out is described in detail in [Goto 97]. The bank of drum patterns
included was further leveraged for information by noting that a recognised drum pattern
has the appropriate inter-beat interval, and the start of the drum pattern is an indication of
the position of a half note within a given bar of music. In tests, the system obtained at
least an 86.7% correct recognition rate at each level of beat tracking, indicating
robustness for beat tracking in music both with and without drum sounds.
McAuley [McAuley 95] used an adaptive oscillator to track beats and tempo. The
adaptive oscillator has a periodic activation function whose period is modified by the
oscillator's output. This means that over time the oscillator's period will adjust to match
that of the music. The oscillator is also phase coupled with the input signal by resetting
the phase each time an input to the oscillator exceeds a given threshold. This effectively
is a measure of note onset. The oscillator retains a memory of the phase at which
previous resets occurred and this is used as the oscillator’s output. Because the oscillator
changes relatively slowly the beat tracking is protected from small variations in both
tempo and phase, allowing the system to track beats accurately. The oscillator correctly
tracked 80% of the test patterns when no variations were present, and achieved similar
results when the onset times of the input patterns were allowed to vary by up to 10%. The
system was not tested on real audio; instead, trains of pulses were used to generate the rhythms.
Apart from the beat, another measure which has been used for tracking of musical
signals is the tatum. This is defined as the smallest metrical unit or pulse length of the
signal. All other metrical levels such as the beat will be integer multiples of this level.
Seppänen describes a system which attempts to analyse musical signals at the tatum level
[Seppänen 01]. Sound onsets are detected in a manner similar to that of [Klapuri 99],
which has previously been described in section 2.1.1, with onsets being detected in a
number of frequency bands. Inter-onset intervals are then calculated for all pairs of onsets
that fall within a given time window of each other. If there are no random deviations in
the inter-onset intervals then the tatum can be estimated by simply finding the greatest
common divisor of all the inter-onset intervals.
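In the idealised deviation-free case this GCD estimate is straightforward to compute. The sketch below (an illustration, not Seppänen's implementation) quantises the inter-onset intervals to a 1 ms grid so that an integer GCD can be taken; the grid size and function name are assumptions.

```python
import numpy as np
from math import gcd
from functools import reduce

def tatum_from_iois(onset_times, grid=0.001):
    """Estimate the tatum as the greatest common divisor of all
    inter-onset intervals, assuming no timing deviations.
    Intervals are quantised to a 1 ms grid so an integer GCD applies."""
    onsets = np.sort(np.asarray(onset_times))
    iois = np.diff(onsets)
    ticks = [int(t) for t in np.round(iois / grid)]  # intervals in grid units
    return reduce(gcd, ticks) * grid                 # back to seconds

# Onsets spaced at multiples of 0.25 s -> tatum of 0.25 s.
print(tatum_from_iois([0.0, 0.25, 0.75, 1.0, 1.75]))
```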
However, in most musical examples there will be deviations in the inter-onset
intervals, and in many cases there will be variations in tempo as well. To allow for such
changes in tempo, a time-varying histogram of the inter-onset intervals is calculated.
Further, to allow for the fact that in most musical performances there will be deviations in
the timing of note onsets, Seppänen makes use of a remainder error function. This
remainder error function is a function of the time period and the inter-onset intervals and
the local minima of this remainder error function represent possible candidates for the
tatum. To eliminate spurious local minima of this function, a threshold is set, and the
tatum is determined to be the most prominent local minimum below this threshold. The
system performed reasonably well in cases where there was a clearly defined regular
rhythm, but performed less well in cases such as orchestral classical music where the
rhythm is less clearly defined.
While beat tracking has been carried out on real audio in a number of systems,
rhythm detection/analysis has still to be adequately demonstrated on real audio. The
systems presented above that work on real audio all make use of some form of
multi-band approach to beat tracking, and in many ways the problem of beat tracking on
audio signals overlaps with that of note onset detection. As noted previously the detection
of upbeats and downbeats in a signal would be of use in automatically transcribing drums
and is of potential use in resolving ambiguities in the transcription process.
This section deals with signal analysis using sinusoidal modelling and its extensions. The
motivation for using sinusoidal modelling as an analysis technique is that drum
instruments are noise-based instruments with no clear pitch associated with them,
whereas instruments such as guitars and pianos have associated pitches. Because these
instruments are pitched, they are mainly composed of harmonic partials that can be
successfully modelled as sinusoids. Sinusoidal modelling can then be used to extract
these pitched components from the original signal, leaving behind a signal that contains
mainly the drum sounds, as well as some noise associated with the pitched instruments.
Sillanpää et al observed that this is a valid assumption, with the drums being louder
relative to the other elements in the signal after removal of the sinusoids [Sillanpää 00].
Once the sinusoids have been removed, the remaining noise signal can be further
analysed to model the initial transients, as well as the residual noise spectrum. The signal
model for sinusoidal modelling, as well as transient and residual modelling, is described
below.
[Figure: overview of the sinusoidal modelling process — signal → Short Time Fourier Transform → spectrum → peak detection and parameter estimation → sinusoidal peaks → peak continuation → partial tracks → synthesis of sinusoids → synthesised sinusoids]
Once these parameters have been estimated for the peaks of each frame, these
peaks are then connected to form partial tracks using a peak continuation algorithm. This
algorithm tries to find appropriate continuations for existing partial tracks from the peaks
of the next frame. Once the entire signal has been processed by the peak continuation
algorithm, the partial tracks contain the necessary information for re-synthesis of the
sinusoids. The sinusoids are then synthesised by interpolation of the partial track
parameters, with cubic interpolation used to ensure smooth phase, and the summation of
the resulting waveforms in the time domain. The noise residual is then obtained by
subtracting the synthesised sinusoids from the original signal. The resulting noise signal
is then represented either as time-varying filtered white noise, as the short-time energies
within certain frequency bands, or by using Linear Predictive Coding (LPC) based
waveform coding techniques.
Since the initial implementation of sinusoidal modelling by McAulay & Quatieri
[McAulay 86] in 1986 there have been numerous variations on the basic sinusoidal
model. The most important of these was the concept of the noise residual, introduced by
Serra [Serra 89] to create what has become the standard model for sinusoidal modelling.
These variations attempt to improve aspects of sinusoidal
modelling and are discussed in the following sections.
The first step in sinusoidal modelling is the detection of peaks that represent sinusoids in
the spectrum. This is a crucial step in that only peaks that have been detected can be re-
synthesised. In this section issues related to detection of peaks and methods of detecting
peaks are discussed.
There are a number of problems related to the detection of peaks, most of which
are related to the length of the analysis window used. A short window is required to allow
the detection of rapid changes in the signal, but a long window is necessary to estimate
accurately the frequencies of the sinusoids, especially low frequency sinusoids, and to
distinguish between spectrally close sinusoids. This is a result of the time-frequency
resolution trade-off associated with the STFT, where increased time resolution means
poorer frequency resolution and vice-versa. Attempts to overcome this problem have
been made by using such transforms as the constant-Q transform (CQT), where the
window length is inversely proportional to the frequency, and the frequency coefficients
are spaced on a log scale, thus optimising the time-frequency trade-off in each frequency
band [Brown 92]. However, it has been observed in [Klapuri 98] that practical
implementations of the CQT combine several coefficients of a Fast Fourier Transform
(FFT), so there is no real gain compared to using several FFTs of different resolutions.
As a result, a number of schemes have made use of
several FFTs with different window lengths at different frequency bands [Klapuri 98],
[Virtanen 01].
The presence of a sinusoid is indicated by a peak in the magnitude of the FFT,
and the simplest way to detect sinusoids in a signal is to take a fixed number of peaks
from the magnitude of the FFT. However, this is not practical for analytical purposes,
where taking a fixed number of peaks can cause problems. Taking too large a number of
peaks would result in the detection of peaks which are caused by noise, instead of being
caused by sinusoids. Conversely taking too small a number of peaks could result in some
peaks due to sinusoids not being detected. This is particularly true for polyphonic signals,
where there will be a large number of peaks. Therefore a natural improvement upon
taking a fixed number of peaks is the use of a threshold above which the peak is regarded
as a sinusoid. However, peaks due to noise can still be detected using this method, and the
threshold has to be chosen carefully, as the amplitudes of the partials of natural harmonic
sounds tend to fall with increasing frequency.
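A minimal sketch of this fixed-threshold method is given below; the Hann window and the -60 dB threshold are illustrative assumptions rather than values taken from the systems described.

```python
import numpy as np

def detect_sinusoid_peaks(frame, threshold_db=-60.0):
    """Pick spectral peaks above a fixed magnitude threshold as
    candidate sinusoids (the simple fixed-threshold method)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    mag_db = 20 * np.log10(spectrum + 1e-12)
    mag_db -= mag_db.max()                     # normalise to 0 dB peak
    peaks = []
    for k in range(1, len(mag_db) - 1):
        # A peak is a local maximum of the magnitude spectrum...
        if mag_db[k] > mag_db[k - 1] and mag_db[k] > mag_db[k + 1]:
            # ...that also exceeds the fixed threshold.
            if mag_db[k] > threshold_db:
                peaks.append(k)
    return peaks
```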
As a result more sophisticated methods such as cross correlation [Doval 93] and
the F-test method [Levine 98] have been proposed for the detection of sinusoid peaks in
the frequency spectrum. In tests by Virtanen on peak detection using synthetic test
signals, the fixed threshold method, cross-correlation and the F-test were compared for
peak detection [Virtanen 01]. The tests showed that the F-test performs worse than the
other two methods across a wide range of test signals, and that the fixed threshold method
was robust, outperforming the cross-correlation method in many cases. It was found that
the only case where the fixed threshold method was drastically worse than the other
methods was in the case of sinusoids with exponentially decaying amplitudes.
Due to the nature of the FFT, where each coefficient represents a frequency interval of
Fs/N where Fs is the sampling frequency, and N is the length of the FFT, the parameters
of the peaks do not accurately give the parameters of the sinusoids detected. Zero
padding can be used to improve the resolution of the FFT, but without the use of
impractically long window lengths the desired resolution cannot be obtained from zero
padding. Therefore, some other means must be used to obtain estimates of the sinusoid
parameters.
The most common method for estimating the parameters is the use of quadratic
interpolation. If a symmetric window is used when windowing the original signal, then a
quadratic function will give a good approximation of the actual sinusoid parameters
[Rodet 97]. The quadratic expression can be estimated using only 3 FFT coefficients. It
has been observed in [Virtanen 01] that it is better to estimate the parameters using the
log of the absolute values of the FFT coefficients.
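The standard quadratic (parabolic) interpolation can be sketched as follows: a parabola is fitted through the log magnitudes of the peak bin and its two neighbours, and its vertex gives the refined bin position and amplitude. The helper name is illustrative.

```python
import numpy as np

def quadratic_peak_interp(log_mag, k):
    """Refine the frequency and amplitude of a spectral peak at bin k
    by fitting a parabola through the log magnitudes of bins k-1, k, k+1."""
    a, b, c = log_mag[k - 1], log_mag[k], log_mag[k + 1]
    # Vertex offset of the fitted parabola, in bins (between -0.5 and 0.5).
    p = 0.5 * (a - c) / (a - 2 * b + c)
    refined_bin = k + p
    refined_log_mag = b - 0.25 * (a - c) * p
    return refined_bin, refined_log_mag

# Usage: with spectrum = np.abs(np.fft.rfft(frame)), take
# log_mag = np.log(spectrum + 1e-12) and a detected peak bin k;
# the refined frequency is refined_bin * fs / N for an N-point FFT.
```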
The second method used is that of signal derivative interpolation [Desainte 00].
This uses the FFT of a signal and its derivatives to approximate the exact frequencies and
amplitudes of the sinusoids. It has been reported that the performance of this method is
nearly equal to that of the quadratic interpolation [Virtanen 01].
Another method is iterative least-squares estimation, as devised by Depalle and
Helie [Depalle 97]. Using estimates obtained from another estimation method such as
quadratic estimation, the amplitudes and phases are estimated assuming the frequencies
are correct. The frequencies are then re-estimated assuming that the amplitudes and
phases are correct. This process is repeated until convergence of the estimated parameters
is obtained. Because the algorithm used is sensitive to the presence of sidelobes in the
window function used to carry out the STFT, the window function must have no
sidelobes to ensure convergence. This iterative method is said to reduce the size of
window needed for accurate parameter estimation by a factor of 2. However, the method
becomes computationally expensive for large numbers of sinusoids, and it has been
reported that the method has problems dealing with closely spaced sinusoids and complex
polyphonic signals [Virtanen 01].
Iterative analysis of the residual has also been used to obtain sinusoid parameters
that may have been missed in the original analysis [Virtanen 01a]. In this method the
sinusoids are detected and estimated using one of the previous methods described. The
partials are then tracked from frame to frame, and once the analysis is complete the
sinusoids are then synthesised and subtracted from the original signal to obtain the noise
residual. The residual is then analysed to obtain further sinusoids. This process can be
repeated as long as is required.
Having detected the peaks and determined the sinusoid parameters, the next step in
sinusoidal modelling is the linking of peaks from frame to frame to create partial tracks
which track the evolution of sinusoids over time. At each frame a peak continuation
algorithm tries to connect the sinusoidal peaks in the frame to already existing partial
tracks from previous frames. This algorithm searches for the closest match between peaks
in adjacent frames. If a continuation is found then the two peaks involved are removed
from consideration, and the search continues with the remaining peaks. If no suitable
continuation is found for a given partial track, then it is assumed that the sinusoid related
to the partial track has faded out and so that partial track is said to have died. If a peak in
the current frame does not represent a continuation of an existing track then it is assumed
to be the start of a new sinusoidal component and a new track is formed.
There are a number of methods available for deciding the closest match between
peaks, the simplest method being based on the closeness of frequencies from peak to
peak. As human pitch perception is effectively logarithmic in nature this is usually done
by taking the log of the frequencies. The closeness of frequencies from peak to peak is
estimated by subtracting these log values from each other, or in other words by taking the
first order difference between peaks. As subtracting logs is the same as the log of a
division, the closeness measure becomes the log of the ratio of the frequencies. This
method can be improved upon by taking into account the closeness of amplitudes and
phases when deciding the best continuation [Virtanen 01]. It is usual to set bounds for
allowable frequency and amplitude deviation to eliminate continuations that are not likely
to occur in real sounds, and also to reduce the number of possible peak pairs.
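A greedy version of this matching procedure can be sketched as follows; real systems also weigh amplitude and phase closeness, which is omitted here for brevity, and the bound on the allowable log-frequency deviation is an illustrative assumption.

```python
import numpy as np

def continue_tracks(track_freqs, peak_freqs, max_ratio=0.05):
    """Greedily match each existing partial track to the closest peak in
    the next frame, measured as the log of the frequency ratio.
    Tracks with no match within max_ratio are considered to have died;
    unmatched peaks start new tracks."""
    unmatched = set(range(len(peak_freqs)))
    matches = {}
    for t, f_track in enumerate(track_freqs):
        best, best_cost = None, max_ratio
        for p in unmatched:
            cost = abs(np.log(peak_freqs[p] / f_track))
            if cost < best_cost:
                best, best_cost = p, cost
        if best is not None:
            matches[t] = best          # continuation found
            unmatched.remove(best)     # remove peak from consideration
    new_tracks = sorted(unmatched)     # peaks that start new tracks
    return matches, new_tracks

print(continue_tracks([440.0, 880.0], [442.0, 660.0, 884.0]))
```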
In a number of systems, partials of too short a duration are removed. The reason for this
is that not all peaks are associated with stable sinusoids. In some cases spurious peaks
will find a continuation in the next frame, but the resulting tracks are normally only a
couple of frames long. Moreover, true sinusoid tracks of only a couple of frames'
duration are too short to be perceived as a pitch by the ear. As a result, tracks of short
duration are removed before synthesis.
2.3.5 Synthesis
Once the partial tracks have been obtained there remains the task of synthesising the
partial tracks. There are two methods of synthesising the partial tracks: synthesis of the
partials in the time domain, and the Inverse Fast Fourier Transform (IFFT) method.
The partial tracks contain all the information required to synthesise the sinusoids,
but to avoid discontinuities at frame boundaries the parameters are interpolated between
frames. The amplitude is linearly interpolated, but phase interpolation is done using cubic
interpolation. This is because the instantaneous frequencies are the derivative of the
phase, and so the frequencies and phases of the two adjacent frames being interpolated
have to be taken into account. Another method of interpolating the phase is proposed in
[Bailly 98], where the phase is interpolated using quadratic splines fitted using least
squares criteria. Once the parameters have been interpolated for each partial track, the
sinusoids are then synthesised and summed:
s(t) = \sum_{i=1}^{N} a_i(t) \cos(\theta_i(t))                (2.18)

where s(t) is the signal composed of the summed sinusoids, a_i(t) is the amplitude of
sinusoid i at time t, and \theta_i(t) is its instantaneous phase.
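The following sketch illustrates this additive synthesis. For brevity it interpolates amplitude and frequency linearly and obtains the phase by integrating the instantaneous frequency, rather than using the cubic phase interpolation described above; the track format is an assumption made here.

```python
import numpy as np

def synthesise_partials(tracks, fs, n_samples):
    """Additive synthesis of partial tracks (equation 2.18).
    Each track is a list of (frame_time, frequency_hz, amplitude) triples.
    This sketch interpolates amplitude and frequency linearly and
    integrates frequency for the phase (simpler than cubic phase
    interpolation)."""
    t = np.arange(n_samples) / fs
    out = np.zeros(n_samples)
    for track in tracks:
        times = np.array([p[0] for p in track])
        freqs = np.array([p[1] for p in track])
        amps = np.array([p[2] for p in track])
        # Interpolate parameters to sample resolution (zero outside track).
        f_t = np.interp(t, times, freqs, left=0.0, right=0.0)
        a_t = np.interp(t, times, amps, left=0.0, right=0.0)
        # Instantaneous phase is the integral of instantaneous frequency.
        phase = 2 * np.pi * np.cumsum(f_t) / fs
        out += a_t * np.cos(phase)
    return out

# One partial gliding from 440 Hz to 450 Hz over half a second.
y = synthesise_partials([[(0.0, 440.0, 1.0), (0.5, 450.0, 0.5)]], 44100, 22050)
```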
The IFFT method is carried out by filling FFT bins with a number of points for
each partial, and then carrying out an IFFT on the resulting spectrum [Rodet 92]. This
IFFT method is computationally more efficient than synthesis in the time domain, but
time domain synthesis still remains the more popular method in sinusoidal modelling
systems due to the greater accuracy in reconstruction of the original signal. This is
because of the principal disadvantage of the IFFT method, which is that the re-synthesis
parameters are fixed for the duration of the frame, as opposed to being interpolated
between frames as in the time domain method. As a result, the interpolated approach
gives better estimation of the sinusoids than the IFFT method.
Once the sinusoids have been synthesised the residual can be obtained by subtracting the
sinusoidal signal from the original signal. The residual can be modelled in a number of
ways. The first method, developed by Serra, involved taking a Short Time Fourier
Transform (STFT) of the residual. The noise envelope of the spectral magnitude for each
frame is then modelled by piecewise linear approximation. The signal is then re-
synthesised by carrying out an IFFT using random phase with the line-segment
approximated magnitude envelope [Serra 89].
A modification of this scheme is to split the signal into Bark-scale bands [26].
This is motivated by the fact that the ear is insensitive to changes in energy of noise
sounds within Bark bands. The STFT is taken as before, and then the power spectrum is
obtained by taking the square of the magnitude of the STFT. The Bark band z
corresponding to the frequency f in kHz is estimated from [Zwicker 99]:
z(f) = 13 \arctan(0.76 f) + 3.5 \arctan\left( \left( \frac{f}{7.5} \right)^{2} \right)                (2.19)
The energy within each Bark scale band is then calculated by summing the power
spectrum values within each band. For re-synthesis the magnitude spectrum is obtained
by dividing the Bark band energy by the bandwidth, and taking the square root. The
signal is then re-synthesised as described above.
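The Bark band energy calculation can be sketched as follows, using equation 2.19 to map FFT bin frequencies to the Bark scale; the window choice and band count are illustrative assumptions.

```python
import numpy as np

def bark(f_khz):
    """Zwicker's approximation (equation 2.19), frequency in kHz."""
    return 13 * np.arctan(0.76 * f_khz) + 3.5 * np.arctan((f_khz / 7.5) ** 2)

def bark_band_energies(frame, fs, n_bands=24):
    """Sum the power spectrum within each Bark band (a minimal sketch)."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spectrum) ** 2
    freqs_khz = np.fft.rfftfreq(len(frame), 1 / fs) / 1000.0
    z = bark(freqs_khz)
    # Band b collects all bins whose Bark value falls in [b, b+1).
    return np.array([power[(z >= b) & (z < b + 1)].sum()
                     for b in range(n_bands)])
```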
It should be noted that both of these methods can result in the degradation of
transients that occur within frames. In an effort to overcome this, a system designed to
explicitly model transients has been developed by Verma & Meng [Verma 00]. This
system is called transient modelling synthesis (TMS).
In TMS the initial transient of a sound is analysed by transforming the sound with
the discrete cosine transform (DCT). The DCT transforms a transient in the time domain
into a slowly varying sinusoid in the DCT domain. A transient that occurs near the start
of the analysis frame results in a relatively low frequency sinusoid in the DCT domain.
Alternatively if the transient occurs at the end of the frame the DCT of this transient is a
relatively high frequency sinusoid. The size of the analysis frame is typically of one-
second duration, resulting in good frequency resolution in the DCT domain. This slowly
varying sinusoid can then be modeled using standard sinusoidal modelling techniques,
i.e. by taking an STFT of the DCT domain signal and carrying out partial tracking on the
resulting spectrum. The STFT frame size used is much shorter than the DCT frame size,
with typically 30-60 STFT frames per DCT frame, which results in good time resolution
of the sinusoidal model of the transient.
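The behaviour that TMS exploits can be illustrated directly: the DCT of an impulse at sample n is a cosine in the DCT domain whose oscillation rate grows with n, so an early transient maps to a slowly varying oscillation and a late one to a rapid oscillation. The sketch below (illustrative, not Verma and Meng's code) verifies this by counting zero crossings in the DCT domain.

```python
import numpy as np
from scipy.fftpack import dct

fs = 8000
frame = np.zeros(fs)          # one-second analysis frame, as in TMS

early, late = frame.copy(), frame.copy()
early[100] = 1.0              # transient near the start of the frame
late[-100] = 1.0              # transient near the end of the frame

# An early transient maps to a slowly varying (low frequency) oscillation
# in the DCT domain, a late one to a rapid (high frequency) oscillation.
for name, x in [("early", early), ("late", late)]:
    X = dct(x, type=2, norm='ortho')
    crossings = np.sum(np.diff(np.sign(X)) != 0)
    print(name, "zero crossings in DCT domain:", crossings)
```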
The resulting partial tracks can then be re-synthesised from the extracted
information and the resulting transient signal returned to the time domain by carrying out
an inverse DCT. The modelled transients are then subtracted from the original signal,
leaving a residual signal containing the sinusoids and noise. This residual is then analysed
using the sinusoidal plus residual approach.
Another method of modelling the residual is the use of LPC-based waveform
coding techniques [Ding 98]. This time-domain technique has been used because it can
model the transients in the residual more accurately. However, it does not model the
transients explicitly, that is, separately from the rest of the residual. It is also unsuitable
as a basis for further analysis of signals because, being a time-domain technique, it
contains little information concerning the frequency content of the residual.
Although work on Instrument Identification systems has been going on for the past 30
years, it is only in the past few years that systems that are capable of working with
musical recordings have emerged. Systems such as those described by Brown [Brown
97], Dubnov and Rodet [Dubnov 98], and Marques [Marques 99] used various schemes
to classify limited numbers of instruments, but in terms of generality and range of
instruments handled, the systems described below are the most successful to date.
Eronen and Klapuri [Eronen 00] made use of both spectral and temporal features
in their instrument identification scheme, in an attempt to overcome a perceived
shortcoming in previous systems. Previous schemes tended to emphasise either temporal
or spectral information, but the Eronen/Klapuri scheme made greater use of both
temporal and spectral information than prior schemes.
The parameters were estimated by a variety of means. From the short-time RMS energy
envelope, they estimated parameters for the overall sound such as rise time, decay time,
and the strength and frequency of amplitude modulation. The spectral centroid of the
sounds was also obtained. Using Bark-scale bands they measured the similarity of the
amplitude envelopes of the individual harmonics to each other, and modelled the spectral
shapes of the envelopes using cepstral coefficients. The system used two sets of 11
coefficients, averaged over the sound's onset and over the rest of the sample. The
instruments were then classified using Gaussian and k-NN classifiers [Everitt 93].
Using this system the correct instrument family (e.g. brass, strings, reeds etc.) was
recognised with 94% accuracy and the correct instrument was identified with 80%
accuracy. This compared favourably with the system implemented by Martin [Martin 98],
which had a success rate of 90% for identifying instrument family and a success rate of
70% for identifying individual instruments.
Martin's system made use of the log-lag correlogram representation, a description
of which can be found in Section 2.6.1. He extracted 31 features of the sounds presented,
including pitch, spectral centroid, onset duration and features relating to vibrato and
tremolo of the sounds. As the use of 31 parameters required an extremely large training
set, Fisher Multiple Discriminant Analysis was used to reduce this set to a more
manageable size [Subhash 96]. Multiple Discriminant Analysis is used to determine
which variables discriminate between two or more groups, in this case, the different
instruments. Those variables which are found to have little or no discriminating power
can then be removed from further analysis. Martin also made use of a hierarchical
structure to classify the instruments at a variety of levels. Again, k-NN classifiers were
used to make these decisions. The instruments were then classified into instrument
families such as brass and strings, and finally into individual instruments.
Fujinaga used spectral information such as centroid and skewness to carry out
instrument recognition [Fujinaga 98], [Fujinaga 99]. This system again used k-NN
classifiers, with a genetic algorithm employed to arrive at the best set of feature
weightings to use with the k-NN classifier for instrument identification. Initially
the steady state portion of the instruments was used, but it was found that the use of the
dynamic portion of the sound resulted in increased recognition. A recognition rate of 64%
for individual instruments was reported in [Fujinaga 98]. The addition of further
parameters such as spectral irregularity and the use of the time domain envelope further
increased the recognition rate to 68% [Fujinaga 00]. It should be noted that Fujinaga's
system appears to be the most computationally efficient system to date.
All of these systems used the same set of training and test data [Opolko 87], and
so comparisons between these systems are valid. Indeed, Eronen and Klapuri went as far
as using the same 70:30 split between training and test data as Martin to allow direct
comparison.
More recently the automatic classification and labeling of unpitched percussion
sounds was investigated by Herrera et al [Gouyon 01], [Herrera 02], [Herrera 03]. A large
database of nearly 2000 percussive sounds from 33 different types of percussive
instruments, both acoustic and synthetic, was used in [Herrera 03]. A large number of
temporal and spectral descriptors were calculated for each example in an effort to find
feature sets which provided maximal discrimination between the different types of
percussive sound.
A number of different types of descriptors were used in classifying the percussive
instruments. As Mel-Frequency Cepstrum Coefficients (MFCCs) have proved useful in
instrument recognition [Eronen 01], these were used as a set of descriptors. In this case
13 MFCCs were calculated for each example and their means and variances were
retained and used as descriptors. Secondly the energy in each Bark band was calculated
for each example. However, to improve resolution in the low-frequency region of the
spectrum, the bottom two Bark bands were split into two half bands each. This gave 26
bands across the frequency spectrum. The energy proportions in each band and their
variance in time were taken as descriptors.
A further set of descriptors was then derived from the Bark band energies. These
new descriptors included identifying the band with maximum energy, the band with
minimum energy, the proportion of energy in these bands, the bands with the most and
least energy variance, and ratios between low, mid and high frequency regions. The
overall standard deviations of the energy proportions and the energy variances were also
calculated.
Another set of spectral descriptors was then derived from the FFT of each
example. These spectral descriptors included spectral flatness, calculated as the ratio
between the geometric mean and the arithmetic mean of the spectrum, the spectral
centroid, spectral skewness and kurtosis [Kendall 87]. Also calculated was the number of
spectral crossings and a ‘strong peak’ measure, which indicates the presence of a very
pronounced peak. The thinner and higher the maximum peak of the spectrum, the higher
this measure becomes.
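A few of these spectral descriptors can be sketched directly from their definitions; the windowing and the exact normalisations are illustrative assumptions, as the precise formulations used in [Herrera 03] are not reproduced here.

```python
import numpy as np

def spectral_descriptors(frame, fs):
    """Compute a few spectral descriptors from the magnitude spectrum
    of a single analysis frame (a minimal sketch)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    # Spectral flatness: ratio of geometric to arithmetic mean.
    flatness = np.exp(np.mean(np.log(mag))) / np.mean(mag)
    # Spectral centroid: magnitude-weighted mean frequency.
    p = mag / mag.sum()
    centroid = np.sum(freqs * p)
    # Skewness and kurtosis of the spectral distribution about the centroid.
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
    skewness = np.sum(((freqs - centroid) ** 3) * p) / spread ** 3
    kurtosis = np.sum(((freqs - centroid) ** 4) * p) / spread ** 4
    return dict(flatness=flatness, centroid=centroid,
                skewness=skewness, kurtosis=kurtosis)
```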
A number of temporal descriptors were also calculated such as the zero-crossing
rate and a “strong decay” measure, where a sound of high energy containing a temporal
centroid near its start is considered to have a “strong decay”.
The use of the above resulted in a set of 89 descriptors describing properties of
the percussion sounds. In order to reduce this number of descriptors down to a set of
features with the most discriminating power a technique called Correlation-based Feature
Selection (CFS) was used [Hall 00]. This technique measures a ratio between the
predictive power of a given set of features and the redundancy inside the subset. This
resulted in a feature set of the 38 best descriptors. A second set of descriptors, obtained
by taking the natural log of each descriptor (apart from the MFCC-based descriptors,
which were left unchanged), was also tested for discriminating between the drum classes.
The use of the natural log was motivated by the fact that some of the classification
techniques assumed Gaussian-distributed data. It was observed that many of the
descriptors were highly non-Gaussian; however, taking their natural log resulted in
features that were closer to Gaussian in distribution. Using CFS on the log descriptors
resulted in a feature set of 40 descriptors.
A number of different classification techniques were tested such as kernel density
estimation, k-nearest neighbours and decision trees [Everitt 93]. Testing of these
techniques was done using ten-fold cross-validation. Ten randomly chosen subsets
containing 90% of the training examples were used for training the models associated
with each technique. The remaining 10% in each case were then kept for testing.
The results obtained showed that the best performances were obtained when using
a kernel density estimator in conjunction with the log CFS descriptors. This gave a
success rate of 85.7% correct. Using a 1-nearest neighbour rule a similar score of 85.3%
was obtained, again using the log CFS features. Working with a smaller set of drums,
namely snare, bass drum, tom-toms, hi-hats, open hi-hats, ride cymbals and crash
cymbals, a success rate of 90% was achieved [Herrera 02]. The results obtained highlight
the fact that the choice of descriptors is of crucial importance. Given a good set of
descriptors the choice of classification algorithm becomes less important. However, to
date no attempt has been made to identify mixtures of drum sounds, which is critical for
successful drum transcription.
All the above schemes use classifiers to identify the instruments and make use of
a large number of parameters. As the large number of parameters was unwieldy for
analysis, the parameters were mapped down onto a smaller parameter space in which
instrument identification took place. This approach could also be used in the
identification of various types of drums. The principal disadvantage of these systems is
the large number of examples required to train the system, and the time it takes to do so.
Another problem is that they only attempt to identify instruments in isolation, which is
an unrealistic assumption when attempting polyphonic drum transcription.
Attempts at automatic polyphonic music transcription date back to the 1970s, but until
recently these systems were very limited in their scope, allowing only 2 or 3 note
polyphony and a very limited range of instruments [Moorer 77], [Chafe 86]. Despite
these limitations, the resultant transcriptions were often unreliable. Recent schemes offer
greater generality, though there are still limitations on these systems. While the
transcription of pitched instruments is a completely separate task from that of transcribing
percussion instruments such as drums, it was felt that the methods employed might provide
some clues as to how to tackle the problem of drum transcription.
The scheme proposed by Martin [Martin 96] involves the use of a blackboard
system to collect information in support of note hypotheses (i.e. whether or not a note is
present). The use of the term blackboard comes from the metaphor of a group of experts
standing around a blackboard tackling a given problem. The experts stand there watching
a solution develop, adding their particular knowledge when the need arises. Blackboard
systems can make use of both bottom-up (data driven) processing, and top-down
(explanation/prediction driven) processing. They are also easily expanded with new
knowledge sources being added or removed as required.
Initially an FFT-based sinusoid-track time-frequency representation was used,
though in later versions this was replaced by a log-lag correlogram representation [Martin
96a]. A description of the log-lag correlogram can be found in section 2.6.1, which deals
with CASA. The switch to this representation was motivated by evidence that an
autocorrelation representation would make detection of octaves easier, as well as making
the model representation closer to that of human hearing.
The transcription system consists of a central dataspace (the "blackboard"), a set
of knowledge sources (the "experts") and a scheduler. The system as implemented
effectively searches for peaks in each correlogram frame. Re-occurring peaks are joined
together to form periodicity hypotheses, which in turn are used to generate note
hypotheses. The autocorrelation-based system contained little or no top-down processing,
and the transcription results suffered accordingly. The sinusoid track version did contain
top-down processing and as such obtained better overall results.
Walmsley et al. [Walmsley 99, 99a, 99b] describe a system that uses a Bayesian
modelling framework to carry out the transcription. A sinusoid representation was used
as the time-frequency representation of the sounds, and Markov Chain Monte Carlo
methods were used to estimate the harmonic model parameters. Bayesian modelling was
used to incorporate prior musical and physical knowledge into the system, which began
with sinusoids and then successively abstracted up a hierarchy from sinusoids to notes,
chords and so on.
The results presented are limited to one example, which appears to have been well
transcribed, apart from an error due to the occurrence of a perfect fifth. Further examples
would have been of use in evaluating this system. Proposed system extensions
concentrate on the addition of further knowledge sources.
where X(k) is the power spectrum. J is chosen in such a manner that additive noise is
greatly reduced. Convolutive noise is then removed by calculating a moving average over
Y(k) on an ERB critical band frequency scale. This average is then subtracted from Y(k)
to eliminate convolutive noise.
Next, the number of voices or notes present at a given moment in the signal is
estimated. This involves a two-stage process: first the presence or absence of a harmonic
voice or note is determined, and then the number of notes present is estimated.
Pitch estimation then takes place for the most predominant pitch in the mixture in
a manner similar to that of the earlier system, though with the addition of a multiband
approach which finds evidence for a given pitch across a number of bands. The spectrum
of the detected note is then estimated and subtracted from the mixture. Estimation of the
number of notes remaining was then carried out and the next most predominant pitch
estimated and subtracted, and so on until the system decides there are no more pitches left
to be transcribed. Window lengths of 90-200 ms were used, with the accuracy of pitch
detection increasing with window length. However, due to the duration of some notes it
was not always possible to use the longer window.
The system was tested in the presence of noise, specifically drum sounds, and
error rates of 11% were reported for mixtures of two notes using a 190 ms window. The
error rate increased to 39% for six-note polyphonies. Using a 93 ms window, the error
rates were reported at 20% for two notes and 61% for six notes. The system was reported
to have a performance comparable to that of trained musicians in identifying the notes in
chords, but when tested on musical examples generated from MIDI files the performance
was found to degrade. The algorithm reportedly performs best on signals containing
acoustic instruments with no drum sounds.
All the systems described make use of harmonic grouping to attempt the
transcription of the pitched instruments. Some are limited to dealing with specific
instruments by their use of tone models, though others, such as that of Klapuri, are
completely general. The approaches applied to the problem of polyphonic music
transcription are of potential use in attempts to separate drum sounds. Of particular
interest are the various ways of incorporating high-level musical and physical knowledge,
such as blackboard architectures, which could also be of use in the problem of drum
separation.
This section deals with a number of different methods of sound source separation,
ranging from general schemes such as the Computational Auditory Scene Analysis
scheme proposed by Ellis [Ellis 96], to those that are limited in scope, such as systems
designed to separate pitched instruments. As will be seen a number of completely
different approaches have been used, each making use of different properties of the
signals investigated. A number of other sound separation schemes based on information
theoretic principles will be discussed in Chapter 3.
The axes of the log-lag correlogram used as the front end of Ellis' system are time,
frequency (cochlear position) and lag (or inverse pitch). If a signal is nearly periodic,
then the autocorrelation of the signal in each filterbank channel will lead to a 'ridge'
along the frequency axis. This 'ridge' occurs at the lag corresponding to the periodicity of
the original signal.
Three generic sound elements were implemented in the model. The 'noise cloud'
is used to represent sound energy in the absence of periodicity, and is modelled as a static
noise process to which a slowly varying envelope is applied. 'Transient clicks' are used to
model brief bursts of energy. 'Wefts' are used to represent wideband periodicities. The
ridges in the lag-frequency plane are traced through successive time steps of the log-lag
correlogram to obtain a weft.
The 'blackboard' is then used to create and modify the sound elements depending
on current predictions and new information provided by the front end. Elements are
added or removed depending on the shortfall or excess of predictions made by the
system.
Re-synthesis of the elements at the end was not always completely satisfactory.
Clicks and wefts arising from speech did not fuse perceptually. In subjective listening
tests there was a strong correspondence between events recorded by listeners and the
elements generated by analysis. However, a number of short wefts identified by the
system were not identified by listeners. Also, the system only provided a very broad
classification of sounds into three basic types, and the re-synthesis quality was reportedly
poor.
The CASA system proposed by Godsmark and Brown [Godsmark 99] was again
based on a blackboard architecture. In this system the hypothesis formation and evaluation
experts were trained to ensure that they were consistent with known psychoacoustic data.
The system made use of the concept of an Organisation Hypothesis Region (OHR), a
temporal window that passes over the auditory scene. Within the OHR, grouping of
sound elements remains flexible and many hypotheses on a variety of levels can be
tested, but once elements pass beyond the window limits a fixed organisation is imposed.
The front end of the system was a bank of gammatone filters. Instantaneous
frequencies and amplitudes were calculated for each filter’s output, and place groups
(largely equivalent to frequency tracks) calculated. These place groups are then collected
into synchrony strands on the basis of temporal continuity, frequency proximity and
amplitude coherence. When the place groups pass beyond the OHR, their organisation is
fixed according to the best hypothesis and a local evaluation score is generated. A global
hypothesis score is then generated by the sum of these scores over time. This score is
then used to decide between competing hypotheses.
Emergent properties such as timbre and fundamental frequency are then evaluated
by carrying out grouping of place groups by pitch proximity and timbral similarity. A
‘timbre track’ is used to measure timbral similarity. This is a plot of changes in spectral
centroid against changes in amplitude. The similarity between timbre tracks can then be
used as a basis for decisions on the grouping of strands. Then, at the highest level of the
blackboard, experts are used to identify meter and repeated melodic phrases.
The system was evaluated on its ability to distinguish interwoven melodies and
outperformed listeners in the majority of cases. It was also evaluated on the transcription
of polyphonic music. The system performed well for solo piano, but performance
worsened as complexity increased, with attempts at four-part transcription faring much
worse than the other tests. Though the system was tested on music, it is intended to be
expanded into a general CASA architecture.
As with the polyphonic music transcription systems, it is the approach to the
problem that is of interest, in particular the means used to deal with different types of
sound. Ellis' CASA system attempts to deal with both harmonic (pitched) signals and
transient noise bursts, which is analogous to dealing with pitched musical instruments
and drum sounds. The
concept of a timbre track may also provide a means of easily differentiating partials from
different instruments and provide a means of identifying partials associated with drum
sounds. The use of blackboard systems to incorporate knowledge sources is again a
feature of the systems.
Sinusoidal modelling was used as the basis for the separation of harmonic sound sources
in [Virtanen 02]. The first step in the model was multipitch estimation of the sounds
present in the mixture. This was done in the manner described in [Klapuri 01] and made
use of long temporal windows of length 90-200 ms. While this window length is
necessary to allow detection of the pitches present, it is too long to capture effects such as
amplitude and frequency modulations in the sounds present. As a result sinusoidal
modelling is carried out on the sound, using the pitches found in the multipitch estimation
stage to guide the sinusoidal modelling.
The sinusoidal modelling is carried out using smaller windows than used in the
multipitch estimation stage to allow more accurate determination of the parameters.
Estimation of the amplitude and phase parameters necessary for re-synthesis was carried
out in an iterative manner. First the accuracy of the amplitude and phase parameters was
improved in a least squares sense. Then in order to overcome the problem of overlapping
partials a linear model of the harmonic series was used to force the spectral envelope to
be smooth. Finally accurate estimates of modulations in frequency were determined.
Changes in frequency of components in a sound are tied to changes in the lowest
harmonic of the sound, thus retaining the harmonic structure of the sound at all times, and
so enforcing grouping in accordance with harmonic ratios and frequency modulation. This
procedure is repeated until the estimates converge. This procedure assumes that the
pitches provided by the multipitch estimation algorithm are the correct pitches. If an error
has been made in determining the pitch then the iterative algorithm will not converge.
A number of linear models of the harmonic series were used. These included
spectral smoothing, described in detail in [Virtanen 01b], and other models of the
harmonic structure, such as an approximation to a critical-band filter model of the
harmonic series and an approximation to mel-cepstrum models. These approximations
are described in detail in [Virtanen 02]. The quality of the re-synthesised signals was
found to degrade as the number of pitches increased, regardless of what model of the
harmonic series was used. The critical-band approach and the mel-cepstrum approach
were found to give better results as the number of pitches present at a given instant
increased, with reported signal-to-noise ratios of 7.4 dB for mixtures of five pitches.
Unfortunately, this sound source separation method is designed to deal with
harmonic sounds and so is not applicable to drum sounds. It is also interesting to note that
the sound source separation system proposed is in effect a two-stage process: first a
transcription stage, and then a sound source separation stage.
The Degenerate Unmixing Estimation Technique (DUET) for sound source separation is
based on the concept that perfect demixing of a number of sound sources from a mixture
is possible using binary time-frequency masks provided that the time-frequency
representations of the sources do not overlap [Rickard 01], [Yilmaz 02]. This condition is
termed W-disjoint orthogonality (W-DO). Unlike the other sound source separation
systems presented in this chapter it does not make use of grouping rules derived from
psychoacoustics, using instead the criterion of W-DO. Stated formally, two signals s_1(t)
and s_2(t) are W-DO if:

\hat{s}_1(\tau, \omega)\, \hat{s}_2(\tau, \omega) = 0                (2.21)

where \hat{s}(\tau, \omega) is a time-frequency representation, such as an STFT, of s(t),
with τ indicating time and ω indicating frequency.
If x(t) is a mixture signal composed of N sources s_j(t), where j signifies the jth
source, then:

x(t) = \sum_{j=1}^{N} s_j(t)                (2.22)
Assuming that the sources are pairwise W-DO, only one of the N sources will be active
at a given τ and ω, resulting in:

\hat{x}(\tau, \omega) = \hat{s}_{J(\tau,\omega)}(\tau, \omega)                (2.24)

where J(τ, ω) is the index of the source active at (τ, ω). Demixing can then be carried out
by creating time-frequency masks for each source from:

M_j(\tau, \omega) = \begin{cases} 1 & \hat{s}_j(\tau, \omega) \neq 0 \\ 0 & \text{otherwise} \end{cases}                (2.25)

The time-frequency representation of s_j(t) can then be obtained from:

\hat{s}_j(\tau, \omega) = M_j(\tau, \omega)\, \hat{x}(\tau, \omega)                (2.26)
Determining the time-frequency masks for a single mixture signal is at present an open
issue, but solutions have been arrived at in cases where two mixture signals are available.
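The masking operation of equations 2.25 and 2.26 can be sketched as follows. Note that the masks here are derived from the individual source STFTs, which are of course unknown in the blind setting; the sketch only illustrates the demixing step itself.

```python
import numpy as np

def binary_masks(source_stfts):
    """Build binary time-frequency masks (equations 2.25 and 2.26).
    For illustration the individual source STFTs are assumed known;
    in practice estimating the masks blindly is the hard part."""
    mags = np.abs(np.array(source_stfts))
    # Under W-DO only one source is active at each (tau, omega) point,
    # so assigning each point to the dominant source recovers the masks.
    winner = np.argmax(mags, axis=0)
    return [(winner == j).astype(float) for j in range(len(source_stfts))]

def demix(mask, mixture_stft):
    """Recover one source's time-frequency representation (equation 2.26)."""
    return mask * mixture_stft
```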
The mixture signals x1(t) and x2(t) are assumed to have been obtained from linear
mixtures of the N sources sj(t). The assumption of linear mixing contains the underlying
assumption that the mixtures have been obtained under anechoic conditions. The mixing
model is then:
x_k(t) = \sum_{j=1}^{N} a_{kj}\, s_j(t - \delta_{kj}), \quad k = 1, 2                (2.27)

where a_{kj} and δ_{kj} are the attenuation coefficients and time delays associated with
the path from the jth source to the kth receiver. For convenience, let a_{1j} = 1 and
δ_{1j} = 0 for j = 1,…,N, and rename a_{2j} as a_j and δ_{2j} as δ_j. Taking the STFT of
x_1(t) and x_2(t) then yields the following mixing model in the time-frequency domain:

\begin{bmatrix} \hat{x}_1(\tau,\omega) \\ \hat{x}_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\tau,\omega) \\ \vdots \\ \hat{s}_N(\tau,\omega) \end{bmatrix}                (2.28)
Defining R_{21}(τ, ω) as the element-wise ratio of the STFTs of each channel gives:

R_{21}(\tau, \omega) = \frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)}                (2.29)

Assuming the N sources are W-DO, it can then be seen that for any given source j active
at (τ, ω):

R_{21}(\tau, \omega) = a_j e^{-i\omega\delta_j}                (2.30)

and that:

-\frac{1}{\omega} \angle R_{21}(\tau, \omega) = \delta_j                (2.32)
where ∠z is the phase of the complex number z taken between -π and π. This allows the
mixing parameters to be calculated. A time-frequency mask for the first source can be
calculated by finding (τ, ω) which have mixing parameters a1 and δ1. This can then be
repeated for successive sources. The separated sources can be obtained by applying the
time-frequency mask to either of the two original mixture signals.
An important limitation of the DUET method is that the time delay between the
two receivers is constrained by the following condition:

\omega_{max}\, \delta_{j\,max} < \pi                (2.33)

If ω_{max} = ω_s/2, where ω_s is the sampling frequency, then the maximum delay
between the two receivers is:

\delta_{j\,max} = \frac{2\pi}{\omega_s}                (2.34)
In practice, when using a two-microphone setup, this means that the maximum distance
between the microphones is d = \delta_{j\,max} c, where c is the speed of sound. The
distance between the two microphones therefore has to be quite small in practice, of the
order of a few centimetres.
However, in most cases the assumption of strict W-DO will not hold. Of greater
interest is the case where the sources are approximately W-DO. It has been shown that
mixtures of speech signals can be considered approximately W-DO [Rickard 02] and so
an algorithm was derived in [Yilmaz 02] that was capable of separating approximately
W-DO sources. As the sources are no longer strictly W-DO there will be cases where the
sounds interfere with each other. As a result the estimates for the parameters aj and δj for
each source will no longer be uniform. However, the estimates of the parameters for each
source will still have values close to the actual mixing parameters. Therefore, plotting a
smoothed 2-D weighted histogram of the values of the estimates of aj and δj and then
finding the main peaks of the histogram will give an estimate of the actual mixing
parameters. The time delay at each position in time and frequency is calculated from:

\delta(\tau, \omega) = -\frac{1}{\omega} \angle R_{21}(\tau, \omega)                (2.35)
For reasons explained in [65], the attenuation coefficients are not estimated directly;
instead the following parameter is used:

\alpha(\tau, \omega) = a(\tau, \omega) - \frac{1}{a(\tau, \omega)}                (2.36)

where:

a(\tau, \omega) = \left| R_{21}(\tau, \omega) \right|                (2.37)
A 2-D weighted histogram is then calculated for δ(τ, ω) and α(τ, ω). The 2-D weighted
histogram is defined as:

h(\alpha, \delta) = \sum_{\tau,\omega} M_{\alpha,\beta/2}[\tau, \omega]\, M_{\delta,\Delta/2}[\tau, \omega]\, \left| \hat{x}_1[\tau, \omega]\, \hat{x}_2[\tau, \omega] \right|                (2.38)

where β and ∆ are the resolution widths for α and δ respectively, and where:

M_{\alpha,\beta/2}(\tau, \omega) = \begin{cases} 1 & \text{if } \left| \alpha(\tau, \omega) - \alpha \right| < \beta/2 \\ 0 & \text{otherwise} \end{cases}                (2.39)

M_{\delta,\Delta/2}(\tau, \omega) = \begin{cases} 1 & \text{if } \left| \delta(\tau, \omega) - \delta \right| < \Delta/2 \\ 0 & \text{otherwise} \end{cases}                (2.40)
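A sketch of the local parameter estimates and the weighted histogram (equations 2.35-2.38) is given below; numpy's histogram2d bins play the role of the resolution widths β and ∆, and the epsilon terms guard against division by zero.

```python
import numpy as np

def duet_histogram(X1, X2, omega, n_bins=50):
    """Local mixing-parameter estimates (equations 2.35-2.37) and the
    2-D weighted histogram of equation 2.38, given two mixture STFTs.
    X1, X2 are complex arrays of shape (n_freqs, n_frames); omega holds
    the (non-zero) bin centre frequencies in rad/s."""
    R21 = X2 / (X1 + 1e-12)                      # element-wise ratio (2.29)
    a = np.abs(R21)                              # attenuation estimate (2.37)
    alpha = a - 1.0 / (a + 1e-12)                # symmetric attenuation (2.36)
    delta = -np.angle(R21) / omega[:, None]      # relative delay (2.35)
    weight = np.abs(X1 * X2)                     # histogram weight (2.38)
    hist, a_edges, d_edges = np.histogram2d(
        alpha.ravel(), delta.ravel(), bins=n_bins, weights=weight.ravel())
    return hist, a_edges, d_edges
```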
To ensure that all the energy in the region of the true mixing parameters is gathered into a
single peak for each source, the histogram is then smoothed with a rectangular kernel
r(α, δ), defined as:

r(\alpha, \delta) = \begin{cases} 1/AD & (\alpha, \delta) \in [-A/2, A/2] \times [-D/2, D/2] \\ 0 & \text{otherwise} \end{cases}                (2.41)

The peaks of the smoothed histogram then give estimates of the mixing parameters
(a_j, δ_j) for each source, and each time-frequency point can be assigned to a source using
the following likelihood function:

L_j[\tau, \omega] = p\left( \hat{x}_1(\tau, \omega), \hat{x}_2(\tau, \omega) \mid a_j, \delta_j \right) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{1}{2\sigma^2} \left| a_j e^{-i\delta_j\omega} \hat{x}_1[\tau,\omega] - \hat{x}_2[\tau,\omega] \right|^2 / \left( 1 + a_j^2 \right)}                (2.43)
The time-frequency points for a given source can then be obtained by taking all points for
which L_j[τ, ω] ≥ L_i[τ, ω] for all i ≠ j as belonging to the jth source. A binary time-frequency
mask can then be constructed for the source. The mask can then be applied to the STFT
of either of the mixture signals and the source re-synthesised via an inverse STFT of the
time-frequency representation obtained.
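The maximum-likelihood assignment can be sketched as follows; the noise variance σ is assumed known, and the mixing parameters are taken as given (e.g. from the histogram peaks).

```python
import numpy as np

def ml_masks(X1, X2, mix_params, omega, sigma=1.0):
    """Assign each time-frequency point to the source of highest
    likelihood under equation 2.43 and return one binary mask per source.
    mix_params is a list of estimated (a_j, delta_j) pairs; omega holds
    the bin frequencies in rad/s; sigma is assumed known."""
    likelihoods = []
    for a_j, d_j in mix_params:
        # Residual of the anechoic mixing model at each (tau, omega) point.
        resid = np.abs(a_j * np.exp(-1j * d_j * omega[:, None]) * X1 - X2) ** 2
        L = np.exp(-resid / (2 * sigma ** 2 * (1 + a_j ** 2)))
        likelihoods.append(L / (2 * np.pi * sigma ** 2))
    winner = np.argmax(np.array(likelihoods), axis=0)
    return [(winner == j).astype(float) for j in range(len(mix_params))]
```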
There still remains the problem of blindly estimating the number of sources
present in the mixtures. In testing, Rickard et al used an ad-hoc procedure that iteratively
selected the highest peak and removed a region surrounding it from the histogram. Peaks
were removed in this way until the histogram fell below a threshold percentage of its
original weight. However, both the threshold percentage and the region dimensions had to
be altered from test to test in the experiments described in [65], leaving the identification
of the number of sources present as an open issue.
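This ad-hoc peak-picking procedure can be sketched as follows; the region size and stopping fraction are illustrative and, as just noted, would need hand tuning in practice.

```python
import numpy as np

def pick_histogram_peaks(hist, region=(3, 3), stop_fraction=0.1):
    """Iteratively select the highest histogram peak and zero out a
    surrounding region, stopping once the histogram's remaining weight
    falls below a fraction of its original weight."""
    h = hist.copy()
    total = h.sum()
    peaks = []
    while h.sum() > stop_fraction * total:
        i, j = np.unravel_index(np.argmax(h), h.shape)
        peaks.append((i, j))
        # Remove the region surrounding the peak from the histogram.
        i0, i1 = max(0, i - region[0]), min(h.shape[0], i + region[0] + 1)
        j0, j1 = max(0, j - region[1]), min(h.shape[1], j + region[1] + 1)
        h[i0:i1, j0:j1] = 0.0
    return peaks
```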
The DUET algorithm was found to work well on anechoic mixtures of speech
signals. It was shown in [66] that the condition of approximate W-DO holds quite well
for anechoic speech mixtures, and so the DUET algorithm is well suited to separating
such mixtures. When tested under echoic conditions the performance of the algorithm
degraded considerably, though some degree of separation was still found to be possible.
In echoic conditions the histogram peak regions were found to spread out and overlap
with each other, making the algorithm less effective.
When dealing with overlapping noise based sounds such as drums the assumption
of W-DO, or even approximate W-DO, cannot be held to be true across the entire
frequency spectrum. However, in many cases there will be regions of the frequency
spectrum where the assumption of approximate W-DO will hold true and the information
from these regions may be sufficient to allow estimation of the mixing parameters. For
example, consider a hi-hat and a bass drum occurring simultaneously. In this case the
high-frequency region will mostly contain information on the hi-hat, and the lower region will
contain information mainly relating to the bass drum. Even in a more ambiguous case
such as snare and bass drum overlapping there should still be regions in the spectrum
where one source predominates over the other so as to allow estimation of the mixing
parameters.
We carried out a number of tests to see how the DUET algorithm performed on
separating mixtures of drum sounds. In these tests the listener identified all sources
manually and the algorithm parameters were set to give the best results for each test
signal. The first test was a drum loop consisting of a bass drum and snare drum
alternating with no overlap in time. The bass drum was panned mid-left and the snare
drum mid-right. As no overlap occurs in time the condition of W-DO is satisfied in this
case. Figure 2.3 shows the smoothed histogram obtained from applying the DUET
algorithm to this example. As can be seen there are two clearly defined peaks, one for
each of the sources present. Figure 2.4 shows the waveforms for the first mixture signal,
and the separated sources. The relevant audio examples for Figure 2.4 can be found in
Appendix 2 on the CD included with this thesis.
As can be seen in Figure 2.4, the sources have been separated successfully, which
is not surprising as all the conditions for the successful use of the algorithm have been
met. This demonstrates the use of the algorithm in separating mixtures when no overlap
in time occurs.
The second test was a drum loop consisting of bass drum, snare drum and hi-hats.
The snare was panned mid-left, the hi-hats mid-right and the bass drum was panned to
center. The hi-hats overlapped all occurrences of snare and bass drum, and no overlap
occurred between snare and bass drum. This is a more realistic test than the first test as
hi-hats often overlap occurrences of these drums and the drum pattern in the loop is one
of the most common patterns that occur in rock and pop music. In this case the condition
of approximate W-DO is violated, but as noted previously there will be regions of the
spectrum where the condition of approximate W-DO still holds and this should be
sufficient to allow estimation of the mixing parameters.
As can be seen in Figure 2.5 above, the assumption that there are enough
regions in the spectrogram where approximate W-DO applies to allow estimation of the
mixing parameters has turned out to be true in this case. There are visible peaks
associated with each of the sources. However, it should be noted that these peaks are not
as clearly defined as in cases where approximate W-DO holds true across the entire range
of the spectrum, thus implying a degradation in the performance of the algorithm.
The re-synthesised signals shown in Figure 2.6 (see Appendix 2 for audio
examples) show that there is sufficient information to recover good estimates of both the
snare and hi-hat, while the bass drum still contains portions of the snare and hi-hats. The
presence of these sources has been reduced somewhat by the algorithm. On listening to
the re-synthesised sounds it is noted that the snare has been captured quite well, as have
most of the hi-hats, the exceptions being the hi-hats which occurred simultaneously with
the snare drum.
The histogram peaks obtained were again more spread out than in the previous
example, showing that there is less information available to allow separation to take
place. Nevertheless the algorithm still performed well in difficult circumstances, dealing
with the case of overlapping snare and bass drum reasonably well.
Despite this, there are a number of problems with using the DUET algorithm. The
first of these is that the thresholds have to be adjusted manually to detect the peaks in
order to avoid detection of spurious peaks. This leads to a lack of robustness when
attempting to transcribe and separate drum sounds automatically.
The second and more important problem is inherent in the DUET algorithm itself.
The algorithm makes use of differences in amplitude and phase between the sources in
order to separate the sounds. If there are no amplitude and phase differences available
then the algorithm cannot work. Consider then a typical recording setup for a drum kit.
Typically there will be individual directional microphones close to each of the sources,
for example individual microphones close to the snare, kick drum, hi-hats, and so on.
There may be a pair of microphones placed high over the drum kit to also capture an
overall recording of the drum kit. These microphones will be mixed into the background
of the mix to add extra ambience to the overall recording, though for the most part the
principal source for each of the drums will be the microphone closest to the drum in
question. It should also be noted that the pair of overhead microphones will be too far
apart to allow correct estimation of the phase parameter for the DUET algorithm. This
effectively results in a set of mono recordings, one for each drum, which will then be
mixed down to a stereo signal. These mono signals can then be panned to various
positions in the stereo field. This will result in a stereo signal composed of a mixture of
mono signals, and as such there is no phase difference available to measure between the
signals for each source.
In such a case, where there is no phase difference available, the DUET algorithm
can then be simplified considerably. The algorithm can be reduced to finding only
estimates of the amplitude parameter. As a result, a 1-D histogram of the amplitude
parameter will be sufficient to differentiate between the sources, provided of course that
each of the sources is panned to a different point in the stereo field. In practice this is
frequently not the case, as both snare and kick drum are usually panned to the center of the stereo field,
making it impossible to separate them using the DUET algorithm. The hi-hats and ride
cymbals are often panned slightly to the left or the right, which means that separation of
these drums will often be possible. The toms may take up various positions in the stereo
field depending on the context, and may change position in the course of a piece, for
example in a drum roll which pans from left to right as it progresses.
The addition of reverb to the drum sounds will further complicate matters, making it
more difficult for the DUET algorithm to perform successfully.
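As an illustration of this simplified, amplitude-only form of the algorithm, the sketch below builds the 1-D histogram of per-point amplitude ratios from a stereo signal. The power weighting is one plausible design choice so that loud, reliable points dominate the histogram; all names are illustrative.
```python
import numpy as np
from scipy.signal import stft

def amplitude_only_histogram(left, right, bins=100, nperseg=1024):
    _, _, XL = stft(left, nperseg=nperseg)
    _, _, XR = stft(right, nperseg=nperseg)
    # amplitude mixing parameter estimate at each time-frequency point
    ratio = np.abs(XR) / (np.abs(XL) + 1e-12)
    # weight each point by its power; peaks in the resulting histogram then
    # indicate sources panned to distinct positions in the stereo field
    power = (np.abs(XL) ** 2 + np.abs(XR) ** 2).ravel()
    hist, edges = np.histogram(ratio.ravel(), bins=bins, range=(0.0, 4.0),
                               weights=power)
    return hist, edges
```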
As a result of the above it cannot be guaranteed that each drum will have a unique
position in the stereo field and so, while effective in separating drum sounds when this is
the case, the DUET algorithm cannot be relied upon to work in all cases. It should be
noted that the algorithm is still of potential use in dealing with drum sounds that are
known to be panned to a unique position in the stereo field. More recently extensions
have been proposed to the DUET algorithm that attempt to extend its usefulness when
dealing with musical signals [Viste 02], [Masters 03]. Unfortunately these systems still
have the same limitations as the original DUET algorithm as described above, namely the
necessity for a unique position in the stereo field for each source, and the problem of
detecting the correct number of sources in the signal. Nonetheless, the concept of W-DO
has been shown to be an applicable method for attempting separation of mixtures of drum
sounds, and a method of obtaining or approximating binary masks from a single channel
mixture would extend the usefulness of the algorithm for the separation of drum sounds.
2.7 Conclusions
The literature review conducted in this chapter dealt with Music Information Retrieval
methods and techniques and how they may be of use in tackling the problem of automatic
drum transcription. As can be seen, very little work has been done that focuses on the
problem of polyphonic drum transcription, and in most cases no proper evaluation of the
methods proposed has been carried out. With the exception of recent work by Paulus, the
methods all fail to adequately address the problem of dealing with mixtures of drum
sounds, and this is something that needs to be overcome to develop robust drum
transcription systems. Nevertheless, the methodologies used are a good starting point for
drum transcription systems, and techniques such as those used for onset detection and
removing the effects of pitched instruments via sinusoidal modelling are of potential use.
Beat tracking methods have been investigated as a potential means of resolving
ambiguities in the transcription process, though it is possible to build drum transcription
systems that make no use of beat tracking. As was noted, sinusoidal modelling techniques
can provide a means of removing some of the interference due to pitched instruments
when attempting to transcribe drums in the presence of other instruments.
While much work has been done on musical instrument identification systems,
and while there are systems which can robustly identify the type of drum presented to the
system, as yet these systems focus on identifying single instruments. Again, dealing
with mixtures of drum sounds is a problem that needs to be dealt with.
Polyphonic Music Transcription was briefly looked at to investigate various
methods of integrating psychoacoustic knowledge into transcription systems, and a
number of sound source separation schemes were investigated. Of these, the DUET
algorithm is potentially the most useful, but is limited by its requirement of a different pan
and delay position for each source, which cannot always be guaranteed.
As can be seen, the various Music Information Retrieval methods provide
directions for investigation in the problem of drum transcription. The most promising of
these will be considered again in chapter 4, alongside information theoretic approaches,
with a view to finding a suitable approach to attempt automatic drum transcription.
Information Theoretic Approaches to Computational Audition
The information theoretic approach used successfully grouped the sinusoids without being
told the number of groups present, or indeed given any other information other than the
sinusoids themselves.
Further supporting evidence for the use of redundancy reduction techniques can
be found in the work of Casey, who has demonstrated the use of redundancy reduction
techniques for sound recognition and sound source separation [Casey 98], [Casey 00].
Since then, more evidence for a statistically based approach to audition was provided by
Chechik et al. who investigated the way groups of auditory neurons interacted, and found
evidence of redundancy reduction between groups of auditory neurons containing ten or
more cells [Chechik 01]. This work provided, for the first time, direct evidence for
redundancy reduction in the auditory pathway.
As a result of the success of information theoretic approaches in dealing with
aspects of vision and more recently audition, it was decided to investigate these
approaches to see if they could be of use in attempting to build a system to transcribe
drums. The first two sections of this chapter provide background on two of the tools most
commonly used when applying information theoretic approaches: Principal Component
Analysis and Independent Component Analysis. This is then followed by a review of
approaches to extracting information from single channel audio sources using these
techniques, such as methods for extracting invariants capable of describing sound sources
and methods for sound source separation such as Independent Subspace Analysis. The
use of Sparse Coding techniques for extracting information from single channel audio
signals is then discussed. Also included are potentially useful information theoretic and
redundancy reduction techniques from other fields such as vision research.
3.1 Principal Component Analysis
The application of PCA to the analysis of musical signals was first carried out by Stautner
in 1983, who used it to analyse a performance of tabla
playing [Stautner 83]. It has also been used as a means of analysing and manipulating
musical sounds [Ward 02].
Two random variables x1 and x2 are said to be uncorrelated or orthogonal if:
$E\{x_1 x_2\} - E\{x_1\}E\{x_2\} = 0$ (3.1)
where E{x} is the expectation of the variable x.
The first principal component contains as much of the total variance as
possible, and each successive principal component contains as much of the total
remaining variance as possible. As a result of this property, one of the uses of PCA is as a
method of dimensional reduction, through the discarding of components that contribute
minimal variance to the overall data.
The principal components can be calculated using the Singular Value Decomposition (SVD), which factorises the data matrix Y as:
$\mathbf{Y} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ (3.2)
where S is a diagonal matrix of singular values, the columns of U contain the principal components of Y based on frequency, and the
columns of V contain the principal components of Y based on time. Figure 3.1 shows the
spectrogram of a drum loop containing snare, bass drum and hi-hats. Figures 3.2 and 3.3
show the first three columns of U and V, respectively, which were obtained by carrying
out PCA on the drum loop shown in the spectrogram.
It is only when further principal components are retained that the contribution of the hi-
hats begins to be seen clearly. Thus, care is needed in choosing the number of principal
components to be retained when using PCA for dimensional reduction purposes.
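As a minimal illustration of this procedure, the following Python sketch performs PCA on a magnitude spectrogram via the SVD and retains a chosen number of components; the function and variable names are illustrative only.
```python
import numpy as np

def pca_spectrogram(Y, k):
    # Y: magnitude spectrogram, n_freq x m_time
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    freq_pcs = U[:, :k]                    # principal components based on frequency
    time_pcs = Vt[:k, :].T                 # principal components based on time
    Y_k = (U[:, :k] * S[:k]) @ Vt[:k, :]   # rank-k approximation of Y
    return freq_pcs, time_pcs, Y_k
```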
Also of importance is the fact that while PCA is capable of characterising the
overall signal, it has not successfully characterised the individual sound sources present
in the spectrogram. This is immediately apparent in the first set of principal components.
The first frequency component (shown in Figure 3.2) contains two main peaks,
corresponding to the main resonances of the kick drum and snare drum respectively, and
also contains some high frequency information relating to the hi-hats. This is further
borne out in the first time component (Figure 3.3), where each event can be seen.
The main reason PCA has failed to characterise the individual sound sources is
that it only de-correlates the input data, which by definition makes the sources
independent up to second order statistics only. Statistical independence between two
variables is defined as:
$p(x_1, x_2) = p(x_1)\,p(x_2)$ (3.3)
A consequence of this is that, given two functions f(x) and g(x), for independent variables
$E\{f(x_1)g(x_2)\} - E\{f(x_1)\}E\{g(x_2)\} = 0$ (3.4)
In the case where f(x1) = x1 and g(x2) = x2 this reduces to equation 3.1. Therefore,
independence implies decorrelation, but the reverse does not hold. For Gaussian
distributions decorrelation and independence are equivalent as Gaussian distributions do
not contain information above second order. From this it can be seen that PCA carries
within it the implicit assumption that the variables or sources in the data can be
represented by Gaussian distributions. However, Bell and Sejnowski have shown that
musical sounds have probability density functions (pdfs) which are highly non-Gaussian
and contain information in higher order statistics than second order [Bell 95].
This is demonstrated in Figures 3.4 and 3.5 below, which show pdfs obtained from
an excerpt of a pop song in comparison with the standard Gaussian pdf. The
musical signal shows a more pronounced ‘spike’ than the Gaussian pdf, and has longer
tails. The ‘spikiness’ of a given pdf is normally measured by the fourth order statistic of the
signal, also known as kurtosis. Musical signals belong to a class of signals known as
super-Gaussian. Signals of this type exhibit larger spikes in their pdfs than the
Gaussian distribution and have longer tails than Gaussian distributions.
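The ‘spikiness’ of a signal can be checked directly: the short sketch below computes the excess kurtosis, which is approximately zero for a Gaussian signal and positive for super-Gaussian signals such as music. The function name is illustrative.
```python
import numpy as np

def excess_kurtosis(x):
    # fourth-order measure of 'spikiness'; ~0 for Gaussian, >0 for super-Gaussian
    x = x - np.mean(x)
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

# e.g. excess_kurtosis(np.random.default_rng(0).standard_normal(100_000)) is
# close to 0, whereas an audio waveform would be expected to give a clearly
# positive value.
```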
More extreme spikiness can be seen if the signal is transformed to the frequency domain.
The pdf, which has had its mean removed, has also become markedly skewed to one
side, and has a very long tail, though this is not visible in the region plotted.
Therefore, PCA is incapable of adequately characterising musical signals: it
assumes a Gaussian model, which clearly does not describe signals that are
highly non-Gaussian, and so it does not take into account information contained in the
higher order statistics of musical signals. Full statistical independence can only be
achieved with reference to these higher order statistics. Achieving statistical
independence for non-Gaussian sources has been studied extensively and has resulted in a
set of techniques known as Independent Component Analysis [Comon 94].
Despite the fact that it only characterises sources up to second order statistics,
PCA is still a useful tool particularly for dimensional reduction, because it orders its
components by decreasing variance, thereby allowing components of low variance to be
discarded.
3.2 Independent Component Analysis
Independent Component Analysis (ICA) assumes that a set of M observed mixture signals xi is generated from N unknown, mutually independent sources sj according to:
$x_i = \sum_{j=1}^{N} a_{ij} s_j$ (3.5)
It is assumed that the contributions of each of the N sources add together linearly to
create each xi. Using matrix notation, the equation can be written in a more elegant form:
$\mathbf{x} = \mathbf{A}\mathbf{s}$ (3.6)
with:
$\mathbf{x}^T = [x_1 \ldots x_M], \quad \mathbf{s}^T = [s_1 \ldots s_N]$
and A is an M × N matrix, called the mixing matrix. In most ICA algorithms
the number of sensors has to equal the number of sources, resulting in A being an
invertible matrix of size N × N.
ICA then attempts to estimate the matrix A, or, equivalently, to find an unmixing
matrix W such that
$\mathbf{y} = \mathbf{W}\mathbf{x} = \mathbf{W}\mathbf{A}\mathbf{s}$ (3.7)
gives an estimate of the original source signals, where $\mathbf{y}^T = [y_1 \ldots y_N]$ and W is of size N × N.
The vector y will have independent components yi if and only if:
$p(\mathbf{y}) = \prod_{i=1}^{N} p(y_i)$ (3.8)
where p(yi) is the probability density function (pdf) of yi, and p(y) is the joint pdf of the
vector y. As stated previously, an alternative definition of statistical independence which
highlights its relationship with decorrelation is:
$E\{f(x_1)g(x_2)\} - E\{f(x_1)\}E\{g(x_2)\} = 0$ (3.9)
As noted in the section on PCA, in the special case where x1 and x2 are random variables
with a Gaussian distribution then decorrelation and independence are equivalent as the
Gaussian distribution does not contain information in orders higher than second. It is this
fact that results in the requirement that the independent variables must be non-Gaussian
for ICA to work. Musical signals fit this criterion, being non-Gaussian in nature as
mentioned in the previous section.
ICA seeks to find an unmixing matrix W such that the resulting matrix y has
component pdfs that are factorisable in the manner shown in equation 3.8. It is possible to
obtain such an unmixing matrix subject to two indeterminacies: the source signals cannot
be recovered in their original order, and they cannot be recovered at their original
amplitudes. The amplitudes cannot be recovered because, as both A and s are unknown,
multiplying any of the sources si by a scalar could always be cancelled by dividing the
corresponding column of A by the same scalar.
Several different criteria have been proposed as a basis for obtaining objective functions
for ICA. These include mutual information, negentropy, maximum likelihood estimation
and information maximisation (‘Infomax’). It has been demonstrated that all these criteria
can be viewed as variations on the theme of minimising the mutual information between
output components [Cardoso 97], [Hyvärinen 99a], [Lee 00], and this equivalence is
summarised below.
Mutual information is a measure of the interdependence of random variables. It is
always non-negative and is zero if and only if the random variables are independent.
Defining mutual information between a set of m random variables on the basis of
differential entropy gives:
$I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(\mathbf{y})$ (3.11)
where I(y1,y2,…,ym) is the mutual information and where the differential entropy H(y) is
defined as:
$H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y}$ (3.12)
Mutual information can also be defined using the Kullback-Leibler distance between the
joint probability density function p(y) and the product of the pdfs of its components:
$\delta\left(p(\mathbf{y}), \prod_{i=1}^{N} p(y_i)\right) = \int p(\mathbf{y}) \log \frac{p(\mathbf{y})}{\prod_{i=1}^{N} p(y_i)} \, d\mathbf{y}$ (3.15)
If a model pdf is assumed to approximate the unknown pdfs of the outputs yi, then
minimising mutual information can be viewed as minimising the distance between the
observed output y and the estimated pdfs of the outputs. An alternative way of looking at
this is that the likelihood that the output y was generated from the model pdfs associated
with the outputs yi has been maximised. In other words, minimising mutual information is
equivalent to maximum likelihood estimation when model pdfs have been assumed for
the outputs.
Another useful quantity related to mutual information is negentropy. This
quantity makes use of the fact that a Gaussian variable has the largest entropy among all
random variables of equal variance. Negentropy is always non-negative and is zero only
for a Gaussian variable. Negentropy is defined as:
$J(\mathbf{y}) = H(\mathbf{y}_{gauss}) - H(\mathbf{y})$ (3.16)
The relationship between mutual information and negentropy is given by:
$I(\mathbf{y}) = J(\mathbf{y}) - \sum_{i=1}^{N} J(y_i) + \frac{1}{2} \log \frac{\prod_{i} V_{ii}}{|\mathbf{V}|}$ (3.17)
where V is the covariance matrix of y [Comon 94]. If the yi are uncorrelated the third term becomes
zero. Comon shows that J(y) equals J(x), as differential entropy is invariant under an
orthogonal change of coordinates, and it is therefore a constant which can be ignored
[Comon 94]. From this it can be seen that $I(\mathbf{y}) \approx -\sum_{i=1}^{N} J(y_i)$. Therefore, minimising
mutual information is equivalent to maximising the sum of the negentropies of the outputs.
From equation 3.11, the output entropy can be written as $H(\mathbf{y}) = \sum_{i} H(y_i) - I(y_1, \ldots, y_m)$.
It can then be clearly seen that maximising the output entropy involves maximising the
marginal entropies H(yi) and minimising the mutual information I(y1,…,ym). As mutual
information is always non-negative, and H(y) is constant for the data set in question then
maximising the output entropy will minimise the mutual information, with the output
entropy at a maximum when the mutual information is zero [Lee 00]. Thus, carrying out
information maximisation is equivalent to minimising mutual information.
Having shown that all these methods for obtaining objective functions to estimate ICA
equate to minimising mutual information, there still remains the problem of defining
the objective functions themselves to create algorithms that enable ICA. The principal
problem lies in estimating the mutual information itself. It is difficult to
estimate this quantity directly, requiring as it does an estimate of the pdfs of the
independent components, and various approximations to mutual information have been
proposed. There are two main approaches to approximating mutual information. The first
is to assume that the pdfs of the independent components can be approximated by some
suitable function. This function can then be used to estimate the mutual information. This
is the approach taken by Bell and Sejnowski and Hyvärinen [Bell 95], [Hyvärinen 99].
The second approach is to approximate mutual information by some properties of the
data itself such as its cumulants. This is the approach taken by Comon in [Comon 94]. A
further advance on this was the incorporation of kernel estimation methods for the pdfs
by Bach [Bach 02].
It is important to note that in many cases these various algorithms and approaches
will arrive at more or less the same solution. The various approaches used have been
shown to be essentially the same, and it has been observed that the source separation
ability of the algorithms is not particularly sensitive to the approximations to the pdfs
used. In other words, the algorithms can tolerate a fair amount of mismatch between the
assumed pdfs and the actual pdfs and still achieve good separation [Cardoso 97]. As a
result, many of the algorithms achieve essentially the same results in most cases, while
each algorithm outperforms the others in its own niche, which tends to lie at the
extremes of the algorithms’ operating ranges.
The Infomax algorithm, derived by Bell and Sejnowski [Bell 95], uses the concept of
maximising the information or entropy of the output matrix as a means to obtain the
unmixing matrix W. The joint distribution of the output y is given by Y = {Y1,…,YN} =
{σ1(y1),…, σN(yN)}, where yi = Wxi, and σ(yi) is the cumulative density function (cdf) of
the output signal. In matrix formulation this becomes Y = σ(Wx). The entropy of a given
signal y with cdf σ(y) is given by:
$H(y) = -E[\ln \sigma'(y)] = -\int \sigma'(y) \log \sigma'(y) \, dy$ (3.19)
where σ ' ( y ) is the derivative of σ(y). The derivative of σ(y) is the probability density
function of y and so equation 3.19 is equivalent to equation 3.12.
The output entropy H(Y) is related to the input entropy H(x) by:
$H(\mathbf{Y}) = H(\mathbf{x}) + E[\ln |J|]$ (3.20)
where |J| is the determinant of the Jacobian matrix J = ∂Y/∂x. The input entropy H(x) is a
constant and can be ignored in the maximisation problem. Using the chain rule, |J| can be
evaluated as:
$|J| = \left( \prod_{i=1}^{N} \sigma_i'(y_i) \right) |\mathbf{W}|$ (3.21)
Ignoring H(x) and substituting for |J| gives a new function that differs from H(Y) by a
constant equal to H(x):
$h(\mathbf{W}) = E\left[\sum_{i=1}^{N} \log \sigma_i'(y_i)\right] + \log |\mathbf{W}|$ (3.22)
The expectation term can be estimated from the data in the signal as follows:
$E\left[\sum_{i=1}^{N} \log \sigma_i'(y_i)\right] \approx \frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{N} \log \sigma_i'\left(y_i^{(j)}\right)$ (3.23)
Substituting 3.23 into 3.22 then gives:
h(W ) =
1 n N
∑∑
n j =1 i =1
( )
log σ i′ yi( j ) + log W (3.24)
Taking the derivative with respect to W yields the following update equation:
$\Delta \mathbf{W} \propto \left[\mathbf{W}^T\right]^{-1} + \frac{\delta \sigma'}{\delta y} \mathbf{x}^T$ (3.25)
Two commonly used functions for approximating the cdfs of super-Gaussian signals are
the ‘logistic’ function:
$\sigma(y) = \frac{1}{1 + e^{-y}} \quad \text{giving} \quad \frac{\delta \sigma'}{\delta y} = (1 - 2y)$ (3.26)
and the hyperbolic tangent function:
$\sigma(y) = \tanh(y) \quad \text{giving} \quad \frac{\delta \sigma'}{\delta y} = -2y$ (3.27)
If, however, a reliable estimate of the pdfs of the sources is known then the relevant pdf
can be used to improve separation.
The convergence of the Infomax algorithm was found to improve by using the
natural gradient as proposed by Amari [Amari 98]. This amounts to multiplying the right
hand side of the update equation by WTW to give:
$\Delta \mathbf{W} \propto \left( \mathbf{I} + \frac{\delta \sigma'}{\delta y} \mathbf{y}^T \right) \mathbf{W}$ (3.28)
where I is the identity matrix. This update equation converges faster, eliminating the need
for carrying out a matrix inversion at each iteration.
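A minimal sketch of the natural gradient update of equation 3.28 for the tanh nonlinearity is given below; the batch averaging, learning rate, iteration count and initialisation are illustrative choices rather than part of the original derivation.
```python
import numpy as np

def infomax_ica(x, n_iter=2000, lr=0.01, seed=0):
    # x: zero-mean mixture signals, shape (n_sources, n_samples)
    n, m = x.shape
    W = np.eye(n) + 0.01 * np.random.default_rng(seed).standard_normal((n, n))
    for _ in range(n_iter):
        y = W @ x
        g = -2.0 * np.tanh(y)   # delta sigma'/delta y for sigma = tanh (eq. 3.27)
        # natural gradient update (eq. 3.28), averaged over the batch
        W += lr * (np.eye(n) + (g @ y.T) / m) @ W
    return W
```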
The approach to ICA as formulated by Comon [Comon 94] uses mutual information as a
basis for ICA. Mutual Information for this algorithm was formulated in terms of
negentropy. As previously noted in equation 3.17, the relationship between mutual
information and negentropy is given by:
$I(\mathbf{y}) = J(\mathbf{y}) - \sum_{i=1}^{N} J(y_i) + \frac{1}{2} \log \frac{\prod_{i} V_{ii}}{|\mathbf{V}|}$ (3.29)
where V is the covariance matrix of y. If the yi are uncorrelated, the third term reduces to zero, and
as differential entropy is invariant under an orthogonal change of coordinates,
minimising I(y) equates to minimising $-\sum_{i=1}^{N} J(y_i)$. As the pdfs of the sources are
unknown, Comon approximates the pdfs using Edgeworth expansions of the data. The
Edgeworth expansion [Kendall 87] of the pdf of a vector y up to terms of order 4 about its
best Gaussian approximation (assumed here to have zero mean and unit variance) is
given by :
$\frac{p(y)}{\phi(y)} = 1 + \frac{1}{3!}k_3 h_3(y) + \frac{1}{4!}k_4 h_4(y) + \frac{10}{6!}k_3^2 h_6(y) + \frac{1}{5!}k_5 h_5(y) + \frac{35}{7!}k_3 k_4 h_7(y) + \frac{280}{9!}k_3^3 h_9(y) + \frac{1}{6!}k_6 h_6(y) + \frac{56}{8!}k_3 k_5 h_8(y) + \frac{35}{8!}k_4^2 h_8(y) + \frac{2100}{10!}k_3^2 k_4 h_{10}(y) + \frac{15400}{12!}k_3^4 h_{12}(y) + o\left(m^{-2}\right)$
where p(y) is the pdf of y, ki is the ith order cumulant of y, and hi(y) is the Hermite
polynomial of degree i. The Hermite polynomials are defined in a recursive manner as
follows:
$h_0(y) = 1, \quad h_1(y) = y, \quad h_{k+1}(y) = y\,h_k(y) - \frac{d}{dy} h_k(y)$ (3.30)
The Edgeworth expansion was then used to derive an approximation to mutual
information of a vector:
$I(\mathbf{y}) \approx J(\mathbf{y}) - \frac{1}{48} \sum_{i=1}^{N} \left\{ 4k_3^2 + k_4^2 + 7k_3^4 - 6k_3^2 k_4 \right\}$ (3.31)
If the pdfs of the sources are assumed to be symmetric, the odd-order cumulants vanish
and the function to be minimised reduces to $-\sum_{i=1}^{N} k_4^2$. This is the function used by
Comon in his ICA algorithm [Comon 94]. This function optimises using the fourth order
cumulant only, and this cumulant is
commonly referred to as kurtosis.
It should be noted that the Edgeworth expansion is only valid if the source pdf is
close to Gaussian, and will lead to poor approximation if this is not the case. Also, it has
been found that kurtosis is very sensitive to outliers in the data [Huber 85]. Its value may
depend on a small number of values in the tail of the pdf. These values may be erroneous,
and as a result, optimisation using kurtosis may not be robust in some circumstances.
The FastICA algorithm is a fixed-point algorithm for carrying out ICA [Hyvärinen 99]. It
is based on the use of negentropy as a cost function. As stated previously, negentropy is
defined as:
$J(y) = H(y_{gauss}) - H(y)$ (3.32)
As the pdfs of the sources are unknown, the negentropy is approximated using the expectation of a non-quadratic contrast function G:
$J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2$ (3.33)
where ν is a standardised Gaussian variable [Hyvärinen 99]. Two commonly used choices of G are:
$G_1(y) = \frac{1}{a_1} \log \cosh(a_1 y), \qquad G_2(y) = -\exp\left(-a_2 y^2 / 2\right)$ (3.34)
The fixed-point iteration proceeds broadly as follows [Hyvärinen 99]:
1. Choose an initial (e.g. random) weight vector w.
2. Update w according to $\mathbf{w} \leftarrow E\{\mathbf{x}\, g(\mathbf{w}^T \mathbf{x})\} - E\{g'(\mathbf{w}^T \mathbf{x})\}\mathbf{w}$, where g is the derivative of the contrast function G.
3. Normalise w to unit length.
4. Decorrelate the outputs to prevent the vectors from converging to the same maxima.
Before running the algorithm the mixture signals are first whitened; in other
words, the signals are decorrelated and their variances normalised to unity. This has the
effect of reducing the number of parameters to be estimated in the ICA.
FastICA, as its name suggests, runs faster than other ICA algorithms when
dealing with large batches of data without any compromises in the performance of the
algorithm. Indeed, in some circumstances it is more robust than the Infomax and kurtosis
based approaches described above.
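In practice FastICA implementations are readily available; the following sketch uses the scikit-learn implementation on a toy two-source mixture, noting that the recovered sources appear in arbitrary order and scale. The toy signals and mixing matrix are purely illustrative.
```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s = np.c_[np.sin(2 * np.pi * 5 * t),            # smooth source
          np.sign(np.sin(2 * np.pi * 3 * t))]   # square-wave source
A = np.array([[1.0, 0.5],
              [0.6, 1.0]])                      # mixing matrix
x = s @ A.T                                     # observed mixtures

ica = FastICA(n_components=2, fun='logcosh', random_state=0)  # G1 contrast
y = ica.fit_transform(x)   # estimated sources, up to order and scale
W = ica.components_        # estimated unmixing matrix
```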
Having discussed ICA in detail, along with its limitations and methods for the creation of ICA
algorithms, applications of ICA will now be discussed. While ICA has found many and
varied uses in fields such as analysing EEG data and image analysis, it is its application
to problems in audio that is of concern in this case.
In the field of audio, its main application to date has been in unmixing sound
sources recorded in a single room on a number of microphones. As an example, consider
the case where there are a number of musicians playing in a room. Each musician can be
considered a source of sound. To record the musicians playing, a number of microphones
are positioned around the room. Each of the microphones will record different mixtures
of the sources, depending on where they are positioned in the room. We have no prior
information regarding the instruments the musicians are playing, or indeed about what
they are playing. Each of the sound sources can be considered to be independent and non-
Gaussian, and so the ICA model can be applied in an attempt to obtain the clean unmixed
sources.
However, the standard ICA model can be at best considered a simplification of
the real world situation. The microphones will also record reverberations of the other
sources from the walls of the surrounding room, in effect creating convolved mixtures of
the sounds instead of the linear mixing assumed by ICA. Partial solutions to this have
been proposed by Smaragdis and Westner [Smaragadis 97], [Westner 99]. However,
much work remains to be done before robust solutions to this problem can be arrived at.
Another major limitation on the use of ICA in audio remains the fact that, in
general, most ICA algorithms require the use of as many sensors as there are sources. In
many cases, such as when attempting to transcribe drums from a mixture signal, there
will usually be only one or two channels available, depending on whether the recording is
in mono or stereo. Work has been carried out attempting to deal with cases where there
are more sources than sensors, and algorithms have been proposed that deal with
obtaining three sources from two sensors [DeLathauwer 99], but further research is
needed in this area.
With regards to work on drum sounds using ICA, Riskedal has used ICA to
unmix mixtures of two drums where two channels are available [Riskedal 02]. This
amounts to no more than an application of the standard ICA model in a two source, two
sensor model. Unfortunately, this approach to separating drum sounds suffers from the
sensors to sources limitation of standard ICA, and no attempt was made to extend this
approach to deal with more sources than sensors.
Independent Subspace Analysis (ISA) was first proposed by Casey and Westner as a
means of sound source separation from single channel mixtures [Casey 00]. It is the only
technique presented in this chapter which was specifically created to deal with audio.
However, the techniques it uses are generally applicable to other problems, and indeed
similar ideas have been used previously in image research. ISA is based on the concept of
reducing redundancy in time-frequency representations of signals, and represents sound
sources as low dimensional subspaces in the time-frequency plane. ISA takes advantage
of the dimensional reduction properties of PCA and the statistical independence
achievable using ICA by using PCA to reduce a spectrogram to its most important
components, and then using ICA to make these components independent.
ISA arose out of Casey’s work in trying to create a signal representation capable
of characterising everyday sounds, such as a coin hitting the floor, of the kind used as
sound effects for film, TV and video games. Casey wanted to find invariants that
characterised each sound, to allow their identification and further manipulation [Casey
98]. The method used involves carrying out PCA followed by ICA on a time-frequency
representation of the sound. This recovered a set of independent features for each sound
which could then be used to identify the sound and to allow further manipulation of the
sound. This method has since been incorporated into the MPEG 7 specification for
classification of individual sounds [Casey 02]. ISA emerged as an extension of this
technique to the problem of multiple sound sources playing simultaneously.
ISA makes a number of assumptions about the nature of the signal and the sound
sources present in the signal. The first of these is that the single channel sound mixture
signal is assumed to be a sum of p unknown independent sources,
$s(t) = \sum_{q=1}^{p} s_q(t)$ (3.35)
Carrying out a Short-Time Fourier Transform (STFT) on the signal and using the
magnitudes of the coefficients obtained yields a spectrogram of the signal, Y, of
dimension n × m, where n is the number of frequency channels, and m is the number of
time slices. From this it can be seen that each column of Y contains a vector which
represents the frequency spectrum at time j, with 1 ≤ j ≤ m. Similarly, each row can be
seen as the evolution of frequency channel k over time, with 1 ≤ k ≤ n. The motivation for
using a magnitude spectrogram is that the system is trying to capture perceptually salient
information, and this information is not observable when using the complex valued
STFT.
It is assumed that the overall spectrogram Y results from the superposition of l
unknown independent spectrograms Yj. Further, it is assumed that the superposition of
spectrograms is a linear operation in the time-frequency plane. While this is only true if
the underlying spectrograms do not overlap in time and frequency, it is still a useful
approximation in many cases. This yields:
$\mathbf{Y} = \sum_{j=1}^{l} \mathbf{Y}_j$ (3.36)
It is then assumed that each of the Yj can be uniquely represented by the outer
product of an invariant frequency basis function fj, and a corresponding invariant
amplitude envelope or weighting function tj which describes the variations in amplitude
of the frequency basis function over time. This yields
$\mathbf{Y} = \sum_{j=1}^{l} \mathbf{f}_j \mathbf{t}_j^T$ (3.37)
Figure 3.6. Spectrogram of a snare and plots of its associated basis functions
The independent basis functions correspond to features of the independent
sources, and each source is composed of a number of these independent basis functions.
The basis functions that compose a sound source form a low-dimensional subspace that
represents the source. Once these have been identified, the independent sources can be re-
synthesised if required. As Casey has noted, “the utility of this method is greatest when
the independent basis functions correspond to individual sources in a mixture” [Casey
00]. In other words, ISA works best when each individual component corresponds to a
single source.
To decompose the spectrogram in the manner described above, PCA is performed
on the spectrogram Y. This is carried out using the SVD method and yields:
$\mathbf{Y} = \mathbf{U}\mathbf{S}\mathbf{V}^T$ (3.39)
as described in section 3.1. As the sound sources are assumed to be low-dimensional
subspaces in the time-frequency plane, dimensional reduction is carried out by discarding
components of low variance. If the first l principal components are retained then equation
3.39 can be rewritten as:
$\mathbf{Y} \approx \sum_{j=1}^{l} \mathbf{u}_j s_j \mathbf{v}_j^T$ (3.40)
By letting ujsj equal hj and vj equal zj it can be seen that the spectrogram has been
decomposed into a sum of outer products as described in equation 3.37. In matrix
notation this becomes
$\mathbf{Y} \approx \mathbf{h}\mathbf{z}^T$ (3.41)
However, PCA does not return a set of statistically independent basis functions. It
only returns uncorrelated basis functions and so the spectrograms obtained from the PCA
basis functions will not be independent. To achieve the necessary statistical
independence, ICA is carried out on the l components retained from the PCA step. This
independence can be obtained on a frequency basis or a time basis. For the remainder of
this derivation, it will be assumed that independence in frequency is required.
Independence in time of the spectrograms can be derived in a similar manner. Carrying
out independent component analysis on h to obtain basis functions independent in
frequency yields:
$\mathbf{f} = \mathbf{W}\mathbf{h}$ (3.42)
where f contains the independent frequency basis functions and W is the unmixing
matrix. At this point the associated amplitude basis functions can be obtained by
multiplying the spectrogram Y by the pseudo-inverse, fp, of the frequency basis functions
f. This yields:
$\mathbf{t} = \mathbf{f}_p \mathbf{Y}$ (3.43)
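Pulling the above steps together, the following sketch outlines the core ISA decomposition, assuming the number of components l is known in advance; FastICA is used here as the ICA stage, and all names are illustrative.
```python
import numpy as np
from sklearn.decomposition import FastICA

def isa(Y, l):
    # Y: magnitude spectrogram, n_freq x m_time
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)   # PCA step (eq. 3.39)
    h = U[:, :l] * S[:l]                               # retained components u_j s_j
    f = FastICA(n_components=l, random_state=0).fit_transform(h)  # eq. 3.42
    t = np.linalg.pinv(f) @ Y                          # amplitude envelopes (eq. 3.43)
    return f, t   # independent frequency basis functions and their envelopes
```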
Each source spectrogram can then be reconstructed as the outer product of its frequency
and amplitude basis functions, $\mathbf{Y}_j = \mathbf{f}_j \mathbf{t}_j^T$ (3.44), and the time-domain waveform of the
source re-synthesised using the method of spectrogram inversion described by Griffin and Lim [Griffin 84], or its extension by
Slaney [Slaney 96].
The algorithm proposed by Griffin and Lim attempts to find an STFT whose
spectrogram is closest to the specified spectrogram in a least squares sense. The
algorithm proceeds by taking any given set of phase information and applying it to the
magnitude spectrogram. Each frame is then inverted using an IFFT, and weighted by an
error minimising window as shown below:
$x(n) = \sum_{m=-\infty}^{\infty} y(mH - n)\, w_s(mH - n) \Big/ \sum_{m=-\infty}^{\infty} w_s^2(mH - n)$ (3.45)
where y is the IFFT of the current frame being inverted and ws is an error-minimising
window given by:
$w_s(n) = 2\sqrt{\frac{H}{L}} \cdot \frac{a + b\cos\left(\frac{2\pi n}{L} + \frac{\pi}{L}\right)}{\sqrt{4a^2 + 2b^2}}$ (3.46)
where H and L are the STFT hop size and window length respectively, and a = 0.54 and
b = −0.46 are the Hamming window coefficients.
An STFT is then taken of the resultant signal. The phase information obtained
from this STFT is then applied to the original magnitude spectrogram and the
spectrogram is inverted again using the error minimising window. This process then
proceeds iteratively until the total magnitude error between the desired magnitude
spectrogram and the estimated magnitude spectrogram falls below a set threshold. This
approach does not guarantee that a magnitude spectrogram inverted using this algorithm
will have the same waveform as the original waveform, merely that the spectral error is
minimised at each step. Convergence can be improved by using an appropriate choice for
the initial phase estimates. In this case, the phase of the original overall spectrogram is a
good starting point.
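A minimal sketch of this iteration, using SciPy's STFT routines and initialising from the phase of the original mixture as suggested above, is given below; the window and iteration parameters are illustrative, and the framing is assumed consistent between the forward and inverse transforms.
```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, init_phase, n_iter=50, nperseg=1024):
    # mag: target magnitude spectrogram; init_phase: initial phase estimate of
    # the same shape, e.g. the phase of the original mixture's STFT
    S = mag * np.exp(1j * init_phase)
    for _ in range(n_iter):
        _, x = istft(S, nperseg=nperseg)           # invert with current phase
        _, _, S_est = stft(x, nperseg=nperseg)
        S = mag * np.exp(1j * np.angle(S_est))     # keep target magnitude, new phase
    return istft(S, nperseg=nperseg)[1]
```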
Slaney improved upon this approach by adding an idea from the SOLA time-
stretching algorithm, namely using cross-correlation to find the best time delay to overlap
and add a new window of data to the data already calculated [Roucos 85]. This further
improved the speed of convergence of the algorithm.
Having shown how ISA attempts to decompose a spectrogram, there still remains
an issue of importance to deal with, namely estimating how much information to retain
from the dimensional reduction step. This is of vital importance in obtaining the optimal
sound source separation. Keeping too few components may result in the incorrect
separation of the sources, while keeping too many components can result in features
which cannot be identified as belonging to a given source. This is discussed in greater
detail below.
The amount of information contained in a given number of basis functions can be
estimated from the normalised cumulative sum of the singular values obtained when
carrying out the SVD. A threshold can then be set for the amount of information to be
retained, and the following inequality can be used to solve for the number of basis
functions required:
$\frac{\sum_{i=1}^{\rho} \sigma_i}{\sum_{i=1}^{n} \sigma_i} \geq \phi$ (3.47)
where σi is the singular value of the ith basis function, φ is the threshold, ρ is the required
number of basis functions and n is the number of variates.
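In code, the required number of basis functions for a given threshold can be read off the normalised cumulative sum of the singular values, as in this short illustrative sketch:
```python
import numpy as np

def n_basis_functions(singular_values, phi):
    # singular_values: in descending order, as returned by np.linalg.svd
    frac = np.cumsum(singular_values) / np.sum(singular_values)
    return int(np.searchsorted(frac, phi) + 1)   # smallest rho satisfying eq. 3.47
```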
There is a trade-off between the amount of information to retain and the
recognisability of the resulting features. Setting φ = 1 results in a large set of basis
functions, each of which supports only a small region of the frequency range. When
φ << 1, the basis functions are recognisable spectral features with support across the
entire frequency range.
As an example of the application of ISA to a single channel mixture of sources,
consider the following spectrogram (Figure 3.7 – see Appendix 2 for audio) of an audio
excerpt taken from a commercially available CD. The excerpt consists of hi-hat, snare,
and piano, with the piano playing the same chord throughout the excerpt, so that the
stationary pitch assumption is not violated. The piano can be seen as the horizontal lines
visible in the lower part of the spectrogram, the hi-hats as the events with a broad
resonance in the upper part of the spectrogram, and the snare shows up as a noise burst
across the entire frequency spectrum. The spectrogram was passed through the ISA
algorithm, with three basis functions being retained from the PCA step. The time basis
vectors and frequency basis vectors are shown in Figures 3.8 and 3.9.
As can be seen from Figure 3.8 (see Appendix 2 for resynthesised sources) below,
the amplitude envelopes of each of the sources have been successfully captured by the time
basis functions. The first time basis function corresponds to the snare, with some very small
peaks corresponding to some hi-hat information. The second function clearly captures the
four piano chords played in the excerpt. However, there is some noticeable jitter in the
piano event which coincides with the snare event. This appears to be a result of the snare
masking some of the piano information, making it difficult to track the piano source at
that moment in time. The third basis function captures the five hi-hat events. It is worth
noting the wide variation in amplitude between the hi-hat events; such variation is quite
common in drumming and does much to add the “groove” to the drum patterns played. It
should be noted that the basis functions in Figure 3.8 have been shifted and normalised to
ensure that there are no negative values in the basis functions. The basis functions were
shifted because the negative values, which occur as a result of ICA, are physically implausible.
Some residual traces of the other sources can be seen in the re-synthesised snare
signal. Nevertheless, despite some overlap between the sources, ISA
has made a good effort to separate the sources blindly without any prior knowledge of the
signal in question. This indicates its usefulness as a tool for sound source separation.
In the extension of ISA to deal with sources that are non-stationary in pitch, the
spectrogram is divided into sections or blocks, ISA is carried out on each block, and the
recovered subspaces are then clustered into groups corresponding to the individual
sources. A common approach to clustering is central clustering, which minimises a cost
function of the form:
$D = \frac{1}{n} \sum_{i=1}^{n} \sum_{v=1}^{K} M_{iv} D(z_i, y_v)$ (3.48)
where D is the average distortion error between a data vector zi and its corresponding
centroid yv, and M is an assignment matrix where Miv = 1 if the vector zi is assigned to
centroid vector yv, and Miv = 0 otherwise. The most commonly used distortion measure is
the squared Euclidean distance:
$D(z_i, y_v) = \| z_i - y_v \|^2$ (3.49)
However, in the case of ISA, the vectors recovered deal with different aspects of the
sources, be it in frequency or in time, and as a result these vectors contain important
information at different points in each vector, making clustering based on Euclidean
distances impractical in this case. This can be seen by observing the vectors recovered
from a drum loop in Figures 3.10 and 3.11. Casey and Westner avoid this by making use
of a probabilistic distance metric as discussed below.
Another method of clustering vectors is to use pairwise clustering. In this case the
data is clustered according to the dissimilarity between pairs of vectors. A dissimilarity
matrix is calculated over all pairs of vectors using a suitable distortion measure. This is
the approach taken in the clustering algorithm employed by Casey. This algorithm was
proposed by Hofmann in [Hofmann 97] and was shown to be good at unsupervised image
segmentation.
The distortion measure used is the symmetric Kullback-Leibler distance. The
Kullback-Leibler distance is a measure of the distance between two pdfs, p(z) and q(z),
where z is a random variable or vector. It is given by:
$\delta(z_i, z_j) = \frac{1}{2} \int p(z_i) \log \frac{p(z_i)}{p(z_j)} \, dz + \frac{1}{2} \int p(z_j) \log \frac{p(z_j)}{p(z_i)} \, dz$ (3.50)
In the case of ISA, the random variables or vectors are the recovered independent
subspaces from each of the blocks. The pdfs of the subspaces can be estimated using a
method such as an Edgeworth expansion [Kendall 87]. Casey then calculates the pairwise
Kullback-Leibler distance between all pdfs of the subspaces. The resulting matrix of
distances, termed an ixegram, has the following structure:
$\mathbf{D} = \begin{bmatrix} \delta(z_1, z_1) & \delta(z_1, z_2) & \cdots & \delta(z_1, z_n) \\ \delta(z_2, z_1) & \delta(z_2, z_2) & \cdots & \delta(z_2, z_n) \\ \vdots & \vdots & \ddots & \vdots \\ \delta(z_n, z_1) & \delta(z_n, z_2) & \cdots & \delta(z_n, z_n) \end{bmatrix}$ (3.51)
The ixegram is a square symmetric matrix with all the diagonal elements equal to zero.
The cost function used by Hofmann to cluster the groups is:
$H(\mathbf{M}; \mathbf{D}) = \sum_{c=1}^{K} \frac{1}{\sum_{j=1}^{n} M_{jc}} \sum_{i=1}^{n} \sum_{k=1}^{n} M_{ic} M_{kc} D_{ik}$ (3.52)
Clustering can be carried out directly using the cost function as given above, by
using a deterministic annealing algorithm as described by Hofmann. The clustering yields
groups of subspaces that can be combined to recover estimates of the non-stationary
sources.
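The construction of the ixegram can be sketched as follows. Note that for illustration the pdfs here are crudely estimated with histograms rather than the Edgeworth expansions used by Casey, and all names are illustrative.
```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    # symmetric Kullback-Leibler distance between two normalised histograms
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return 0.5 * np.sum(p * np.log(p / q)) + 0.5 * np.sum(q * np.log(q / p))

def ixegram(components, bins=50):
    # components: list of recovered subspace vectors
    lo = min(c.min() for c in components)
    hi = max(c.max() for c in components)
    pdfs = [np.histogram(c, bins=bins, range=(lo, hi))[0].astype(float)
            for c in components]
    n = len(pdfs)
    D = np.zeros((n, n))   # square, symmetric, zero diagonal (eq. 3.51)
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = symmetric_kl(pdfs[i], pdfs[j])
    return D
```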
The extension of ISA to deal with sources that are non-stationary in pitch
enhances the usefulness of ISA and allows its use in more general circumstances.
However, when we carried out tests using non-stationary ISA, it was found that there
were a number of problems with the extended method, and that, when dealing with drum
sounds only, standard ISA often provides better results. These problems with non-
stationary ISA are discussed in the following section, alongside other problems with the
ISA model.
Though ISA does provide an effective means of separating sound mixtures, it should be
noted that there are still several problems with the method. While the combination of
PCA and ICA to perform ISA makes use of the properties of each method to perform a
separation not achievable using either method in isolation, ISA does retain some of the
problems associated with each method, such as ICA’s indeterminacy with regards to
source ordering and scaling. It will also be shown below that the variance-based nature of
PCA inherently biases the analysis towards sources of high amplitude, which can make it
difficult to recover sources of low amplitude, which in the case of drums would typically
be the hi-hats and cymbals. It should also be noted that the separation achieved, while
good, is not perfect. In particular, when dealing with drum sounds which are, as already
noted, broadband noise-based instruments, there will be regions of overlap between the
sounds, and as a result sometimes other drums show up as small peaks in the amplitude
envelopes of the separated drums.
In testing the ISA method, it was found that the number of basis functions
required to separate the drums varied depending on the frequency characteristics and
relative amplitudes of the drums. For input signals containing mixtures of three drums the
number of basis functions required varied from three to six, and using a fixed threshold
method as described above did not always result in the correct separation of the test signals.
indeterminacy is as a result of the variance-based nature of PCA, which inherently biases
the analysis towards the loudest sounds in the spectrogram. In particular, sounds with low
amplitude relative to other sources in the spectrogram will require a larger number of
components to be detected using PCA. In the case of drum loops, this means that snare
and kick drum will usually be picked up in the first two principal components. On the
other hand, hi-hats or ride cymbals which often have low amplitudes relative to the snare
and kick drums, can sometimes require up to six components to be retained before they
can be detected. As the relative amplitudes of the sources will vary from signal to signal,
different numbers of components can be required, depending on circumstances. This
makes setting a fixed threshold difficult. This indeterminacy affects the robustness of any
drum transcription system using ISA.
The problem of estimating the required information is illustrated in Figures 3.10
and 3.11. The figures show the amplitude envelopes obtained from performing ISA on a
drum loop containing snare, kick drum and hi-hats. Figure 3.10 shows the result obtained
from keeping 4 basis functions, and Figure 3.11 shows the result obtained from keeping 5
basis functions. As can be seen, retaining an extra basis function allows the separation of
the hi-hats. However, it can also be seen that the actual hi-hat events are less clearly
identified than those of the snare and kick drum. As a consequence of this indeterminacy,
the presence of an observer is required to identify the correct number of basis functions
required for separation of the drums.
Keeping a larger number of components and clustering the components using the
clustering algorithm as implemented by Casey was found not to significantly improve the
robustness of the detection of the sources. This was found to be mainly because the
clustering algorithm assumes that independent components with similar pdfs belong to
the same source. However, in practice this assumption can be shown to result in
incorrect clustering in many situations. Figure 3.12 shows the first 15 independent time
components obtained from carrying out ISA on a 1 bar drum loop. The drum loop
contained kick drums, snares and hi-hats. The kick drum occurred on beats 1 and 3 with
the snare on beats 2 and 4. The hi-hats occurred every half beat.
It is clearly visible that there are 4 components related to the kick drum and 6
related to the snare, with some of the other components containing information related to
the hi-hats. However, the 8 hi-hat events are not all clearly visible in any single component,
though the hi-hat events that do not overlap with other drums are clearly visible in three of
the components. This again highlights the problem of the recovery of sources with low
relative amplitudes. It can be seen that the components for both snare and kick drum all
consist of two large peaks surrounded by low energy noise.
With regards to calculating the pdfs of these components, the low energy noise
dominates with the peak values representing outliers which will be found in the tails of
the pdfs. As a result the pdfs of these components will be very similar. This is borne out
in Figure 3.13, which shows the resulting ixegram: dark blue signifies pairs of
components with highly similar pdfs, while green, yellow, orange and brown signify
increasingly dissimilar pdfs. As a result, the clustering algorithm cannot
correctly classify the components to the correct sources. However, if there is a different
frequency of occurrence of kick drum and snare drum, the algorithm will stand a greater
chance of clustering the sources correctly as the pdf of the source that occurs more
frequently will now be less peaky than that of the other drum. Unfortunately, the drum
pattern used in the example shown in Figure 3.12 is a fairly common pattern in pop and
rock music drumming.
It can be seen that a large number of the components consist of a single large peak with
most of the spectrum containing very little energy. Again all these components will have
very similar pdfs which will tend to be clustered together. However, some of these
components belong to the snare drum and some belong to the kick drum and so the
clustering algorithm will fail to cluster the components to the correct sources.
The separation obtained also depends on the number of occurrences of each source
within the signal: the greater the number of events, the greater the likelihood that the
ICA algorithm will be able to find components that vary independently from one
another. In other words, the ICA algorithm needs enough evidence of independent
variation of the sources before it can begin to find independence correctly.
The method also has limitations on the number of sources it can recover, working
best on signals with fewer than five sources. This is a result of the trade-off between the
proportion of information retained during dimensional reduction and the recognisability
of the resulting features. An increasing number of sources will require an increasing
number of components for detection of all the sources present, especially if some of the
sources are at low relative amplitude to other sources present. However, increasing the
number of sources can result in the recovered features only supporting a small region in
the frequency spectrum leading to a lack of recognisability of the features. As a result, the
fewer the number of sources the better the method will work.
Another limitation of ISA is that, due to constraints inherent in ICA, it is not
possible to recover the signals in the order in which they came in. This means that the
resulting subspaces have to be identified as a given source by some means such as their
frequency characteristics or amplitude envelopes after ISA has been completed. It should
be noted that all algorithms involving the use of ICA suffer from this problem.
Further limitations are exposed when applying non-stationary ISA. On top of the
limitations inherent in the method used to cluster the sources, which have been previously
discussed, it was found that when segmenting the signal into smaller sections the
separation obtained varies with the type of events in each section. For instance,
performing ISA on a section containing only hi-hats and bass drum results in the recovery
of a different hi-hat subspace to that obtained from a section containing only hi-hats and
snare drum. This causes further problems in the clustering step, again resulting in the
incorrect clustering of the subspaces and as a result incorrect re-synthesis of the separated
sources. As a result, performing simple ISA on a drum loop often gave better results than
carrying out non-stationary ISA. This result suggests that non-stationary ISA will work
best when the same instruments are present at the same relative amplitudes in each
section of the overall signal. This imposes a further limitation on the usefulness of non-
stationary ISA. This result can also be seen as a consequence of not giving the ICA stage
of the algorithm enough information on which to obtain the separation. This is because
breaking the signal into smaller sections reduces the number of events to be separated
with consequences already discussed above in the context of standard ISA.
However, once these limitations are taken into account, ISA provides an effective
means of separating mixtures of drums and of overcoming the problems encountered by
Sillanpää et al. when trying to identify and transcribe mixtures of drum sounds [Sillanpää
00]. By using the statistics of the overall signal ISA presents a method of separating
sources by taking into account the properties of the signal as a whole, as opposed to
trying to model each event as a combination of drum sounds.
Another approach to linearly separating mixture signals into the underlying sources
makes use of the idea of sparse coding [Olshausen 96]. In sparse coding it is assumed that
only a few of the underlying sources will be active at any given instant. Therefore each
of the observed signals will be composed of only a few of the underlying sources. This
means that the pdfs of the sources will be highly peaked at zero and have heavy tails.
Pdfs of this nature are termed ‘sparse’ and are supergaussian by definition.
The sparse coding model is similar to that of ICA, but with the addition of an
error term:
x = As + e (3.54)
where x contains the observed data, A contains the mixing matrix, s contains the
underlying sources, and e is an error term. In other words, sparse coding does not attempt
to perfectly reconstruct the data, but attempts to find a set of sparse sources which can
approximately reconstruct the original data. The mixing matrix A is no longer
constrained to be square as is the case in most ICA algorithms, permitting the recovery of
as many sources as required. A frequently used cost function which combines the goal of
small reconstruction error with that of obtaining sparse sources is:
C(A, s) = (1/2)||x − As||² + λ∑ij G(sij) (3.55)
The parameter λ controls the tradeoff between sparseness and accurate reconstruction of the original data, and G is a function that defines how the sparseness of the sources is measured. Virtanen applied a sparse coding model of this form to musical spectrograms, adding a term to the cost function which encourages temporal coherence of the sources [Virtanen 03a]:

C(A, s) = (1/2)||x − As||² + λ∑ij G(Aij) + β∑ij T(Aij) (3.56)
In this case, Aij is the gain of the jth source in the ith frame of the spectrogram and β
controls the tradeoff between temporal coherence and the other parameters. T(Aij) is a
function which promotes temporal coherence of the sources and is defined as:
∑ij T(Aij) = (1/2)∑ij (a(i−1)j − aij)² (3.57)
In a further attempt to improve the perceptual relevance of the data presented to the sparse
coding algorithm, the input data is first weighted by the frequency response of the ear.
This procedure is then undone during the re-synthesis phase of the algorithm.
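The cost function can be illustrated with a short sketch. The following Python fragment is a minimal illustration of equations 3.55 to 3.57, assuming a magnitude spectrogram arranged with frames as rows, gains A, spectral basis functions B and G(a) = |a|; the variable names and parameter values are illustrative only and are not those of Virtanen's implementation.

import numpy as np

def sparse_coding_cost(X, A, B, lam=0.1, beta=0.1):
    # X: (frames x bins) magnitude spectrogram, A: (frames x sources)
    # gains a_ij, B: (sources x bins) spectral basis functions.
    recon = 0.5 * np.sum((X - A @ B) ** 2)        # reconstruction error
    sparseness = lam * np.sum(np.abs(A))          # lambda * sum G(a_ij), G(a) = |a|
    # temporal coherence term of equation 3.57
    coherence = beta * 0.5 * np.sum(np.diff(A, axis=0) ** 2)
    return recon + sparseness + coherence

A full sparse coding algorithm would then minimise this cost with respect to both A and B, for example by gradient descent.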
After the sparse coding algorithm has converged the end result is the
decomposition of an input spectrogram into a set of invariant frequency basis functions
and invariant time basis functions, in a manner similar to that of ISA. Where the algorithms differ is in how the decomposition of the spectrogram is achieved: ISA performs dimensional reduction and then seeks independence in the reduced data, whereas sparse coding achieves its dimensional reduction by balancing the need for accurate reconstruction of the original data against the sparseness of the recovered sources. Further, Virtanen's approach to sparse
coding also suffers from the limitation that the spectrogram has to be stationary in pitch.
Both methods also suffer from the problem of estimating the required number of basis
functions for separation of the sounds, and both methods have difficulty in dealing with
sources of low amplitude. Indeed, Virtanen's system is reported to have difficulty in separating drums such as hi-hats for just this reason.
The drum transcription system described attempted to transcribe snare and bass
drums in the presence of pitched instruments. Virtanen tested the algorithm by
synthesising single channel audio tracks from General Midi files. Half of the test set was
used for training the parameters of the sparse coding algorithm, and the generation of
templates to identify the snare and bass drums. The templates were obtained by
comparing the events detected in the separated sources with the actual snare or bass drum
events. The separated sources which gave the best matches to the snare and bass drum
events were selected. Templates for both snare and bass drum were then generated from
the average spectral characteristics of the selected sources. The remainder of the test set
was used to test the performance of the algorithm. Virtanen attempted to overcome the
problem of estimating the number of sources required for separation by obtaining a fixed
number of sources, and then searching for sources that matched well with the template
spectra of either the snare or the bass drum. The goodness of fit of a separated source to a
given drum was obtained from:
d(j, m) = ∑f=1..F Rm(f) log[(Sj(f) + ε) / (Rm(f) + ε)] (3.59)
where Rm(f) is the template spectrum for drum type m, Sj(f) is the separated spectrum of
the jth source, f is the frequency bin index, F is the number of frequency bins, and ε is a
small positive value used to make the log robust for small values of the spectrum.
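The measure of equation 3.59 can be computed directly, as in the following sketch; it assumes the template and separated spectra are stored as numpy vectors of equal length, and, since the text does not state whether the best match maximises or minimises the measure, the sketch assumes the former.

import numpy as np

def goodness_of_fit(S_j, R_m, eps=1e-6):
    # Equation 3.59: template-weighted log ratio between a separated
    # spectrum S_j(f) and the template R_m(f) for drum type m.
    return np.sum(R_m * np.log((S_j + eps) / (R_m + eps)))

def best_match(sources, template):
    # Index of the separated source that best fits the template
    # (assumed here to be the source maximising the measure).
    return max(range(len(sources)),
               key=lambda j: goodness_of_fit(sources[j], template))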
Once the sources related to the snare and bass drum have been identified, onset
detection is carried out on the amplitude envelopes of these sources. System performance
was evaluated using the following error rate measure:

z = (Nd + Ni) / (Nc + Nd + Ni) (3.60)

where Nd is the number of undetected events, Ni the number of incorrectly inserted events, and Nc the number of correctly detected events, computed for each of the transcribed drums. It should be noted that, as the signals were synthesised from a General Midi sound
set, only two snares and two bass drums were used. This means that the templates created
may not generalise well to snare and bass drum sounds which are not similar to those in
the General Midi sound set used. However, it should be pointed out that the drum
transcription experiments were designed only to be a demonstration of the source
separation abilities of the sparse coding system described as Virtanen had observed that
the method was capable of separating drum sounds from many real world audio excerpts [Virtanen 03a].
Although ISA as presented above represents a new approach to separating sound sources
from a single channel audio mixture, similar approaches have been in use in vision
research for a number of years. Of particular interest is the approach taken by Stone and Porrill in attempting to separate time-varying image sequences [Stone 99]. Each image was rearranged into a single vector, where each position represented a given pixel, and successive vectors described the evolution of the images in time. The images were composed of mixtures of a number of source images, each of which varied independently in amplitude over time. In other words, the data was arranged in a matrix format where one axis represented the spatial position within the image, and the other axis represented its evolution in time. This situation is analogous to the evolution of a mixture of sound
sources in time contained in a spectrogram, the only difference being that instead of
positions in frequency there are now positions in space.
ISA as formulated by Casey can achieve only independence in either time or
frequency, but not both simultaneously. This is because, when using singular value
decomposition (SVD) on a time-frequency spectrum to obtain bases for independent
component extraction, two sets of bases are obtained, one time based, the other frequency
based. If the time bases are used, then a set of mutually independent time signals is
obtained, and a corresponding set of unconstrained frequency signals is obtained.
Similarly, if the frequency bases are used, a set of mutually independent frequency
signals is obtained from ICA, with a corresponding (dual) set of unconstrained time
signals. The unconstrained nature of the dual signals allows them to sometimes take
physically unlikely shapes in order to satisfy the requirement that their corresponding
independent components are statistically independent. In other words, the independence
of the extracted signals can sometimes be obtained at the cost of physically improbable forms for the dual signals.
Stone and Porrill came up against a similar problem when trying to separate time
varying image sequences. As already noted, instead of the time-frequency representation
found in musical signals, they had a time-space representation of the image signals. In an
effort to overcome the limitations of standard ICA algorithms which can only deal with
one set of bases, leaving the dual set of signals unconstrained, they introduced the
concept of spatiotemporal ICA (stICA). Making the assumption that the images and their
respective time courses are statistically independent, both the images and time courses
are given equal importance. Instead of seeking independence in time or space, stICA
attempts to maximise the degree of independence over time and space without necessarily
producing independence in either space or time. Similarly, for musical signals it can be assumed that the times at which a given instrument is played depend on the player, and are independent of the frequency spectrum generated by the instrument.
Consider an m × n matrix containing a sequence of n images, X = (x1,…,xn). X can be linearly decomposed into k modes using a matrix factorisation X = SΛTᵀ, where S contains k image vectors, T contains k time vectors and Λ is a diagonal matrix of scaling parameters. The rank of the problem can be reduced using SVD, giving:

X̃ = USVᵀ = ŨṼᵀ (3.61)

where X̃ ≈ X, U is an m × k matrix of image vectors, V is an n × k matrix of time vectors, and S is a diagonal matrix of singular values, with Ũ defined as Ũ = US^(1/2) and Ṽ as Ṽ = VS^(1/2). Performing ICA on the image vectors in Ũ yields the decomposition S = ŨWs, where Ws is an unmixing matrix. Using the Infomax approach, Ws is obtained by maximising the entropy hs = H(Ys) of Ys = σ(S), where σ approximates the cumulative density function (cdf) of the spatial or image signals. The time vectors in Ṽ can be decomposed in a similar manner, T = ṼWT, by maximising the entropy hT = H(YT) of YT = τ(T), where τ approximates the cdf of the temporal signals.
Spatiotemporal ICA assumes that X̃ can be decomposed into a set of mutually independent image vectors and a set of mutually independent time vectors: X̃ = SΛTᵀ, where Λ is required to ensure that S and T have amplitudes appropriate to their cdfs. If X̃ = ŨṼᵀ = SΛTᵀ, then by substituting for Ũ and Ṽ it follows that:

Λ = Ws⁻¹(WT⁻¹)ᵀ (3.62)
Ws and WT can be found by maximising a function hST of the spatial and temporal
entropies of the extracted signals. The function hST is defined as :
hST = αhS + (1 − α)hT (3.63)
where α defines the relative weighting between spatial and temporal entropy.
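The rank reduction of equation 3.61 can be sketched in a few lines of Python; the dimensions, random data and choice of k below are purely illustrative, and the Infomax estimation of Ws and WT, which would complete a spatiotemporal ICA implementation, is omitted.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # e.g. 200 frequency channels x 500 frames
k = 5                                # number of retained modes
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_t = U[:, :k] * np.sqrt(s[:k])      # U~ = U S^(1/2): spatial/frequency factors
V_t = Vt[:k].T * np.sqrt(s[:k])      # V~ = V S^(1/2): temporal factors
X_t = U_t @ V_t.T                    # X~ = U~ V~^T, the rank-k approximation
print(np.linalg.norm(X - X_t))       # information discarded by the reduction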
Spatiotemporal ICA can equally be applied to a spectrogram, in an attempt to recover basis functions that are independent in frequency and time. The axis that represented spatial position now indicates position in the frequency spectrum, but otherwise the formulation remains
unchanged. As an example, spatiotemporal ICA was performed on the spectrogram
shown in Figure 3.7. The resulting basis functions in both time and frequency are shown
in Figures 3.15 (see Appendix 2 for audio examples) and 3.16.
In this example, slight improvements in the separation of the other sources came at the expense of the separation of the snare. When listening to the re-synthesised sound sources
there is no qualitative improvement in sound quality.
Further testing with spatiotemporal ICA showed minor improvements in some
aspects of the sound source separation at the expense of minor degradations in other
aspects. This serves to informally validate an observation by Smaragdis, namely that for
musical signals independence in frequency usually also amounts to independence in time
[Smaragdis 01]. Thus, despite the fact that the times at which an instrument is played are independent of the frequency spectra it produces, in practice spatiotemporal ICA yields very little if any improvement over ISA.
Locally Linear Embedding (LLE) is a geometrically motivated dimensional reduction technique which attempts to discover the underlying structure of high dimensional data [Saul 03]. Given N input vectors Xi, each of dimensionality D, the first step in LLE is to identify the K nearest neighbours of each vector, typically using the Euclidean distance; other examples of possible distance metrics can be found in [Saul 03]. Reconstruction errors
are then measured by:
ε(W) = ∑i |Xi − ∑j Wij Xj|² (3.64)
where the weights Wij contain the contribution of the jth vector to the reconstruction of
the ith vector. To obtain the Wij the above cost function is minimised subject to two
constraints. Firstly, each vector Xi can only be reconstructed from its K nearest
neighbours, in effect forcing Wij = 0 if Xj is not one of the nearest neighbours. Secondly,
the rows of the weights matrix are constrained to sum to one, i.e. ∑j Wij = 1. The optimal
weights can then be found by solving a set of constrained least squares problems.
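The weight calculation can be sketched as follows. This is a minimal Python implementation of the standard constrained least squares solution; the small regularisation term is a common practical addition for the case where K exceeds the local dimensionality, and is not part of equation 3.64 itself.

import numpy as np

def lle_weights(X, K, reg=1e-3):
    # X: (N x D) input vectors. Returns the (N x N) weight matrix W
    # minimising equation 3.64 under the two constraints described above.
    N = X.shape[0]
    W = np.zeros((N, N))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:K + 1]     # K nearest neighbours of X_i
        Z = X[nbrs] - X[i]                    # neighbours centred on X_i
        G = Z @ Z.T                           # local Gram matrix
        G += reg * np.trace(G) * np.eye(K)    # regularise ill-conditioned G
        w = np.linalg.solve(G, np.ones(K))    # constrained least squares
        W[i, nbrs] = w / w.sum()              # rows sum to one
    return W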
An important property of these constrained error minimising weights is that, for
any given vector, they are invariant to rotations, rescalings and translations of that vector
and its K nearest neighbours. The invariance to rotations and scalings comes from the
form of equation 3.64 and the invariance to translation is enforced by the constraint that
the rows of the weights matrix sum to one. As a result of these invariances, the weights
characterise intrinsic geometric properties of each neighbourhood, as opposed to
properties that depend on the frame of reference employed.
The data is then assumed to be on or near a smoothly varying non-linear
manifold, with the dimensionality of the manifold being d << D. It is then assumed that
there exists a linear mapping, consisting of a translation, rotation and rescaling, which
maps the high dimensional neighbourhoods to global coordinates on the underlying
manifold. As the reconstruction weights Wij are invariant to translation, rotation and
rescaling, their characterisation of local geometry in the original data can be expected to
be equally valid for local pieces of the underlying manifold. In other words, the weights
Wij that reconstruct the original vectors Xi of dimensionality D can also be used to
reconstruct the underlying manifold in d dimensions.
The final step in LLE is then to map the high dimensional inputs Xi to a low
dimensional output Ri which represent the underlying manifold. This is done by finding
the d dimensional coordinates of each Ri to minimise the embedding cost function:
Φ(R) = ∑i |Ri − ∑j Wij Rj|² (3.65)
As can be seen, the cost function is very similar to that of equation 3.64, and is again
based on locally linear reconstruction errors. However, in this case the weights Wij are
fixed and the outputs Ri are optimised. The embedding is calculated directly from the Wij
without reference to the original inputs Xi . In effect the algorithm finds low dimensional
outputs Ri that can be reconstructed from the same weights Wij as the original high
dimensional data Xi .
The embedding cost function is optimised by solving a sparse N x N eigenvalue
problem, which is a global operation over all the data points. This contrasts with the fact
that the reconstruction weights are calculated from the local neighbourhood of each input.
This is how the algorithm attempts to discover global structure; it attempts to integrate
information from overlapping local neighbourhoods. Like PCA the resultant outputs Ri
are orthogonal to each other. This is achieved in solving the eigenvalue problem. As a
result of this, LLE shares the property with PCA that only as many outputs Ri as required
need be calculated.
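The embedding step can be sketched as follows, using the standard construction in which the outputs are the bottom eigenvectors of M = (I − W)ᵀ(I − W); a dense eigensolver is used here for brevity, where a sparse solver would normally be preferred for large N.

import numpy as np

def lle_embed(W, d):
    # With the weights W fixed, minimising equation 3.65 amounts to
    # finding the eigenvectors of M = (I - W)^T (I - W) with the
    # smallest eigenvalues; the constant eigenvector is discarded.
    N = W.shape[0]
    I_W = np.eye(N) - W
    vals, vecs = np.linalg.eigh(I_W.T @ I_W)   # ascending eigenvalues
    return vecs[:, 1:d + 1]                    # the d-dimensional outputs R_i

Combined with the weights sketch above, lle_embed(lle_weights(X, K), d) gives the complete mapping from the inputs Xi to the outputs Ri.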
LLE has proved successful in determining the underlying structure of high
dimensional data in cases where PCA has failed to obtain the underlying structure. LLE
appeals to the underlying local geometry of the data presented to it to carry out
dimensional reduction, whereas PCA carries out dimensional reduction with reference to
the variance of the data. In some cases this appeal to local geometry provides a more
salient description of the data than a variance based approach.
An example, taken from [Saul 03] is shown in Figure 3.17. LLE was applied to a
number of images which were created by translating an image of a face across a
background of random noise. The noise was uncorrelated from one image to the next, so
the only consistent structure in the resulting images can be described by a two-
dimensional manifold parametrised by the center of mass of the face image. Over 950
images were presented to the LLE algorithm. These images were 59 by 51 pixels in size,
resulting in a dimensionality of 3009 for each image. The top figure shows the structure
obtained from the first two principal components obtained using PCA; the bottom figure
shows the structure obtained from a two dimensional embedding using LLE. As can be
seen, the corner images have been successfully mapped to the corners using LLE,
whereas PCA has not maintained the correct neighbourhood structure of nearby images.
Figure 3.17. LLE vs. PCA for face image translated in 2-D against a background of noise
The only parameters for the algorithm are the choice of the number of dimensions d to represent the data, and the number of neighbours K for each data point. It has been observed in [Saul 03] that the results of LLE do not depend sensitively on the number of nearest neighbours, with the provisos that K must be greater than d and that too high a value for K invalidates the assumption that a vector and its neighbours can be modelled
linearly.
As noted previously, PCA performs redundancy reduction based on variance. As
a result, PCA is biased towards the loudest sources when attempting to separate sources
from a spectrogram. The number of components that needs to be retained to identify all
the sources present varies with the relative amplitude of the sources. This can cause
difficulties in drum transcription where the hi-hats or ride cymbals tend to have much
lower amplitudes than the snare or bass drum. LLE, on the other hand, determines
components based on regions of similarity (or local neighbourhoods). Therefore, LLE is
less prone to variations in relative amplitude between sources in the mixture spectrogram,
and the variation in the number of components required to identify sources is less severe.
Taking this into account, it was decided to attempt to use LLE to capture the
structure of spectrograms consisting of a mixture of sources. This work was first
presented in [FitzGerald 03c]. Consider a spectrogram Y of size n x m, where n is the
number of frequency channels and m is the number of time slices. Then, with regards to
the LLE algorithm, the dimensionality D of the data is given by n and the number of
input vectors N is given by m. The outputs Ri are in this case taken to represent the
evolution of similar neighbourhoods through the spectrogram. These similar
neighbourhoods are made up of time slices that have similar frequency content, and so
the outputs should capture events in the spectrogram that have similar frequency content.
Figure 3.18 shows the outputs Ri obtained from LLE for a drum loop containing snare, bass drum and hi-hats. When the same loop is analysed using PCA, although the snare and bass drum can be clearly identified, the hi-hats only show up as very small peaks in the third principal component. It can be seen that LLE has more successfully captured the hi-hats, which were low in amplitude relative to the snare and bass drum.
Despite capturing the overall structure of the sources, smaller peaks are still
visible in the output LLE vectors where other drums occur. These peaks are possibly due
to the fact that some of the neighbourhoods integrated in the final step of LLE may
consist of neighbours that belong to more than one drum, especially in cases where drums
occur simultaneously.
To overcome this, we passed the outputs from LLE to an ICA algorithm in a
similar manner to the way the outputs from PCA are passed to an ICA algorithm in ISA.
Figure 3.20 shows the independent sources obtained if the Ri shown in Figure 3.18 are
transformed using ICA. As can be seen, improved separation of the sources has occurred
with noticeably clearer peaks for both the bass drum and hi-hats.
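This two-stage procedure, LLE for dimensional reduction followed by ICA, can be sketched using scikit-learn's implementations. The placeholder signal, STFT parameters, value of K and number of components below are illustrative only, and are not the settings used to produce the figures.

import numpy as np
from scipy.signal import stft
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.decomposition import FastICA

fs = 44100
x = np.random.default_rng(0).normal(size=4 * fs)   # stands in for a drum loop

freqs, times, Z = stft(x, fs=fs, nperseg=2048, nfft=4096, noverlap=2048 - 256)
Y = np.abs(Z)                                      # n x m magnitude spectrogram

# each of the m time slices is one input vector of dimensionality D = n
R = LocallyLinearEmbedding(n_neighbors=20, n_components=3).fit_transform(Y.T)

# pass the LLE outputs to ICA, as PCA outputs are passed to ICA in ISA
envelopes = FastICA(n_components=3, random_state=0).fit_transform(R)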
As noted previously, the results of LLE do not depend sensitively on the choice of
the number of nearest neighbours. Choosing different values for K results in outputs that
essentially capture the same information about the sources. Figure 3.21 above shows the
results obtained by carrying out LLE with K = 50 on the same drum loop as in Figure
3.18. The sources have been captured in the same order, and the same main peaks occur
in each source but the overlap between the sources is different. In this case the snare
vector shows little evidence of the hi-hats, which instead show up in the bass drum
vector, and the hi-hat peaks are more consistent in their amplitudes in the hi-hat vector.
Figure 3.22. Independent components obtained from ICA of LLE outputs (K = 50)
When the Ri for K = 50 are transformed using ICA the sources are again recovered
correctly. The independent components obtained are shown in Figure 3.22. However, in
this case, performing ICA has led to a reduction in recognisability for the hi-hats, with
the dominant peaks in the hi-hat component being those of the snare drum. This occurs as
a result of the two prominent local minima present in the LLE hi-hat vector. As ICA is
invariant to scaling, these two minima are regarded by the ICA algorithm as being as important as the peaks.
This highlights the fact that while LLE itself is not particularly sensitive to the
choice of K, using LLE as a substitute for PCA in ISA results in an increased sensitivity
to the choice of K, particularly at low dimensions. Careful choice of K results in LLE
vectors which give better separation when passed to the ICA step of ISA. However, at present there is no suitable method for choosing K, resulting in the need for a human observer to tune this parameter if the algorithm is to perform optimally. This restricts the robustness of using LLE
for the dimensional reduction step of ISA. However, this problem was found to be less
severe when higher numbers of components, typically greater than 10, were recovered
from the LLE algorithm.
Figure 3.23, above, shows the Ri obtained from LLE for a different drum loop again
containing snare, hi-hat and bass drum. Events marked ‘x’ are hi-hats, ‘b’ bass drum and
‘s’ snare drum. There is considerable overlap between the sources so that none can be
clearly identified from the outputs from the LLE algorithm. Further processing with ICA
or changing the number of nearest neighbours does nothing to improve this situation. Increasing the number of components obtained from LLE to 10 or greater was, however, found to improve the likelihood of obtaining the correct separation of the sources.
Obtaining higher numbers of components from LLE can help eliminate some of
the problems encountered at lower dimensionality when using the algorithm. However,
the use of increased numbers of components comes at a price, namely the problem of
clustering the multiple components, which remains as much a problem as when carrying
out standard ISA, thus affecting the robustness of ISA using LLE for the purposes of
drum transcription.
Due to its emphasis on local neighbourhoods (or regions of similar frequency
content), as opposed to the variance based approach of PCA, LLE has proved capable of
extracting the required information on the sources present in a mixture spectrogram in
cases where PCA does not do so adequately. This makes it a potentially useful tool for
the purposes of drum transcription. However, this must be balanced by the fact that, at
low dimensionality, if too much overlap between sources occurs in the local
neighbourhoods obtained from LLE, then the sources will not be characterised
adequately. Also, when using LLE instead of PCA in ISA, it should be noted that at
lower dimensionality the choice of K becomes critical to obtaining good results from the
ICA step of the algorithm. At higher dimensions the above problems are overcome, but at
the expense of encountering the clustering problem also encountered at higher
dimensionality with standard ISA. Thus, despite having some attractive properties, such
as its geometric, non-variance based dimensional reduction, LLE is not a robust enough technique to employ for drum transcription purposes.
3.7 Conclusions
As noted at the start of this chapter, information theory has been successfully used to
model aspects of perception including how we perceive sounds and there is evidence that
the auditory system does carry out redundancy reduction. Using an information theoretic
approach has also shown that some of the grouping cues stated by Bregman can in fact be
replaced by a single information theoretic rule. This simplification leads to systems which are less difficult to implement than those that attempt to use the rules stated by Bregman directly, as these rules deal with quantities that have proved difficult to implement computationally. This difficulty has already been outlined in Chapter 2, which
includes analyses of sound separation/analysis systems using Bregman’s cues. Another
strength of these techniques is that they are general techniques which are not specifically
aimed at audio signals.
The techniques of PCA and ICA were reviewed, and the limitations of each
technique summarised, such as the requirement of most ICA algorithms that there be as many sensors as sources for separation to take place. However, if this limitation can be overcome ICA may be of greater utility for single channel sound separation in the future.
Next ISA, which combines both PCA and ICA to allow source separation from single
channel mixtures, was investigated. ISA has proved capable, within its limitations, of efficiently separating sound sources without recourse to large numbers of heuristic-based rules, and is at present the state of the art in information theoretic single channel separation algorithms for music, even though the same techniques can also be applied to
other types of data. However, there still remain a number of problems to be overcome
with the method, such as estimation of the required number of bases for optimal
separation and the bias towards louder sources introduced by the use of PCA for
dimensional reduction. The clustering of subspaces also remains problematic.
Despite these problems, ISA-type methods were found to have potential for the
automatic transcription of drums. The assumptions inherent in ISA make such methods particularly suitable for analysing drum loops, as the assumption that pitch has to remain stationary for the duration of the spectrogram holds well for drum loops. The use of statistics across the whole excerpt allows ISA to overcome some of the problems in separating and identifying mixtures of drums encountered using other methods such as that of Sillanpää et al. [Sillanpää 00].
Sparse Coding techniques were then examined, in particular with reference to
their application to the problem of drum transcription. It was found that the use of Sparse
Coding, while sharing the same strengths as ISA, also suffered from many of the same
problems and limitations, in particular the estimation of the number of components
required for separation, and the problem of identifying and recovering low energy
sources. Indeed the underlying signal models of the two methods are so similar that
Sparse Coding can be viewed as an ISA-type method, the only difference being how the
spectrogram is decomposed into a set of basis functions.
Some potentially useful extensions to ISA were looked at, such as spatiotemporal
ICA, which, though of potential use due to its attempt to obtain independence in both
time and frequency, in practice gave very similar results to standard ISA. Finally, LLE
was looked at as a technique which offered the possibility of overcoming some of the
problems associated with variance based dimensional reduction techniques such as PCA.
While it did succeed in overcoming some of the problems associated with PCA, the
algorithm was found not to be robust enough to employ as a dimensional reduction stage
in ISA.
It can be seen that using an information theoretic approach offers some
advantages over other approaches in creating systems for the extraction of information
from single channel mixtures, both because of its ability to describe aspects of human
perception and because of successes, within certain limitations, in separating sounds from
single channel mixtures.
4 Drum Transcription Systems
As can be seen from the previous two chapters, there are a number of possible approaches
to the implementation of drum transcription systems. The lack of research in this area has
also been highlighted, with only four “drums only” transcription systems and two “drum transcription in the presence of pitched instruments” systems uncovered in the literature review.
Most of these systems have focused on modelling each drum individually and then
attempting to use mixtures of these models in an attempt to transcribe the drums. To date,
such systems have met with very limited success. The system described by Goto [Goto
94] only contains two examples of transcription and does not contain results from a large
number of tests. Similarly, the systems described by Sillanpää [Sillanpää 00] and
Jørgensen [Jørgensen 01] do not contain any systematic evaluation of transcription
results.
The three remaining, and most recent, systems do contain evaluations of the
performance of the algorithms. The “drums only” transcription system described by
Paulus attempts to overcome the problem of dealing with mixtures of drums by explicitly
modelling the drum mixtures [Paulus 03]. However, the error rates for transcription based
on the mixture models are quite high, and only with the use of probabilistic models of
drum sequences does the error rate fall. The system described by Zils [Zils 02] for
transcribing snare and bass drum in the presence of pitched instruments ignores the
problem of dealing with mixtures of drum sounds by giving occurrences of bass drums
priority over those of snare drums. Apart from the work of the author, only recent work
by Virtanen attempts to deal with transcribing drum mixtures in the presence of pitched
instruments [Virtanen 03].
Other more general sound source separation methods, such as the DUET
algorithm [Yilmaz 02] and some CASA schemes, were also discussed. While the DUET
algorithm has been shown to be of potential use when each of the drum sounds has a
different pan position, the existence of separate pan positions for each drum sound cannot
be guaranteed for a given recording and the algorithm will not work on mono recordings,
thus eliminating the DUET algorithm as a potential candidate for use in drum
transcription.
As these approaches based solely on drum models have not met with much
recorded success, it was decided to adopt a redundancy reduction based approach to drum
transcription. Of the various redundancy reduction based methods discussed in the
previous chapter, an ISA-type approach appeared to be the most suitable for the purposes
of drum transcription. This is because ISA-type approaches have shown the potential to
be able to separate mixtures of drum sounds from single channel mixtures [Casey 00],
[Virtanen 03], and so can overcome some of the problems encountered by the model
based approaches, namely how to deal with identifying mixtures of drum sounds. Also,
the assumption of stationary pitch inherent in the ISA model is, as previously noted, valid
for the drum sounds of interest, which makes it particularly suited for dealing with drum
sounds. It was therefore decided to take ISA as a starting point for the implementation of
a robust drum transcription system.
As noted previously, however, there are a number of problems associated with
ISA, most importantly the problem of estimating the required number of basis functions
to allow separation and identification of the sources present. Attempts have been made to
overcome this problem by replacing PCA in the redundancy reduction step with other
redundancy reduction methods such as LLE, but this was not found to be robust enough
for transcription purposes [FitzGerald 03]. It should also be noted that ISA is a totally
blind source separation method, which contains no information about the sources to be
separated. By incorporating knowledge about the sources of interest, in this case drum
sounds, it is possible to overcome the problem of estimating the required number of basis
functions and so create robust systems for drum transcription, in effect creating hybrid
systems which combine the model based approaches to drum transcription outlined in
Chapter 2, with the statistical/redundancy reduction approaches of Chapter 3. It was
decided not to use any form of rhythmic modelling or models of typical drum patterns to
aid the transcription as it was felt that a system that was capable of working without the
use of such models would be capable of operating in more general circumstances than
one that contained such models. The methods created for the incorporation of prior
knowledge into ISA type schemes are outlined in the remaining sections of this chapter.
As noted previously, the number of basis functions required to separate the sources using
ISA varies depending on the frequency characteristics and relative amplitudes of the
sources present. As a result, using thresholding methods to decide how much information
to retain from the PCA stage becomes impractical. To overcome this problem, and to
allow recovery of the necessary information to transcribe drums successfully, a sub-band
processing step can be added to the ISA method [FitzGerald 02].
The addition of sub-band processing to the ISA method is motivated by observing
some general properties of drums as used in popular music which were previously
discussed in section 1.3. The drums in a standard rock kit can be divided into two types:
drums where a membrane or skin is struck, including snares, toms, and kick drums, and
drums where a metal plate is struck, including hi-hats and cymbals. The membrane drums
have most of their energy in the low end of the frequency range, below 1 kHz, while the plate drums have most of their energy spread out over the spectrum above 2 kHz
[Fletcher 98].
This is illustrated in Figure 4.1, where the intense regions below 1 kHz
correspond to the occurrence of membrane drums. Also, in most popular music the
membrane drums are mixed louder in the recordings than the metal plate drums. This
means that the membrane drums dominate in ISA analysis of the input signals.
It is possible to make use of the frequency characteristics of the drums to improve
the robustness of the ISA method for transcription purposes by using sub-band
processing. The signal is split into two bands, a low pass band for transcribing the
skinned drums, and a high pass band for the metal drums. The low pass filter had a cutoff
frequency of 1kHz, and the high pass filter had a cutoff frequency of 2 kHz. Low-order
Butterworth filters were used for both filters. The high pass filter has the effect of
removing a large amount of the energy of the membrane drums, thus allowing the metal
plate drums to be identified with greater ease.
With sub-band processing, each band requires only two retained basis functions, and the information recovered is cleaner than that recovered using standard ISA with 5 basis functions.
To demonstrate the robustness of sub-band ISA a simple drum transcription system was
implemented in Matlab. The system is limited, but effective within the confines of its
limitations. It contains no explicit models of the drum types and contains no rhythmic
models, but does make a number of assumptions. Firstly, it is assumed that only three
drums are present in the test signals: snare drums, kick drums and hi-hats. The basis for
this assumption is that the basic drum patterns found in popular music consist largely of
these three drums. Secondly, it is assumed that the hi-hat occurs more frequently than the
snare drum. Again this assumption holds for most drum patterns in popular music.
Thirdly, it is assumed that the kick drum has a lower spectral centroid than the snare
drum. This assumption is justified in that snare drums are perceptually brighter than kick
drums, and the brightness of sounds has been found to correlate well with the spectral
centroid [Gordon 78]. These assumptions are necessary to overcome the source ordering
problem inherent in the use of Independent Component Analysis (ICA). The use of sub-
band processing ensures that only two basis functions are required in each band to
separate the components. A Matlab implementation of the algorithm can be found in
code\sub-band ISA\SISA.m in the accompanying CD.
Analysis starts with the signal being filtered into two bands as described
previously. The low-pass signal is then passed to the ISA algorithm with only two basis
functions kept from the PCA step. The spectral centroids of the separated components are
calculated, and the component with the lowest centroid identified as the kick drum. The
other component is then identified as the snare. As separation of the sources is not
perfect, the amplitude envelopes are normalised and all peaks above a threshold are taken
as an occurrence of a given drum.
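The identification step can be sketched as follows, assuming the two separated frequency basis functions from the low pass band are available as non-negative numpy vectors; the function names are illustrative only.

import numpy as np

def spectral_centroid(spectrum, freqs):
    # Amplitude-weighted mean frequency of a frequency basis function.
    # freqs can be obtained from, e.g., np.fft.rfftfreq(4096, 1 / 44100).
    return np.sum(freqs * spectrum) / np.sum(spectrum)

def label_low_band(components, freqs):
    # The component with the lower spectral centroid is taken to be the
    # kick drum; the other is then identified as the snare.
    kick = int(np.argmin([spectral_centroid(c, freqs) for c in components]))
    return {'kick': kick, 'snare': 1 - kick}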
Initially, onset times were calculated using a variation of the onset detection
algorithm proposed by Klapuri [Klapuri 99]. As the amplitude envelopes contain only
drum sounds which have clearly defined onsets there was no need for the use of
multiband processing to detect onsets, and so using the relative difference of the
amplitude function was sufficient to obtain the onset times. However, on further
investigation it was discovered that using the time of the actual peak resulted in a slight
improvement in the accuracy of the onset times. The use of the relative difference
function resulted in onset times that were on average too early, while using the time of
the actual peak was on average too late, but was closer to the actual onset time. Using the
peak times as onsets resulted in an average error of just under 10 ms between the actual
onset times and the detected onset times.
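The peak picking described above can be sketched as follows; the threshold value shown is illustrative (the thesis does not state the value used for the drum loop tests), and the hop size matches the STFT parameters given later in this section.

import numpy as np
from scipy.signal import find_peaks

def onsets_from_envelope(env, hop=256, fs=44100, threshold=0.2):
    # Normalise the amplitude envelope, take every peak above the
    # threshold as a drum occurrence, and report peak times as onsets.
    env = env / np.max(np.abs(env))
    peaks, _ = find_peaks(env, height=threshold)
    return peaks * hop / fs            # onset times in seconds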
The high-pass signal is processed in a similar manner, with the hi-hat determined
as the basis function that has the most peaks in amplitude over the threshold. The
remaining basis function contains the high frequency energy from the snare drum that has
not been removed in filtering.
The system was tested on 15 drum loops containing snares, hi-hats and kick
drums. The drums were taken from various sample CDs and were chosen to cover the
wide variations in sound within each type of drum. The drum patterns used were
examples of commonly found patterns in rock music, as well as variations on these
patterns. The tempos used ranged from 80 bpm to 150 bpm and different meters were
used, including 4/4, 3/4 and 12/8. Relative amplitudes between the drums were varied from 0 dB to –24 dB to cover a wide range of situations and to make the tests as
realistic as possible. The same set of analysis parameters was used on all the test signals.
The test signals all had a sample rate of 44.1 kHz. A 2048 sample window zero-padded
to 4096 samples was used for carrying out the STFT, and the hopsize between frames
was 256 samples. Detected onsets within ±30 ms of the actual onset were taken as being
correct. Unless otherwise stated these parameters were also used for the succeeding
algorithms. The results of the tests are summarized in Table 4.1. The percentage correct
is obtained from the following formula:
correct = ((total − undetected − incorrect) / total) ⋅ 100 (4.1)
This is also used to calculate percentages in all succeeding tests unless otherwise stated.
Table 4.1. Sub-band ISA drum transcription results.

Drum   Total  Undetected  Incorrect  % Correct
Snare  21     0           2          90.5
Kick   33     0           0          100
Hats   79     6           6          84.8
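For example, for the hi-hats: correct = ((79 − 6 − 6) / 79) ⋅ 100 = 84.8%.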
While sub-band ISA goes some way to overcoming some of the problems associated with
ISA for the purposes of drum transcription, it still results in more subspaces than
sources. However, by considering the origins of ISA and the assumptions inherent in the
ISA model, it is possible to formulate a version of subspace analysis that can incorporate
actual models of the drums or sources of interest.
One of the assumptions underlying the ISA model is that each source can be characterised by invariant frequency and amplitude basis functions. If the sources change in pitch, so that pitch changes occur with every event, then the analysis will not succeed. However, in cases where
drums only are present this assumption holds reasonably well, as drums do not change in
pitch from occurrence to occurrence. It is important to note that the assumption of
invariant frequency and amplitude basis functions results from the fact that both PCA and
ICA perform linear transformations of the original data.
The success of Casey’s original aim of finding invariants that allow classification and
identification of sounds is well documented [Casey 01]. This suggests that it is possible
to find invariants that will be good approximations to many occurrences of a given type
of sound, for example a snare drum. Therefore, by analysing large numbers of each drum
sound as per Casey and then creating a model of the drum by means of an algorithm such
as the k-means algorithm, it should be possible to arrive at a small number of invariants
that characterise a given class of drum. These invariants can then be used as prior
subspaces with which to carry out initial analysis of the audio extract. As the amplitude
envelope for a drum pattern depends on the pattern played by the drummer, it is the
frequency invariants that are used to obtain our prior subspaces.
Once the prior subspaces have been obtained, then multiplying a 1 × n prior
frequency subspace corresponding to a drum known to be present in the audio signal with
an n × m spectrogram should yield a 1 × m amplitude envelope that approximates the
actual amplitude envelope of the drum. The same process can be carried out for each of
the drums known to be present, resulting in a set of amplitude envelopes that
approximately correspond to the amplitude envelopes of the drums present.
However, due to the broadband noise nature of drums, smaller peaks will occur as a result of the other drums present. To overcome this, ICA can be carried out on the
amplitude envelopes of the prior subspaces. This results in a set of independent amplitude
envelopes that correspond to each drum present. From these independent amplitude
envelopes a new set of frequency basis functions can be obtained to allow re-synthesis of
the original sounds. It is proposed to call this technique Prior Subspace Analysis (PSA)
[FitzGerald 03a].
Stated formally, the PSA model can be described as follows. The original single
channel sound mixture signal is assumed to be a sum of p unknown independent sources:
s(t) = ∑q=1..p sq(t) (4.2)
The amplitude basis functions can then be estimated as:

t̂ = fppᵀY (4.9)

where fpp is the pseudo-inverse of fpr, the matrix containing the prior frequency subspaces. However, the estimated amplitude basis functions returned are not independent, and so ICA is carried out on t̂ to give:

t = Wt̂ (4.10)

where W is the unmixing matrix obtained from ICA and t contains the independent amplitude basis functions. Improved estimates of the frequency basis functions can then be obtained from:

f = Ytp (4.11)

where tp is the pseudo-inverse of t. Re-synthesis of the independent spectrograms can then be carried out in a manner similar to that of ISA, with estimation of the phase information carried out as described in section 3.3.
The PSA algorithm can be summarised in pseudo-code as follows:
1. Carry out an STFT on the input signal.
2. Obtain a magnitude spectrogram Y from the magnitude of the STFT values.
3. Estimate amplitude basis functions t̂ from t̂ = fppᵀY, where fpp is the pseudo-inverse of the prior frequency subspaces fpr.
4. Carry out ICA on t̂ to obtain independent amplitude basis functions t = Wt̂.
5. Estimate improved frequency basis functions from f = Ytp, where tp is the pseudo-inverse of t.
6. Associate each pair of amplitude and frequency basis functions with its corresponding source.
7. Resynthesise using the original phase information or via a phase estimation method such as [Griffin 84] or [Slaney 96].
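A minimal Python sketch of steps 2 to 5 is given below, assuming the prior frequency subspaces are stacked as the rows of a matrix whose number of columns matches the number of frequency bins in Y; identification of the sources and re-synthesis are omitted, and the STFT parameters follow those given in the sub-band ISA tests.

import numpy as np
from scipy.signal import stft
from sklearn.decomposition import FastICA

def psa(x, f_pr, fs=44100):
    # f_pr: (p x n) matrix of prior frequency subspaces, one row per drum.
    _, _, Z = stft(x, fs=fs, nperseg=2048, nfft=4096, noverlap=2048 - 256)
    Y = np.abs(Z)                                # n x m magnitude spectrogram
    t_hat = np.linalg.pinv(f_pr).T @ Y           # eq. 4.9: p x m envelope estimates
    ica = FastICA(n_components=f_pr.shape[0], random_state=0)
    t = ica.fit_transform(t_hat.T).T             # eq. 4.10: independent envelopes
    f_new = Y @ np.linalg.pinv(t)                # eq. 4.11: improved frequency bases
    return t, f_new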
Figure 4.3. Prior subspaces for snare, bass drum and hi-hat.
Figure 4.3 shows a set of prior subspaces obtained for snare, bass drum and hi-hat
respectively. These prior subspaces were obtained by performing PCA followed by ICA
on large numbers of each drum type. Three components were retained from the PCA step.
These were then analysed using ICA and the independent component with the largest
projected variance for each drum sample was then retained. The independent components
obtained from each sample of a given drum type were then clustered using a k-means
clustering algorithm to give a single prior subspace for each drum type.
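The construction of the priors can be sketched as follows. FastICA's internal whitening stands in for the separate PCA step, k-means with a single cluster reduces to the mean of the kept components, and the sign correction is an assumption needed because ICA components have arbitrary sign.

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.cluster import KMeans

def build_prior(drum_spectrograms):
    # drum_spectrograms: list of (n bins x m frames) magnitude
    # spectrograms, one per sample of a given drum type.
    kept = []
    for Y in drum_spectrograms:
        ica = FastICA(n_components=3, random_state=0)
        ica.fit(Y.T)                                # frames as observations
        F = ica.mixing_                             # n x 3 frequency basis functions
        best = np.argmax(np.sum(F ** 2, axis=0))    # largest projected variance
        v = F[:, best]
        v = v * np.sign(v[np.argmax(np.abs(v))])    # make the main peak positive
        kept.append(v / np.linalg.norm(v))
    km = KMeans(n_clusters=1, n_init=10).fit(np.array(kept))
    return km.cluster_centers_[0]                   # single prior subspace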
Figure 4.4 shows the independent amplitude envelopes obtained by performing
PSA on a drum loop. The relevant audio examples can be found in Appendix 2. As can
be seen, PSA has successfully separated three subspaces
corresponding to the individual drums. The main distinction between PSA and ISA is that
as its name implies, prior subspace analysis requires prior knowledge of what sources are
present, and prior information about the invariants associated with these sources. It only
looks for as many sources as are known to be present and can find them efficiently. The
use of prior subspaces also overcomes the problem of sources at lower relative
amplitudes to other sources in the mixture, which was a problem when using PCA for the
initial decomposition. However, PSA still suffers from the limitation that the source
signals cannot be recovered in the order in which they came in, thus necessitating the
identification of the recovered subspaces with the original prior subspaces by some
means such as their frequency characteristics or their amplitude envelopes.
PSA can also achieve a degree of separation under conditions where pitch changes occur in the analysis window. Figure 4.5 contains the
separated drums from a mixture of bass drum, snare drum, and 4 pitched synthesised
sounds. However, the separation in the presence of pitched instruments depends on the
number and type of instruments present and it may be necessary to use further signal
processing techniques to help reduce the influence of the pitched instruments and so
allow transcription to take place.
To test the robustness of PSA in as wide a range of circumstances as possible, and to get
an idea of the limitations of the method in general, the algorithm was tested on a large
number of synthetic signals.
As the PSA algorithm assumes that a mixture spectrogram is composed of a sum
of outer products, it was decided to create the test signals by combining sums of outer
products. In other words, the mixture spectrograms were created directly in the time-
frequency domain, eliminating the need to carry out STFTs on time domain signals to
generate spectrograms, which sped up the testing process.
Test signals consisted of mixtures of two sources, where each sound source was
created by summing the outer products of two frequency vectors with two time vectors.
As drum sounds such as snares and kicks have a narrow frequency region containing most of their energy, together with a wider region containing less energy, it was decided to approximate this situation by using two Hanning windows, one of size 50 samples and the other of size 100 samples. These were then zero-padded at either end to create frequency vectors of length 2000. The amplitude envelope associated with the Hanning window of size 50 samples was a linear ramp going from 1 to 0 of length 30. The amplitude envelope for the Hanning window of size 100 samples was a decaying exponential. Each source occurred twice, but did not overlap with the occurrence of the
other source. The time and frequency vectors used for the first source are shown below in
Figure 4.6. The frequency vectors for the second source were the same as for the first
source, only shifted right by a number of frequency bins. This shift was varied to test the
effect of the amount of frequency overlap of the sources on the PSA algorithm. The time
vectors for the second source were the same as those for the first source, only shifted to
the right by 50 time frames.
The test spectrograms were then created by summing the outer product of each
pair of time and frequency vectors. The prior subspace was then taken to be the first
frequency vector for each source, and the PSA algorithm then run on the test
spectrogram. The ICA algorithm used was the Jade algorithm [Cardoso 93] as this was
found to be more stable than FastICA [Hyvärinen 99], which did not converge in all
cases. To overcome the source ordering problem inherent in the use of ICA, the largest
peak in each of the recovered signals was identified. The recovered signal was then
associated with the original source that contained the peak in question. As the separation
achieved in the ICA step of the algorithm is not perfect, all peaks detected over a
threshold of 0.2 in the normalised outputs of the ICA algorithm were recorded as events.
As mentioned already, the first parameter varied in testing was the amount of
overlap in frequency between the sources. The amount of frequency offset of the second
source was varied from 1 to 50 frequency bins, thus testing conditions ranging from
almost total overlap of the first frequency vectors to no overlap. The second parameter
varied in testing was the relative amplitude of the sources. This was controlled by scaling
the time vectors of the second source by a number between 1 and 0. The step size for
decreasing the gain of the second source was 1/50, giving 50 steps. This allowed
measurement of the effects of relative amplitude between the sources on the PSA
algorithm. Taking each possible frequency offset and relative amplitude gives 2500 tests.
Two further parameters were also varied: the degree of mismatch between the prior subspace and the actual frequency vectors of the sources, and the relative amplitude of the second frequency vector to the first within each source, giving 48 combinations of these two parameters. The combination of the four parameters results in a total of 120000 tests of the
PSA algorithm. The code used for these tests can be found in code\Psa\testPSA.m in the
accompanying CD. The percentage correctness across all the tests, measured using the
formula shown in equation 4.1, is 76%. While this is an encouraging figure, and
demonstrates that the algorithm is robust under a wide range of conditions, it is more
instructive to look at how the test results varied with the test parameters.
Figure 4.7 shows the average results obtained for frequency overlap of the sources
and the relative amplitude of the sources for all 48 possible variations in prior subspace
mismatch and relative amplitude of the first and second frequency vectors within a given
source. It is important to note that these are averages, and that the effects of variations in
prior subspace mismatch and relative amplitude between frequency vectors will be
discussed later. The correctness or otherwise of the test was then determined as per
equation 4.1. This measure of correctness was used as it is quite severe in punishing
spurious detections, resulting in a negative measure of correctness when the total number
of spurious detections and missing events outnumbers the detection of actual events.
In Figure 4.7 below, the colour scale runs from dark red to dark blue, with dark
red corresponding to a percentage correct of 100%, while dark blue corresponds to a
negative percentage of around –100%. As can be seen, the larger the frequency overlap
and the smaller the relative amplitude of the second source the poorer the algorithm
performed. This is to be expected as the larger the overlap in frequency the greater the
opportunity there is for confusion between the two sources, resulting in
misidentifications. Also, the smaller the relative amplitude between the sources the less
likely the second source is to be detected correctly. The algorithm was found to give a
totally correct result in 58% of cases regardless of variations in prior subspace mismatch
and relative amplitude between frequency vectors within sources. Correctly detected
events were found to outnumber spurious detections on average in 91.5% of cases.
Further light can be shed on the performance of the algorithm by examining the
average number of detected events. Figure 4.8 shows the percentage of events actually
present that were detected, with dark red again corresponding to 100%, and dark blue
corresponding to a value around 58%. This shows that in the vast majority of cases the
events present were in fact correctly detected, and that in practically all cases the events
related to the first source were correctly recovered. In fact, in 94.7% of the 2500 possible
combinations of frequency overlap and relative amplitude between sources, all the events
were correctly detected, regardless of variations in the other parameters. Again, the
higher the frequency overlap and the lower the relative amplitude the poorer the
algorithm performed, which is as expected. This shows that the algorithm performs very
well in detecting events present in the test signals.
As can be seen from the above discussion, the PSA algorithm performs robustly
and gives the correct separation on the synthetic signals for a wide range of frequency
overlaps and relative amplitudes. This suggests that the algorithm performs well in many
different circumstances. It is interesting to apply the results of the synthetic tests to the
prior subspaces used when attempting to transcribe actual drums. Figure 4.12 shows the
main frequency regions of energy for the prior subspaces for both snare and bass drum,
with blue representing the snare and red representing the bass drum. These prior
subspaces represent in effect an “average” snare drum and an “average” bass drum and as
such approximate the difference between the two types of drum. As can be seen,
approximating the spectra of the main peak of energy of both snare and bass drums with a
hanning window is a reasonable first approximation.
Of greater interest is the amount of overlap between the prior subspaces. There is
a gap of 14 bins between the peaks, which is approximately equivalent to a frequency
overlap of 33% using the measure used in the synthetic tests. If the results of the
synthetic tests hold for real signals then this suggests that the algorithm will perform
correctly in separating snares from bass drums for relative amplitudes of 0.28 or higher.
As snare and bass drums are usually of similar amplitude in drum loops the algorithm
should perform well in separating bass drums and snare drums. This is borne out in the
tests on drum loops which are discussed in the next section. With regards to hi-hats, the
main region of energy in the frequency spectrum has very little overlap with that of the
snare and bass drum. Therefore, it should be possible to separate hi-hats at lower relative
amplitudes. This is again borne out on the tests on actual drum loops.
Figure 4.12. Frequency spectrum of prior subspaces for bass drum and snare drum.
Of course it is unlikely that the same results can be achieved on real drum signals
as those of the test results. Nevertheless, the synthetic tests do suggest that the PSA
algorithm can perform well under a wide range of circumstances provided that the
frequency overlap is not too large and/or the relative amplitude of the sources is not too
low.
Subspace Mismatch                           0    2    4    6    8    10
Relative Amplitude between Freq. Vectors
0                                           1    1    1    1    1    1
To test the ability of the PSA method, a simple drum transcription algorithm was
implemented in MATLAB. See code\Psa\PSA.m on the attached CD for the
implementation. To allow direct comparison with sub-band ISA the same drum loops
used in testing sub-band ISA were used in testing PSA. As noted previously, the 15 drum
loops used contained hi-hats, snares and kick drums, and the drum patterns used were
commonly found patterns in rock and pop music.
In order to overcome the source signal ordering problem inherent in ICA, a
number of assumptions were made to allow identification of the sources. Firstly, it is
assumed that hi-hats occur more frequently than the other drums present. This
assumption holds for most drum patterns in popular music. Secondly, it is assumed that
the kick drum has a lower spectral centroid than the snare drum. These are the same
assumptions as used with sub-band ISA, and so the only difference between the two
algorithms lies in the manner in which the mixture spectrogram is decomposed.
As a result of imperfect separation, the recovered amplitude envelopes are
normalised and all peaks over a set threshold are taken as an occurrence of a given drum.
The same threshold was used for all the test signals in both PSA and sub-band ISA to
allow for direct comparison of results. As with sub-band ISA, onset times were calculated
directly from the time of the peaks.
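A minimal sketch of this peak picking, assuming one recovered envelope env, the STFT hop size hop, the sample rate fs and the fixed threshold thresh (all names illustrative; findpeaks is from the Signal Processing Toolbox):

    env = env / max(env);                              % normalise to [0, 1]
    [~, pkFrames] = findpeaks(env, 'MinPeakHeight', thresh);
    onsetTimes = (pkFrames - 1) * hop / fs;            % peak frames to seconds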
The results obtained for transcription using PSA are summarised in Table 4.4
below. Table 4.3 repeats the results obtained using sub-band ISA to allow comparison
between the two sets of results. As can be seen from the tables, the results for snares and
kicks are identical. It should be noted that the extra snares detected using PSA were as a
result of amplitude modulation rather than of kick drums being identified as snares, as was the case with sub-band ISA. A change to the PSA transcription algorithm to take amplitude
modulations into account would likely eliminate these errors. PSA correctly detected
more of the hi-hats than sub-band ISA. In both cases, the undetected hats were separated
correctly but fell below the threshold for detection. The fact that PSA correctly identified
a greater number of hats suggests that using prior subspaces provides a better means to
detect hats than the blind methods of sub-band ISA. A number of snares were also
identified as hi-hats in both PSA and sub-band ISA. This is due to the high frequency
energy present in snare drums which can make the separation between snares and hats
difficult. The average error in detecting onsets was 10 ms for both PSA and sub-band
ISA. This is due mainly to the poor time resolution of the STFT.
Type     Total    Undetected    Incorrect    % Correct
Snare    21       0             2            90.5
Kick     33       0             0            100
Hats     79       6             6            84.8
Table 4.3: Drum Transcription Results – sub-band ISA.
Type     Total    Undetected    Incorrect    % Correct
Snare    21       0             2            90.5
Kick     33       0             0            100
Hats     79       2             6            89.9
Table 4.4: Drum Transcription Results – PSA.
As demonstrated above, PSA is a practical method for attempting drum transcription and
separation. However, the transcription algorithm implemented above is designed to work
only where drums are present. In most pop songs the drums occur along with varying
numbers of pitched instruments. As noted in section 4.2.2, PSA only assumes that the
sources searched for do not change in pitch over the course of the spectrogram. As
already stated, this is a valid assumption for most drum sounds. Therefore, PSA has the
potential to work in the presence of pitched instruments. However, a number of issues
must be addressed before PSA can be used to transcribe drums in the presence of pitched
instruments. These issues were first highlighted and solutions were proposed in
[FitzGerald 03b].
The first issue is that the presence of a large number of pitched instruments will cause partial matches with the prior subspace used to identify a given
drum. This causes interference in the recovered amplitude envelope, which can make
detection of the drums more difficult. However, it should be noted that pitched
instruments have harmonic spectra with resulting regions of low intensity between
overtones or partials. Furthermore, due to the rules of harmony used in popular music,
many of the pitches played simultaneously will be in harmonic relation to each other. As
a result, every time pitched instruments are played there will be regions in the spectrum
where little or no energy is present due to pitched instruments. It can then be appreciated
that good frequency resolution will reduce the interference due to the pitched
instruments, and as a result, improve the likelihood of recognition of the drums.
The use of sinusoidal modelling was also explored as a means of eliminating
some of the interference due to pitched instruments. Sinusoidal modelling was discussed
in detail in Section 2.3 and attempts to represent an audio signal as a sum of sinusoids
plus a noise component. This technique had previously been used by Sillanpää et al to
eliminate the effects of pitched instruments when attempting to transcribe drums in the
presence of pitched instruments, where they assumed that the pitched instruments were
removed by sinusoidal modelling and that the remaining noise component contained the
drum sounds [Sillanpää 00]. While this is a good initial approximation, Sillanpää et al noted that some of the energy of the drums was also removed.
Figure 4.13. Snare subspace from “I’ve been losing you”, FFT size 512, hop size 256
However, the use of such a technique proved problematic for a number of
reasons. Firstly, to get good removal of the sinusoids required the use of different
thresholds for the detection of sinusoids in different signals. What successfully removed
most of the pitched instruments in one example often failed to do so in another example.
This militates against its use in a fully automated drum transcription system. The second
problem was that the lower the threshold employed for detection of sinusoids, the greater
the amount of energy from the drums which was removed with the sinusoids. This
resulted in a trade-off between the removal of the sinusoids and the amount of energy left
in the noise component for use in drum transcription. In many cases the performance of
the drum transcription was actually found to degrade by performing sinusoidal modelling
as a pre-processing step prior to attempting transcription. In particular, the main peaks in
spectral energy of snares and kick drums were found to be removed, resulting in poorer
matching of the drum sounds with their respective prior subspaces. As a result of these
difficulties, the use of sinusoidal modelling was eliminated as an option for removing
interference due to pitched instruments.
Figure 4.14. Snare subspace from “I’ve been losing you”, FFT size 4096, hop size 256
As can be seen from Figures 4.13 and 4.14, the interference due to other
instruments is greatly reduced by increasing the frequency resolution of the analysis, and
the drums are more easily identified at the higher frequency resolution. Conversely, reducing the frequency resolution reduces the ability of the prior subspace method to identify the
drum associated with the subspace. This is demonstrated by the fact that the bass drum
has a higher peak in the snare subspace than the snare itself when the frequency
resolution is reduced. However, the use of higher frequency resolution comes at a price: a corresponding reduction in the time resolution of the analysis, leading to inaccuracies in the
detected onset times.
Despite using high frequency resolution, it was found that the interference present
in the hi-hat subspace was in some cases considerably greater than that in the bass drum
or snare subspaces. This caused problems when trying to discriminate between genuine
hi-hat events and spurious events caused by interference due to the presence of pitched
instruments. This problem is illustrated in Figure 4.15, where, in some cases, the
interference has a normalised amplitude as large as some of the hi-hat events. This
appears to be because the hi-hat prior subspace has its energy spread out over a greater range of the spectrum than those of the snare and kick drum, making it more sensitive to the presence of pitched instruments.
In effect, PSD normalisation attenuates the low frequency energy of the signal, resulting in improved identifiability for hi-hats.
Using a higher number of subspaces returns a PSD closer to that obtained from Welch’s
method. The effect of PSD normalisation is illustrated in Figure 4.16, which contains the
PSD normalised hi-hat subspace recovered from the same excerpt used to obtain Figure
4.15. The hi-hats are much more readily identifiable, and most of the spurious peaks
visible in Figure 4.15 have been eliminated.
The PSD normalisation process used can be described in pseudo-code as follows:
1. Carry out an STFT on the input signal.
2. Obtain a magnitude spectrogram from the magnitude of the STFT values.
3. Obtain the PSD of the input signal, P, using the eigenvector method, ensuring that
the PSD has the same number of frequency bins as the STFT.
4. for i = 1 : N , where N = number of spectrogram frames
Pframe(i) = Frame(i)./P, where ./ signifies elementwise division, Frame(i)
and Pframe(i) are the ith spectrogram frame and the ith PSD normalised
spectrogram frame respectively.
end
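A hedged MATLAB sketch of these steps follows, with peig (the eigenvector-method pseudospectrum estimator in the Signal Processing Toolbox) standing in for the eigenvector method; the input signal x, sample rate fs, FFT size, hop size and signal subspace dimension p are all illustrative choices rather than the values used in the thesis.

    nfft = 4096; hop = 256; p = 40;
    [S, F, T] = spectrogram(x, hann(nfft), nfft - hop, nfft, fs);   % step 1
    Y = abs(S);                                                     % step 2
    P = peig(x, p, nfft, fs);          % step 3: same number of bins as the STFT
    Ynorm = Y ./ repmat(P(:) + eps, 1, size(Y, 2));                 % step 4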
In tests, PSD normalisation proved to be a more robust method for recovering the
hi-hats than high-pass filtering, successfully reducing the interference in a greater number
of cases than high-pass filtering. This is because PSD normalisation takes into account
the characteristics of the signal being analysed and highlights the regions of lower power
density in the signal being analysed, as opposed to blindly filtering the signal at a preset
cutoff frequency.
Once the amplitude envelopes for each subspace have been calculated, the next stage of
the PSA algorithm is to carry out ICA on the amplitude envelopes to obtain a set of
independent amplitude envelopes that correspond to each of the drums present. In many
cases, ICA succeeded in recovering the separated amplitude envelopes, and transcribed
the drums successfully. However, in some cases, the ICA step failed catastrophically.
This was due to the interference caused by the pitched instruments present in the signals, and exposes a problem inherent in ICA in general. ICA attempts to find a set of
signals that are as statistically independent as possible. ICA will recover the correct
source signals if the input signals contain only mixtures of the source signals. A limited
amount of interference or noise in the input signals can be tolerated, but too much noise
results in the recovery of a set of independent signals that do not correspond to the source
signals. This is because the presence of too much noise means that the input signals do
not correspond closely enough to mixtures of the source signals we wish to recover.
In an effort to eliminate the effects of the interference due to pitched instruments
from the snare and bass drum amplitude envelopes, all values in the amplitude envelope
below a set threshold are set to zero. A normalised amplitude of 0.4 was found to be a
suitable threshold for both the snare and kick drum. This operation is not carried out on
the hi-hats as the interference has been eliminated in the PSD normalisation step.
However, carrying out ICA on the resulting amplitude envelopes was still not successful
in all cases. This was due to the fact that the thresholding operation on the snare and bass
drum left only very sharp peaks with large regions in the amplitude envelope containing
no activity whatsoever. This is not a true representation of what happens when a snare or bass drum is played: part of the onset is eliminated, and a large portion of the decay of the drums is also lost. In contrast, the hi-hat amplitude envelope contains more
realistic onsets and decays and demonstrates virtually no regions of zero activity. When
these very different amplitude envelopes are input to an ICA algorithm the resulting
independent signals contain unusual artifacts, such as numerous, sudden, large amplitude
modulations. These modulations are in turn detected as events where none are present. In
an effort to eliminate this problem, it was decided to carry out ICA on only the snare and
bass drum amplitude envelopes, as they are comparable in that they both contain sharp
peaks and large areas of no activity. This resulted in the correct separation of bass drums
and snare drums in most cases. The hi-hat envelope is instead passed directly to the onset
detection algorithm. While carrying out the analysis in this manner gives good results in
general, it can result in extra errors in detection of hi-hats. As the hi-hat amplitude
envelope no longer undergoes ICA, the drum transcription algorithm loses the ability to
distinguish between a snare occurring on its own and a snare and hi-hat occurring
simultaneously. However, in many cases a hi-hat does occur simultaneously with the
snare, so this results only in a small reduction in the efficiency of the transcription
algorithm.
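A minimal sketch of this processing chain, assuming normalised envelopes envSnare, envKick and envHat as row vectors, and an ICA routine such as the freely available FastICA package (the fastica call is an assumption; any ICA implementation would serve):

    thresh = 0.4;
    envSnare(envSnare < thresh) = 0;       % suppress pitched-instrument noise
    envKick(envKick < thresh) = 0;
    sep = fastica([envSnare; envKick]);    % ICA on the membrane drums only
    % envHat bypasses ICA and goes directly to onset detection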
Unfortunately, the manipulations required to make transcription in the presence of pitched instruments possible dramatically affect the quality of the re-synthesis. Using the
amplitude envelopes used for transcription to re-synthesise the drums results in very poor
quality reconstruction of the original sounds. This is because in the case of the snare and
kick drums a large amount of information has been lost in the attempt to remove the
interference in the signal. In the case of the hi-hats, using the amplitude envelope
obtained from PSD normalisation results in a re-synthesised signal that contains
significant amounts of low frequency information such as from the other drums and
pitched instruments.
To test the ability of PSA to transcribe drums in the presence of pitched instruments, a
drum transcription system was implemented in MATLAB. This implementation can be
found in code\Psa\pitchPSA.m. The system implemented deals only with snares, bass
drums and hi-hats. Due to the source signal ordering problem in the ICA step, it is
assumed that the bass drum has a lower spectral centroid than the snare. The system was
tested on real world examples consisting of 20 excerpts taken at random from pop songs.
These songs were chosen to cover as wide a range of styles as possible from pop to disco
and rock. The duration of these excerpts varied from 1-3 seconds, and the type of drum
patterns in the songs varied widely. The drum patterns from these excerpts were
transcribed by the listener. In cases of ambiguity, filtering and sinusoidal modelling were
used to remove some of the effect of the pitched instruments to allow the listener to be
more accurate in judging the presence or absence of a given drum.
Because of the imperfect separation of the ICA step, the amplitude envelopes
were normalised and peaks over a given threshold were taken to be drum onsets. A
threshold of 0.5 was used for both snare and kick drum, and a threshold of 0.1 was used
for the hi-hats. This lower threshold for the hi-hats reflects the fact that the amplitude of
the hi-hats in real world examples can vary widely as the drummer accentuates some hi-
hat events to create the “groove” or “feel” of the pattern. Onset times were determined in
the same manner as PSA for drums only. The results are outlined in Table 4.5 below. A
detailed breakdown of the results from each excerpt is given in Appendix 1.
Though the results demonstrate the effectiveness of PSA as a method for
transcribing drums in the presence of pitched instruments, a greater number of errors
occur than for PSA with drums only. Possible reasons for this are discussed with regards
to each drum below.
Type     Total    Undetected    Incorrect    % Correct
Snare    57       1             9            82.5
Kick     84       4             7            86.9
Table 4.5: Drum Transcription Results – PSA in the presence of pitched instruments.
In the case of the bass drums, six snare events were incorrectly identified as bass
drums. Interestingly, these errors occurred in excerpts where a “disco” style of drumming
was employed. In these excerpts the snare drum is less bright than in the other genres of
music, resulting in a greater chance of incorrect identification. Only one of the
incorrect bass drum detections was as a result of a bass guitar note being identified as a
bass drum. The four undetected bass drum events were visible on the amplitude envelope of the excerpts in question, but were below the threshold for detection. The bass drums at these points were audibly quieter than the other bass drum events in the excerpts.
In the case of the snare drum, five of the incorrect snares were as a result of the
combination of a bass drum and a hi-hat occurring simultaneously being mistaken for
snares. This happened in two excerpts. The remaining errors occurred as a result of noise
due to pitched instruments.
With regards to the hi-hats, the majority of incorrect identifications were as a
result of interference that had not been eliminated in the PSD normalisation step. Other
errors brought to light an interesting problem. In two cases an event with the
characteristics of a hi-hat was clearly visible in both the spectrogram and the recovered
amplitude envelope, but no event of this type was audible to the listener. It may be that
these events are genuine hi-hat events that have been masked by other audio events, but
as there is no way of determining this for excerpts from commercial recordings, these
onsets have been classed as incorrect detections. In the case of the undetected hi-hats, the
majority of the hats were clearly visible in the amplitude envelopes, but fell below the
threshold required for identification. Further improvements in the results may be possible
by adjusting the thresholds for detection, but there is a trade-off between reducing the
number of incorrect identifications and increasing the number of missed events.
The results obtained compare favourably with those described by Virtanen in
[Virtanen 03]. Using equation 3.59 to convert the results shown in Table 4.5 to the error
rate used by Virtanen, error rates of 15.2% and 12.1% are obtained for snare and
bass drum. This compares with the error rates of 43% and 27% for snare and bass drum
obtained by Virtanen. It should also be noted that the results obtained using the modified
PSA algorithm were obtained on excerpts from commercial recordings as opposed to
audio signals synthesised from General MIDI sound sets. Excerpts from commercial
recordings represent a more difficult challenge to transcribe from, as the use of reverb
and other effects serves to make transcription more difficult. Further, unlike the algorithm
described by Virtanen, the modified PSA algorithm is capable of transcribing hi-hats in
the presence of pitched instruments.
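As a check on the arithmetic, the quoted percentages are consistent with an error rate of the form below, where U denotes undetected events, I incorrect detections and N the total number of events; this form is inferred from the figures in Table 4.5 rather than restated from equation 3.59:

    e = \frac{U + I}{N + I}, \qquad e_{\text{snare}} = \frac{1 + 9}{57 + 9} \approx 15.2\%, \qquad e_{\text{kick}} = \frac{4 + 7}{84 + 7} \approx 12.1\%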
Thus, despite the increased number of errors in comparison with the drums-only
case, PSA has proved to be a viable method for transcribing drums such as snares, bass
drums and hi-hats or cymbals in the presence of pitched instruments. However, the same
problems occur as with normal PSA when attempting to transcribe drums with large
frequency overlaps, such as snares and toms, or hi-hats and cymbals.
While PSA has proved effective in transcribing drums provided that there is not
significant frequency overlap between the sources, it cannot successfully transcribe
drums in cases where there is significant overlap. This causes problems when trying to
transcribe drum loops containing both snares and toms.
Fortunately, there is another source of information available which can be
exploited to improve the likelihood of a successful transcription in such circumstances.
Drum patterns typically consist of a small number of drum types which occur a number
of times to generate the drum pattern in question. Each time a given drum or combination
of drums occurs, the frequency spectrum at that point will be similar to that at other
occurrences of that drum or combination of drums. It is proposed to exploit this repetition
of sources by automatically modelling each event that occurs in a given drum loop,
generating similarity measures between each event, and then grouping similar events
together. Grouping the events in such a manner narrows down the possibilities by giving
a better indication of the number of sources present and so provides another means of
obtaining information which can be used to help transcribe the drums. Although a very
simple idea, this type of automatic modelling and grouping procedure has never
previously been incorporated into a drum transcription system. The use of automatic
modelling and grouping for drum transcription purposes was first demonstrated in
[FitzGerald 03c].
As the ISA-type analysis has proved successful in generating prior subspaces, it is
proposed to use this type of approach to automatically model the events that occur in a
drum loop or drumming performance. To model each event individually, it is necessary
to identify when an event occurs. To this end, a spectrogram of the input signal is
multiplied by prior frequency subspaces for both snare and kick drum. The resulting
amplitude basis functions are then normalised and all peaks above a set threshold are
taken to be a drum event. This is usually sufficient to identify all membrane drum onsets
including toms. The onset time of each event is then determined, and the sections of the
spectrogram between successive events are then analysed individually.
Principal Component Analysis is performed on each section and the first
frequency principal component from each section retained. This is usually sufficient to
eliminate the effects of any metallic plate drums which overlap with the membrane
drums. This is because, as noted in the previous chapter, PCA is a variance based
procedure, and so is biased towards the loudest sources present in the signal. As the
membrane drums are usually mixed louder than the metal plate drums then this serves to
reduce to a large extent the effects of the metal plate drums.
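A minimal sketch of this modelling stage, assuming a magnitude spectrogram Y and detected onset frames onsetFrames (names illustrative); the first left singular vector of each inter-onset section serves as its first frequency principal component:

    bounds = [onsetFrames(:); size(Y, 2) + 1];
    models = zeros(size(Y, 1), numel(onsetFrames));
    for k = 1:numel(onsetFrames)
        seg = Y(:, bounds(k):bounds(k + 1) - 1);   % frames for one event
        [U, ~, ~] = svd(seg, 'econ');
        models(:, k) = U(:, 1);                    % first frequency component
    end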
Figure 4.17 shows the similarity matrix obtained from analysing a drum loop
containing snare, kick drum and two different types of tom. Blue indicates that the events
are highly similar and red indicates regions of large dissimilarity. As can be seen events 1
and 3 are highly similar. These events correspond to occurrences of a kick drum. Events
2 and 4 correspond to occurrences of a snare drum. Events 6 and 7 correspond to two
occurrences of one of the toms, and event 5 is the other type of tom that occurs. Event 5
is closer to the other type of tom drum than to the snare and kick drums. It can be seen
that the similarity matrix shows the correct grouping of the events.
To group the events, the following procedure was used. Starting from the first
event, all events with a Euclidean distance of less than one from the first event are
grouped together and removed from the list of events remaining ungrouped. This
threshold was arrived at by observing the distances obtained between the various drums
in a number of examples. It is assumed that each event can belong to only one group. The
next ungrouped event is then chosen and the procedure is repeated until all events have
membership of a group. In cases where each event represented only a single drum this
amounted to the correct transcription of the drum loop. However, this is not usually the
case. Typically, a hi-hat or ride cymbal will occur with a membrane drum such as snare,
kick or tom. In some cases the membrane drums will also occur simultaneously.
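A minimal sketch of this greedy grouping, assuming the event models from the modelling stage form the columns of a matrix models (names illustrative):

    distThresh = 1;
    nEvents = size(models, 2);
    group = zeros(1, nEvents);                  % 0 = not yet grouped
    g = 0;
    while any(group == 0)
        g = g + 1;
        seed = find(group == 0, 1);             % next ungrouped event
        d = sqrt(sum((models - repmat(models(:, seed), 1, nEvents)).^2, 1));
        group(group == 0 & d < distThresh) = g; % group events near the seed
    end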
To deal with such combined events, frequency models are obtained for each of the groups, and all non-snare and kick events in the spectrogram
are masked. PSA is then performed on the resulting spectrogram and the snare and kick
drum events identified. The algorithm is still prone to errors from the overlap of toms
with other skinned drums, but this overlap is not a very common occurrence.
Power Spectral Density normalisation is then performed on the original
spectrogram to eliminate the effects of the membrane drums as much as possible. As
when transcribing drums in the presence of pitched instruments, the Power Spectral
Density is estimated using an eigenvector method. The PSD normalised spectrogram is
multiplied by a prior hi-hat subspace. This is sufficient to recover all metallic plate drum
events, such as hi-hats and cymbals. However, traces of both snare and tom drum events
will also appear in the resulting amplitude envelope, which could be detected as a plate
drum where none is present. To overcome this overlap, kick drum events are masked in
the original spectrogram, and the resulting spectrogram is multiplied by a snare frequency
subspace. ICA is then performed on the resulting amplitude envelope and that of the hi-
hat subspace. All events above a threshold in the resulting hi-hat envelope are then taken
as metallic drum events.
Automatic grouping is then carried out on the metallic plate drum events.
However, due to interference from other drums no simple threshold suffices for grouping
the drums. To overcome this and set an approximate threshold for the drums, a histogram
of the distances is obtained. The lower edge of the first histogram bin with no entry is
taken as the threshold. Events are then grouped as before using this threshold. If two
large groups occur that do not overlap in time then both hi-hat and cymbal are taken to
occur within the loop, and these groups are kept separated. Otherwise, all events are
grouped together. The justification for this is that most drummers tend to stay on either
hi-hat or ride cymbal for long periods, usually only changing when the piece or song
changes from one section to another, such as from verse to chorus. It is rare to hear a
drummer alternating between hi-hat and ride events in the course of a bar of music. As a
result, if overlapping groups occur, it is most likely to be the same metallic drum that has
been grouped into a number of groups due to interference from skinned drums. However,
as a result of this grouping strategy the algorithm is unable to detect the presence of either
crash cymbals or open hi-hats. Also, at present the transcription algorithm has no means
of distinguishing between hi-hats and ride cymbals, and so the groups are labelled
metallic drums 1 and 2.
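A minimal sketch of this histogram-based threshold selection, assuming a vector d of distances between the metallic drum events; the bin width is an illustrative choice, and a fallback would be needed should no bin be empty:

    binWidth = 0.1;
    edges = 0:binWidth:(max(d) + binWidth);
    counts = histcounts(d, edges);
    firstEmpty = find(counts == 0, 1);
    distThresh = edges(firstEmpty);            % lower edge of first empty bin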
The drum transcription algorithm was tested on 25 drum loops, with the number of
different drums (including different types of tom) in the loops ranging from three to
seven drums. Again, the drums were obtained from sample CDs and were chosen to
cover as wide a spread of drum sounds within a given drum type as possible. A wide
variety of different drum patterns and drum fills were used. The tempos used ranged from
80 bpm to 150 bpm and different meters were used, including 4/4, 3/4 and 12/8. The relative amplitudes between the drums varied from 0 dB to –24 dB to make the tests as
realistic as possible. The same analysis parameters were used on all test signals. The
results are summarised in Table 4.6, with the percentage correctness again calculated as
per equation 4.1.
Type     Total    Undetected    Incorrect    % Correct
Snare    40       0             0            100
Kick     64       3             1            93.8
Toms     31       3             4            77.4
The errors in detecting the metallic plate drums were as a result of incorrect separation of the metallic and snare/tom subspaces. In cases
where both hi-hat and ride cymbal were present in the same loop, the drums were
grouped correctly together.
The automatic grouping performed remarkably well on the skinned drums. All
events passed to the grouping stage were in fact correctly grouped, with any errors in the
transcription process occurring elsewhere in the algorithm. This demonstrates the
effectiveness of the grouping methodology as a tool for drum transcription.
Comparing the results obtained with those of Paulus [Paulus 03], an error rate of
10.3% is obtained. This is significantly better than the lowest error rate of 49.7%
obtained by Paulus, especially in light of the fact that no form of rhythmic knowledge,
such as the use of N-grams and prior probabilities, was incorporated into the system.
However, it should be noted that the system described by Paulus can deal with more than
one type of cymbal, as well as attempting to classify any miscellaneous percussion
instruments which occur to a general percussion classification.
The combination of automatic modelling and grouping in conjunction with PSA
extends the number of drums that can successfully be transcribed in a given drum loop.
However, there are still some open issues that have to be resolved. Firstly, the system
cannot deal with overlaps between toms and other membrane type drums such as snare
and bass drum. While a relatively uncommon occurrence, it still remains an issue to be
addressed. Also to be addressed is the issue of dealing with the occurrence of open hi-
hats and crash cymbals, which at present the algorithm cannot detect. Also, as a result of
the automatic modelling stage, which involves running one PCA step per event detected, the algorithm is considerably slower than PSA or even sub-band ISA.
Nevertheless, the drum transcription system proposed above marks an
improvement over previous systems in that it has been evaluated, with defined results
available showing good performance in a wide range of circumstances. It should also be
noted that these results were achieved without any form of rhythmic modelling or
incorporating models of common drum patterns, and that once grouping and separation
has taken place, the use of simple heuristics is usually sufficient to enable the drums to be
identified.
It has been shown that combining the abilities of redundancy reduction approaches, in
particular ISA, with the model based approaches used in previous drum transcription
systems can produce systems which overcome some of the problems previously
encountered in drum transcription systems. This hybrid approach also overcomes some of
the limitations of the purely blind separation methods such as ISA, namely the problem
of estimating how much information to retain in the dimensional reduction stage of ISA.
A number of approaches that allowed the incorporation of prior knowledge into
redundancy reduction based methods were proposed and investigated. Firstly, the use of sub-band pre-processing was proposed, with the sub-bands tailored to separate information relating to the membrane drums from that relating to the plate drums. This was found to be effective
in transcribing mixtures of snare, bass drum and hi-hats or ride cymbals.
A more efficient and elegant means of incorporating prior knowledge was then
proposed. The resulting technique, Prior Subspace Analysis, made use of prior models of
the frequency spectra of the sources of interest to allow decomposition of a spectrogram
without the use of PCA. This had two main advantages over ISA and sub-
band ISA. Firstly, it avoided the bias towards louder sounds inherent in using PCA to
decompose a spectrogram, and so overcame the problem of estimating the optimal
number of components to keep from the dimensional reduction stage. Secondly, it relaxed
the assumption that the entire spectrogram had to be stationary in pitch, instead limiting
the stationarity assumption to the sources being searched for. As was noted, this is a valid
assumption for drum sounds. This relaxation allows PSA to attempt to transcribe drums
in the presence of pitched instruments.
The robustness of the PSA method was investigated through the use of synthetic
test signals and the method was found to be robust in a range of conditions which could
reasonably be expected to occur in many real world situations. However, the synthetic
tests also highlighted a potential weakness in the method, namely that it was prone to
failure if the frequency overlap of the main regions of energy in the frequency spectra
between sources was too high.
PSA was then tested on a number of drum loops and was found to be effective in
transcribing drums provided that the frequency overlap between the sources was not too
large. This result was in line with that expected from the synthetic test results. In practical
terms, this meant that PSA was able to robustly transcribe mixtures of snares, bass drums,
and hi-hats or cymbals. As these are the most commonly occurring drum types, this means
that PSA is useful for drum transcription in many circumstances.
Next, the transcription of drums in the presence of pitched instruments using PSA
was investigated. It was found that a number of modifications were required to allow
successful transcription to occur. The use of high frequency resolution was required to
ameliorate the effects of partial matches to the prior subspace due to the presence of
pitched instruments. While the use of high frequency resolution combined with a
thresholding method was found to be sufficient for the identification of snare and bass
drums, it was found to be less effective in allowing the detection of hi-hats or cymbals.
As a result, Power Spectral Density normalisation was used as a means to create a high-
pass filter specifically tailored to the signal being analysed. This allowed recovery of the
hi-hats or cymbals. The resulting transcription algorithm was found to be effective in
transcribing snare, bass drum and hi-hats or cymbals in the presence of pitched
instruments, though with some degradation in performance when compared to the drums
only case.
To overcome the problem of dealing with sources with large degrees of frequency
overlap in their main regions of resonance, the use of automatic modelling and grouping
was proposed. This allowed drum transcription to be successfully carried out on a larger
number of drums, such as several types of tom and both hi-hats and cymbals in the same
loop. However, there are still limitations and areas which remain to be addressed, namely
that the system currently assumes that toms do not overlap with other membrane drums,
and that the system cannot identify crash cymbals and open hi-hats. Also, to date attempts
to extend this method to work in the presence of pitched instruments have not met with
success. Despite these limitations, the system proposed represents an advance on
previous systems in that it has been evaluated and demonstrated to work in a wide range
of circumstances.
Re-synthesis of Separated drum sounds
Having demonstrated new techniques for the transcription of drum sounds in the previous
chapter, it should be noted that techniques such as PSA do allow re-synthesis of the
separated sounds to be attempted. However, the re-synthesis quality obtained from PSA
is in many cases quite poor. The reason for this lies in the fact that the PSA method
assumes that drum sounds can be approximated by the outer product of a single
frequency vector with a single time vector. While this assumption is sufficient for
transcription purposes, it also means that a large amount of information related to the timbre of each drum sound has been discarded.
This can be demonstrated by noting that the first set of outer products returned
after carrying out SVD on the spectrogram of a drum sound contains the largest amount
of the total variance that is possible for a decomposition of a given drum sound into a
sum of outer products. Therefore, this represents the upper limit in the amount of
information that can be retained in a single set of outer products, and so is the upper limit
on the possible amount of information contained in the vectors obtained using PSA.
To put a figure on this upper limit, SVD was carried out on large numbers of
snares, kick drums and hi-hats. To ensure the same number of components was obtained
from each sample the length of each of the samples was set at one second, and the same
parameters were used when carrying out the STFT on each sample. A normalised
cumulative sum of the singular values obtained was then calculated for each example.
This was calculated from:
\phi_\rho = \frac{\sum_{i=1}^{\rho} \sigma_i}{\sum_{i=1}^{n} \sigma_i} \qquad (5.1)
where σi is the singular value of the ith component, φρ is the cumulative sum for
component ρ and n is the total number of components. For snares, on average 47.8% of
the total variance was contained in the first set of components, 55.2% for kick drums and
49% for hi-hats. This represents a considerable amount of discarded information which
results in poor re-synthesis of the separated drum sounds. This is further demonstrated in
Figure 5.1, which shows a plot of the average proportion of variance retained versus the
number of components for snare, kick drum and hi-hat. It can then be seen that a means
of increasing the amount of information retained is necessary to obtain improved re-
synthesis. A number of methods of doing this are discussed in this chapter.
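A minimal sketch of this measurement for a single one-second sample x at sample rate fs, with illustrative STFT parameters:

    [S, ~, ~] = spectrogram(x, hann(1024), 768, 1024, fs);
    sv = svd(abs(S));                 % singular values, largest first
    phi = cumsum(sv) / sum(sv);       % phi(rho) as in equation 5.1
    phi1 = phi(1);                    % share retained by one outer product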
A larger amount of information relating to a given source can be obtained by carrying out
ISA with a larger number of components than the three components obtained using PSA.
This is especially the case for sources with large amplitudes, such as is usually the case
for both snare and kick drum. This was demonstrated in Figure 3.12 where it can be seen
that there are four components that are visibly related to the kick drum and six
components related to the snare. It should also be noted that some information relating to
the hi-hats has been recovered, though this information is mainly related to those hi-hats
which are not overlapped by the kick drum and snare. This extra information could be
used to improve the re-synthesis of the transcribed drum sounds.
The onsets of the components associated with a given drum do not always coincide with the onsets detected by the transcription algorithm. This is demonstrated in Figure 5.2, which shows six
components related to a snare drum separated from a drum loop. The components shown
have been normalised to between 0 and 1. In this particular case there is a gap of 8 frames
between the actual onset and the last of the components associated with the snare, though
wider gaps have been observed.
Each component is then scored against the transcribed drum events, with the peaks detected in the components treated as if the components are the output of the transcription process, in effect “transcribing” the transcription. This results in an n x m matrix where n is the number of transcribed drums
and m is the number of components. As noted in section 4.3 this measure is quite severe
in punishing spurious detections, resulting in low and possibly negative scores for
components whose events do not match those of the drum of interest.
To cluster the components to the sources, any component with a score of 0.5 or
greater for a particular drum sound is taken as belonging to that drum sound. If after this
a component has been allocated to more than one source then the source with the highest
score is chosen as the actual source. This ensures that a component cannot be allocated to
more than one source. Carrying out clustering in this way means that there may be a
number of components that are not allocated to any source. This is justifiable in that the
components that are not allocated usually turn out to be noisy in the sense that the
information contained cannot be clearly stated as belonging to a given source. Re-
synthesis of the separated sources was then carried out as described previously in section
3.3. It was discovered that for drum sounds the re-synthesis was more realistic sounding
when reusing the phase information from the original spectrogram than when using the
spectrogram inversion technique described in Section 3.3. See code\resynth\transclust.m
for the MATLAB implementation of the above clustering algorithm.
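A minimal sketch of this allocation rule, assuming the n x m score matrix scores described above (one row per transcribed drum, one column per component):

    [best, drum] = max(scores, [], 1);       % best-scoring drum per component
    assign = zeros(1, size(scores, 2));      % 0 = component left unallocated
    keep = best >= 0.5;
    assign(keep) = drum(keep);               % at most one source per component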
To test the effectiveness of the clustering algorithm, the algorithm was tested on
the same 15 signals used to test PSA. The number of components retained in ISA was 15,
and the correct clustering was identified by an observer. The clustering method was
found to work very well for both snare and kick drum with a score of 97.7% obtained for
the clustering of components to the kick drum, and 95.2% for snare components. Half of
the incorrect classifications were found to be due to components with onsets outside the
20 frame window, and the remainder were due to the clustering algorithm including components which the observer felt were too noisy. The algorithm
performed less well for the hi-hats. This is again as a result of the limitations of ISA in
dealing with sources of low relative amplitude, as was discussed in section 3.3.2. It was
found that, even after retaining 15 components, in over half of the cases there was no
component which could be identified with the hi-hats. In these cases, better re-synthesis
of the hi-hats can be obtained using the single frequency vector and time vector obtained
using PSA, even though the re-synthesis quality will still be poor due to the limitations on
the amount of information that can be recovered by a single pair of vectors as was
described at the start of this chapter. This is because PSA can specifically target the
sources of low relative amplitude, as opposed to trying to blindly extract them in the
manner of ISA.
As the use of the transcription results neatly overcomes the clustering problem, LLE-based ISA [FitzGerald 03] can be used in an attempt to better recover sources of low relative amplitude. In many cases, this does result in improved recovery of these sources, but there are still some problems with the re-synthesised sources. In particular, LLE-based ISA was often found to recover good examples of hi-hats and cymbals which are not overlapped by other drum events in cases where standard ISA recovers nothing.
Also of interest is the fact that the kick drums recovered using LLE-based ISA contain
less high-frequency information, indicating that LLE-based ISA is better at separating hi-
hats from kick drums. Unfortunately, this comes at the expense of reduced power and
brightness in the recovered snares and toms.
The clustering method can also be used to overcome the problem of attempting
separation and re-synthesis that occurred when adapting PSA to work in the presence of
pitched instruments. The transcription can be carried out as described in section 4.5 and
then standard ISA performed on the excerpt. It was found that retaining 7-20 components
often resulted in a number of components which could be ascribed to the snare and kick
drum. The transcription results can be used to identify any components that relate to a
given transcribed drum. However, the number of components required to obtain
components which could be ascribed to these drums was found to vary from signal to
signal, thus making it difficult to set up a system for the automatic source separation of
drums in the presence of pitched instruments. Also, as expected with using ISA, the
method only works for sources of high relative amplitude, i.e. snare and kick drum, and
the re-synthesis quality is much lower than that achieved in the drums only case. The re-
synthesis obtainable is shown in Figure 5.3, which shows the results obtained when
separating snare and kick drum from an excerpt from “Easy Lover” by Phil Collins and
Philip Bailey. The resynthesised sources can be found in Appendix 2 on the
accompanying CD.
It is interesting to note that the drum sound source separation scheme has now
become a two-stage process in the same manner as the sinusoidal sound source separation
schemes described in [Virtanen 01b], namely transcription first, and then sound source
separation and re-synthesis based on the transcription results. It can be seen that carrying
out transcription first results in the availability of extra information to guide the source
separation method.
It should be noted, however, that despite extra information to guide the clustering,
the sound quality of the re-synthesised drums can still vary from example to example.
Also, in many cases the method does not recover sources of low relative amplitude such
as hi-hats. Even the use of LLE-based ISA does not capture these sounds in all cases, and even where it does, the recovery of overlapped low amplitude sources is still poor. When
transcription based clustering fails to re-synthesise these sounds correctly a different
approach to re-synthesis is necessary. Other possible re-synthesis techniques for these
sources are described below.
Figure 5.3: Original excerpt and separated snare and kick drums
A binary mask was created for each drum, with the mask set to one a small number of frames before the recorded onset of a drum event. This offset from the detected onset was included as the use of overlapping windows led to the smearing of the actual onset over
a number of frames, and also in an attempt to overcome any discrepancies between the
actual event onset and the detected onset. The mask returned to zero 30 frames after the
mask began, or 3 frames before the start of the next snare or kick drum event, whichever
was the smaller. This proved to capture the majority of the drum sounds, though in some cases the end of the drum was lost. While increasing the size of the region where the mask has a value of 1 would eliminate this, it would be at the expense of increased noise getting through in other cases.
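A minimal sketch of the mask construction, assuming detected onset frames onsets and a re-synthesised spectrogram spec with nFrames frames; the exact pre-onset offset is not specified here, so pre is an illustrative value:

    pre = 2; len = 30; guard = 3;
    mask = zeros(1, nFrames);
    for k = 1:numel(onsets)
        mStart = max(1, onsets(k) - pre);
        mEnd = mStart + len;                       % off 30 frames after start
        if k < numel(onsets)
            mEnd = min(mEnd, onsets(k + 1) - guard);
        end
        mask(mStart:min(mEnd, nFrames)) = 1;
    end
    maskedSpec = spec .* repmat(mask, size(spec, 1), 1);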
Figure 5.5. Snare drum waveform with and without binary masking
The binary masking was found to eliminate large amounts of extraneous noise
from the signals for both snare and kick drum. Figure 5.5 shows the output waveform of a
separated snare drum both with and without masking. The relevant audio examples can
be found in Appendix 2. It can be seen that large amounts of extraneous noise have been
eliminated. Binary masking was found to be less useful in the case of hi-hats or cymbals
recovered from the clustering stage. This is because these drums tended to occur more
frequently than the snare and kick drum, and so the mask remained on for a greater
proportion of the signal. Also, as was noted in the previous section, the recovery of the
hi-hats using the clustering of independent components was often poor to begin with.
Binary masking was also found to be effective in reducing noise present in drums
recovered in the presence of pitched instruments.
The use of binary masking is not just limited to the elimination of noise from the
separated drum sounds. As was previously discussed in section 2.6.3, binary masking can
be used to separate sound sources provided that the time-frequency representations of the
sources do not overlap. This condition is known as W-disjoint orthogonality (W-DO),
and was used as a means of separating speech signals in [Rickard 01]. The algorithm
described needed two input signals to estimate the parameters necessary to obtain the
binary time-frequency masks. It was also noted by Rickard et al that source separation
was theoretically possible with one input signal, but that at present there was no way of
estimating the binary masks needed to do so.
In the case of drums transcribed from audio signals the transcription details can be
used in certain circumstances to generate approximate binary masks for sources. As
drums such as hi-hats and ride cymbals tend to occur with greater frequency than snares
and kick drums it follows that there will often be occasions where a hi-hat or ride cymbal
occurs without another drum occurring at the same time. In such cases creating a binary
mask in the manner described above, but using the original spectrogram instead of a re-
synthesised spectrogram, can successfully isolate an example or examples of the drum in
question. Re-synthesis of the drum in question can then be attempted by copying the
example to regions where the drum has been detected, but occurs simultaneously with
another drum. However, simply copying the example results in a number of drums of the same amplitude, which will not provide the same rhythmic feel as the original drums, whose amplitudes vary depending on when they are played. In an attempt to preserve these variations, the amplitudes of the copied drums are scaled by the ratio of the amplitude of the original drum to that of the copied example. This amplitude information is available from the output of the transcription
algorithm. See code\resynth\resynthplate.m on the accompanying CD for a MATLAB
implementation of the binary masking resynthesis algorithm.
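A minimal sketch of this copy-and-scale step, assuming an isolated example spectrogram exSpec with transcribed amplitude exAmp, transcribed amplitudes eventAmp and onset frames eventFrames for the overlapped occurrences, and an output spectrogram of nBins by nFrames (all names illustrative):

    out = zeros(nBins, nFrames);
    for k = 1:numel(eventFrames)
        idx = eventFrames(k):min(eventFrames(k) + size(exSpec, 2) - 1, nFrames);
        scale = eventAmp(k) / exAmp;           % transcribed amplitude ratio
        out(:, idx) = out(:, idx) + scale * exSpec(:, 1:numel(idx));
    end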
This method can also be used on any drum that does not have another drum
occurring simultaneously, and the binary masking was found to work quite well provided
that the drums occur far enough apart so that the decay of one drum does not still sound under the start of the next. This means that this method is most suited to
lower tempos and where the smallest distance between events is a half-beat as opposed to
a quarter-beat. Nevertheless, even in cases where the decay can still be heard, the
reduction in the presence of noise from other drums is still considerable.
Figure 5.7. Spectrogram of hi-hats recovered using binary masking and amplitude scaling
Figure 5.8 shows the waveforms obtained for the separation of the drum loop
shown in Figure 5.6 using transcription-based clustering for snare and kick drum
combined with binary masking for the hi-hats. As can be seen, the re-synthesised
waveform from the combined separated sources has captured the essence of the original
waveform, while the sources have been successfully separated. This shows that the
overall quality of re-synthesis using the combined methods can be quite good. It should
be noted that the plots of the separated sources have been normalised in the figure. The
resynthesised sources can again be heard in Appendix 2 on the accompanying CD.
In cases where all the hi-hats and ride cymbals occur simultaneously with another
event, the re-synthesis of these sources becomes more problematic, as there will be no time region of the spectrogram where an isolated example of the source can be extracted. Nevertheless, binary time-frequency masks are still of use in such situations.
Again, the fact that the transcription is available and that the sources have been identified
allows the extraction of useful information for re-synthesis. In most drum loops where all
the hi-hats or cymbals are overlapped with another drum there will be at least one
instance where an overlap occurs with a kick drum only. As previously demonstrated in
Figure 4.3 the kick drum contains most of its energy in the lowest part of the spectrum.
Therefore, for a large region of the spectrum the effects of the kick drum can be
considered negligible and the bins of the spectrogram can be considered as belonging to
the hi-hat or ride cymbal. In effect, the sources can be considered approximately W-DO
for large regions of the frequency spectrum. Therefore, the upper part of the frequency
region will contain information which can be used to re-synthesise these sources. Testing
showed that setting a cutoff frequency of 5000 Hz eliminated practically all traces of the
kick drum. Therefore, creating a time-frequency binary mask which covers a region in
time, as was described previously, but which only retains bins with frequencies above 5000 Hz, will allow recovery of a high-pass filtered version of the hi-hat.
There remains the problem of estimating the lost information from the lower
regions of the spectrogram. A simple way to do this is to use the lower regions of the
spectrogram recovered from the PSA time and frequency vectors. This results in a more
realistic re-synthesis across the entire frequency spectrum, though the sound quality is
poorer than in the unoverlapped case.
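A minimal sketch of this hybrid reconstruction, assuming the original magnitude spectrogram Y with bin frequencies F, a time mask mask constructed as described earlier, and the PSA hi-hat frequency and time vectors fHat and tHat (all names illustrative):

    cut = find(F >= 5000, 1);                      % first bin at or above 5 kHz
    hatSpec = zeros(size(Y));
    hatSpec(cut:end, :) = Y(cut:end, :);           % high-pass part from original
    psaSpec = fHat(:) * tHat(:)';                  % rank-one PSA estimate
    hatSpec(1:cut - 1, :) = psaSpec(1:cut - 1, :); % low-frequency fill from PSA
    hatSpec = hatSpec .* repmat(mask, size(Y, 1), 1);   % keep masked frames only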
Also investigated was the use of a binary mask to eliminate one of the sources in
the original spectrogram and then passing the resulting spectrogram to the ISA algorithm.
This was done to see if extra information on the sources of low relative amplitude could
be obtained by eliminating one of the sources of high relative amplitude, such as the
snare. However, this was found to give little or no improvement in the recovery of the
low amplitude sources and so was eliminated as an option for re-synthesis.
Unfortunately, the binary masking techniques for sound source separation of
drums described above cannot be extended to work in the presence of pitched
instruments. This is because, while there may be drums which are unoverlapped by other
drums, it cannot be guaranteed that the drums will not be overlapped by the presence of a
number of pitched instruments.
5.3 Conclusions
Having shown the limitations of PSA for the re-synthesis of separated sound sources, a
number of options for improved re-synthesis of the transcribed drums were discussed and
implemented. Transcription based clustering is shown to result in improved re-synthesis
when compared to that obtained using ISA (both standard and LLE-based) with the
original clustering method, and also to that obtained from PSA. This is particularly true for
sounds of high relative amplitude such as snare and kick drums. The results obtained for
sources of low relative amplitude such as hi-hats were found to vary from signal to
signal. In cases where standard ISA fails to capture sources such as the hi-hats, it was found that re-synthesis based solely on the PSA time and frequency vectors is often better than that obtainable from standard ISA. LLE-based ISA was often found to perform better still, giving relatively good recovery of low amplitude events when these are not overlapped with louder events. Transcription based clustering
was also shown to be useful in separating drums from the presence of pitched
instruments, though at the expense of a loss in sound quality over the drums-only case.
The problem of estimating how many components are required for separation again rears
its’ head in such cases, making fully automatic sound source separation of drums in the
presence of pitched instruments problematic.
Conclusions And Future Work
Binary time-frequency masking was also employed to enable recovery of sources of low amplitude which may not have
been recovered adequately using subspace methods.
Also presented were two novel reformulations of ISA, the first using a technique
called Locally Linear Embedding for dimensional reduction instead of Principal
Component Analysis. This is shown to be capable of capturing sources of low amplitude
better than the variance based approach of PCA. A reformulation of ISA to achieve
independence in both time and frequency, instead of either time or frequency
individually, was also demonstrated, though this was found to give little or no
improvement over the standard ISA model.
The above work represents a considerable advance in tackling the problem of
polyphonic percussion transcription, and has overcome many of the problems inherent in
previous systems. We have demonstrated that simple models of the sources in
conjunction with a statistical approach to source separation can lead to successful
transcription of polyphonic percussion music, and that the transcription can be used to
obtain reasonable quality re-synthesis of the individual sources through the use of
subspace methods and binary time-frequency masking.
Despite the success of the systems described in this thesis, there remain a number of
open issues and future directions for research in the transcription of polyphonic
percussive music.
The systems described are designed to work on a limited subset of percussion
instruments, and an obvious direction for future work is the extension of the methods
proposed to deal with increased numbers of different types of percussion instruments.
The systems described herein use simple heuristics and rules to identify most of the
sources as part of the transcription process, and as the number of sources increases it
becomes more difficult to use such rules to identify the sources. To allow further
expansion of the techniques developed would require the incorporation of some formal
method of percussion instrument recognition, such as that described by Herrera et al for
single sources in [Herrera 03]. It is envisaged that such a scheme would function best
after the re-synthesised sources have been obtained, as the re-synthesised sources contain
more information than the automatic models generated as part of transcription. Such a
system would also allow discrimination between open and closed hi-hats, as well as
between different types of cymbal, a feature which is currently not incorporated into the
transcription algorithms described in this thesis.
The techniques described in this thesis have been shown to be capable of
transcribing snare, kick drum, and hi-hat or cymbal in the presence of pitched
instruments. Extending this ability to include the remaining drums dealt with in this thesis
would be a first step towards further generalisation of the transcription of percussion
instruments in the presence of pitched instruments. While the extension of the automatic
modelling and grouping technique to deal with the presence of pitched instruments has
to-date proved difficult, it is felt that an improved means of automatic grouping would
prove beneficial in extending this approach.
With regard to source separation, it is felt that the incorporation of a non-negativity constraint in the Independent Component Analysis (ICA) stage of the separation algorithms would be useful in obtaining better results from the transcription process. Work on incorporating such a constraint into ICA has been carried out in [Plumbley 01]. The ICA algorithms used in this thesis sometimes give results which, while capturing a general description of a source, can also include aspects which are implausible in real-world situations; a non-negativity constraint would help eliminate some of these potential sources of error, and may also lead to better quality re-synthesis of the separated sources.
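The effect of such a constraint can be sketched as follows. Note that this sketch uses the multiplicative updates of non-negative matrix factorisation as a stand-in, rather than the non-negative ICA algorithm of [Plumbley 01]; it illustrates only how constraining the factors to remain non-negative keeps every component physically plausible as spectral energy.

    import numpy as np

    def nonneg_decompose(V, n_components, n_iter=200, eps=1e-9):
        # V: non-negative magnitude spectrogram (bins x frames).
        rng = np.random.default_rng(0)
        W = rng.random((V.shape[0], n_components)) + eps  # spectral bases
        H = rng.random((n_components, V.shape[1])) + eps  # activations
        for _ in range(n_iter):
            # Multiplicative updates: W and H stay non-negative by
            # construction, so no component can contain negative
            # spectral energy.
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H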
At present the algorithms described in this thesis are all implemented in Matlab. A useful area for future work would be their implementation in C++, which would considerably reduce the time required to run them. Also, as currently implemented, the algorithms require a given signal to be processed in batch mode. However, the use of an on-line ICA algorithm, such as that described by Amari et al. in [Amari 96], would potentially allow PSA in particular to be implemented in real-time. At present, however, the automatic modelling and grouping methods still require batch processing.
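A minimal sketch of what such an on-line update could look like, in the spirit of the natural-gradient rule of [Amari 96], is given below; the learning rate and nonlinearity are assumptions, and a practical system would also need to supply spectrogram frames on-line.

    import numpy as np

    def online_ica_step(W, x, lr=0.01):
        # One stochastic natural-gradient update of the unmixing matrix
        # W from a single observation vector x, avoiding the need for
        # batch processing.
        y = W @ x
        g = np.tanh(y)   # score nonlinearity for super-Gaussian sources
        W += lr * (np.eye(len(y)) - np.outer(g, y)) @ W
        return W

    # Called once per incoming frame, W gradually converges towards an
    # unmixing solution, which is what would permit a real-time PSA.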
Appendix 1: Drum Transcription Results From Song Excerpts
Bibliography
[Abdallah 01] Abdallah, S.A. and Plumbley, M.D. “If edges are the independent components of natural images, what are the independent components of natural sounds?”, Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation (ICA2001), San Diego, California, December 9-13, 2001, pp. 534-539.
[Abdallah 02] Abdallah, S.A. “Towards Music Perception by Redundancy Reduction and Unsupervised Learning in Probabilistic Models”, Ph.D. Thesis, King’s College London, 2002.
[Abdallah 03] Abdallah, S.A. and Plumbley, M.D. “An Independent Component Analysis Approach to Automatic Music Transcription”, 114th AES Convention, Amsterdam, March 2003.
[Amari 96] Amari, S., Cichocki, A. and Yang, H.H. “A New Learning Algorithm for Blind Signal Separation”, Advances in Neural Information Processing Systems 8, eds. D. Touretzky, M. Mozer and M. Hasselmo, MIT Press, Cambridge, MA, 1996.
[Amari 98] Amari, S. “Natural gradient works efficiently in learning”, Neural Computation, 10(2), pp. 251-276, 1998.
[Atick 90] Atick, J.J. and Redlich, A.N. “Towards a theory of early visual processing”, Neural Computation, 2, pp. 308-320, MIT Press, Cambridge, MA, USA, 1990.
[Attneave 54] Attneave, F. “Informational aspects of visual perception”, Psychological Review, 61, pp. 183-193, 1954.
[Bach 02] Bach, F.R. and Jordan, M.I. “Kernel Independent Component Analysis”, Journal of Machine Learning Research, 3, pp. 1-48, 2002.
[Bailly 98] Bailly, G., Bernard, E. and Coisnon, P. “Sinusoidal modelling”, Cost258 Workshop, Vigo, Spain, November 1998.
[Barlow 59] Barlow, H. “Sensory mechanisms, the reduction of redundancy, and intelligence”, National Physical Laboratory Symposium No. 10, The Mechanization of Thought Processes, 1959.
[Casey 02] Casey, M.A. “Generalized Sound Classification and Similarity in MPEG-7”, Organised Sound, 6(2), 2002.
[Cemgil 00] Cemgil, A., Kappen, B., Desain, P. and Honing, H. “On tempo tracking: Tempogram representation and Kalman Filtering”, Proc. International Computer Music Conference, 2000.
[Chafe 86] Chafe, C. and Jaffe, D. “Techniques for Note Identification in Polyphonic Music”, Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1986.
[Chechik 01] Chechik, G., Globerson, A., Tishby, N., Anderson, M., Young, E. and Nelken, I. “Group Redundancy Measures reveal Redundancy Reduction in the Auditory Pathway”, NIPS 2001.
[Comon 94] Comon, P. “Independent component analysis - a new concept?”, Signal Processing, 36, pp. 287-314, 1994.
[DeLathauwer 99] De Lathauwer, L., Comon, P., De Moor, B. and Vandewalle, J. “ICA algorithms for 3 sources and 2 sensors”, IEEE Signal Processing Workshop on Higher Order Statistics, June 14-16, 1999, pp. 116-120.
[Depalle 97] Depalle, Ph. and Hélie, T. “Extraction of Spectral Peak Parameters Using a Short-Time Fourier Transform And No Sidelobe Windows”, IEEE 1997 Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, New York, 1997.
[Desainte 00] Desainte-Catherine, M. and Marchand, S. “High-Precision Fourier Analysis of Sounds Using Signal Derivatives”, Journal of the Audio Engineering Society, Vol. 48(7), July/August 2000.
[Ding 97] Ding, Y. and Qian, X. “Processing of Musical Tones Using a Combined Quadratic Polynomial-Phase Sinusoid and Residual (QUASAR) Signal Model”, J. Audio Eng. Soc., Vol. 45, No. 7/8, July/August 1997.
[Doval 93] Doval, B. and Rodet, X. “Fundamental Frequency Estimation and Tracking using Maximum Likelihood Harmonic Matching and HMMs”, Proc. IEEE-ICASSP 93, pp. 221-224, 1993.
[Dubnov 98] Dubnov, S. and Rodet, X. “Timbre Characterisation and Recognition with Combined Stationary and Temporal Features”, Proc. ICMC 98, 1998.
[Hofmann 97] Hofmann, T. and Buhmann, J.M. “Pairwise data clustering by deterministic annealing”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), pp. 1-14, 1997.
[Hoyer 02] Hoyer, P.O., “Non-negative sparse coding” Neural Networks for Signal
Processing XII (Proc. IEEE Workshop on Neural Networks for Signal
Processing), pp. 557-565, Martigny, Switzerland, 2002.
[Huber 85] Huber, P.J. “Projection pursuit”, The Annals of Statistics, 13(2), pp. 435-475, 1985.
[Hyvärinen 99] Hyvärinen, A. “Fast and robust fixed-point algorithms for independent component analysis”, IEEE Transactions on Neural Networks, 10(3), pp. 626-634, 1999.
[Hyvärinen 99a] Hyvärinen, A. “Survey on independent component analysis”, Neural Computing Surveys, 2, pp. 94-128, 1999.
[Hyvärinen 00] Hyvärinen, A. and Hoyer, P. “Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces”, Neural Computation, 12(7), pp. 1705-1720, 2000.
[Hyvärinen 00a] Hyvärinen, A. and Oja, E. “Independent Component Analysis: Algorithms and Applications”, Neural Networks, 13(4-5), pp. 411-430, 2000.
[Hyvärinen 00b] Hyvärinen, A. “New approximations of differential entropy for independent component analysis and projection pursuit”, Advances in Neural Information Processing Systems, volume 10, pp. 273-279, 2000.
[Jolliffe 86] Jolliffe, I.T. “Principal Component Analysis”, Springer-Verlag, New York, 1986.
[Jørgensen 01] Jørgensen, M., http://www.daimi.au.dk/~pmn/spf02/CDROM/pr4/
[Kashino 95] Kashino, K., Nakadai, K., Kinoshita, T. and Tanaka, H. “Application of Bayesian probability network to music scene analysis”, Proceedings of the International Joint Conference on AI, CASA workshop, 1995.
[Kendall 87] Stuart, A. and Ord, J.K. “Kendall’s Advanced Theory of Statistics, Vol. 1: Distribution Theory”, 5th Edition, 1987.
[Klapuri 98] Klapuri, A. “Automatic Transcription of Music”, M.Sc. thesis, Tampere University of Technology, 1998.
[Master 03] Master, A. “Sound source separation of N sources from stereo signals via
fitting to N models each lacking one source”, Stanford University EE391
Winter Report 2003
[McAulay 86] McAulay, R.J. and Quatieri, T.F. “Speech analysis/synthesis based on a sinusoidal representation”, IEEE Trans. on Acoustics, Speech and Signal Processing, 34(4), pp. 744-754, 1986.
[McAuley 95] McAuley, J. “Perception of Time as Phase: Towards an Adaptive-
Oscillator Model of Rhythmic Pattern Processing”, PhD thesis, Indiana
University, 1995.
[Moore 97] Moore, B., Glasberg, B. and Baer, T. “A Model for the Prediction of Thresholds, Loudness, and Partial Loudness”, J. Audio Eng. Soc., Vol. 45, No. 4, pp. 224-240, April 1997.
[Moorer 77] Moorer, J.A. “On the Transcription of Musical Sound by Computer”, Computer Music Journal, Nov. 1977.
[Olshausen 96] Olshausen, B. A., and Field, D. J. “Emergence of simple-cell receptive
field properties by learning a sparse code for natural images,” Nature, vol.
381, pp. 607–609, 1996.
[Opolko 87] Opolko, F. and Wapnick, J. “McGill University Master Samples”
(compact disk). McGill University, 1987.
[Orife 01] Orife, I. “Riddim: A rhythm analysis and decomposition tool based on Independent Subspace Analysis”, M.A. Thesis, Dartmouth College, Hanover, NH, USA, 2001.
[Patterson 90] Patterson, R.D. and Holdsworth, J. “A functional model of neural activity patterns and auditory images”, Advances in Speech, Hearing and Language Processing, Vol. 3, ed. W.A. Ainsworth, JAI Press, London, 1990.
[Paulus 03] Paulus J. and Klapuri A. “Conventional and Periodic N-grams in the
Transcription of Drum Sequences”, Proc. of IEEE International
Conference on Multimedia and Expo (ICME03), Baltimore, USA, pp.
737-740, 2003.
[Plumbley 01] Plumbley, M.D. “Adaptive Lateral Inhibition for Non-negative ICA”, Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation (ICA2001), San Diego, California, 2001.
[Schloss 85] Schloss, W.A. “On the Automatic Transcription of Percussive Music – From Acoustic Signal to High Level Analysis”, PhD thesis, Stanford University, 1985.
[Serra 89] Serra, X. “A system for sound analysis/transformation/synthesis based on a deterministic plus stochastic decomposition”, Ph.D. thesis, Stanford University, 1989.
[Seppänen 01] Seppänen, J. “Tatum Grid Analysis of Musical Signals”, Proc. of IEEE
Workshop on Applications of Signal Processing to Audio and Acoustics,
New Paltz, New York, Oct. 21-24, 2001.
[Sillanpää 00] Sillanpää, J., Klapuri, A., Seppänen, J. and Virtanen, T. “Recognition of acoustic noise mixtures by combining bottom-up and top-down processing”, Proc. European Signal Processing Conference, EUSIPCO 2000.
[Sillanpää 00a] Sillanpää, J. “Drum Stroke Recognition”, http://www.cs.tut.fi/sgn/arg/music/drums/raportti.ps
[Shannon 49] Shannon, C.E. and Weaver, W. (ed.) “The Mathematical Theory of Communication”, University of Illinois Press, Urbana, IL, 1949.
[Slaney 96] Slaney, M. “Pattern Playback in the 90s”, Advances in Neural Information
Processing Systems 7, MIT Press, 1996.
[Smaragdis 97] Smaragdis, P. “Information Theoretical Approaches to Source Separation”, Masters Thesis, MIT Media Lab, 1997.
[Smaragdis 01] Smaragdis, P. “Redundancy reduction for computational audition, a unifying approach”, PhD thesis, MIT Media Lab, 2001.
[Smith 99] Smith, L. “A Multiresolution Time-Frequency Analysis And Interpretation
of Musical Rhythm”, PhD thesis, University of Western Australia, 1999.
[Stautner 83] Stautner, J.P. “Analysis and Synthesis of Music using the Auditory
Transform”, Masters Thesis, MIT EECS Department, 1983.
[Stone 99] Stone, J.V. and Porrill, J. “Regularisation Using Spatiotemporal Independence and Predictability”, NIPS 1999.
[Subhash 96] Sharma, S. “Applied Multivariate Techniques”, John Wiley & Sons, 1996.
[Vaseghi 00] Vaseghi, S.V. “Advanced Digital Signal Processing and Noise Reduction”, 2nd ed., John Wiley & Sons Ltd., pp. 270-290, 2000.
[Verma 00] Verma, T. and Meng, T. “Extending Spectral Modelling Synthesis with
Transient Modelling Synthesis”, Computer Music Journal, 24:2 pp. 47-59,
Summer 2000.
[Virtanen 01] Virtanen, T. “Audio Signal Modelling with sinusoids plus noise”, M.Sc.
thesis Tampere University of Technology, 2001.
[Virtanen 01a] Virtanen, T. “Accurate Sinusoidal Model Analysis and Parameter
Reduction by Fusion of Components”, AES 110th convention,
Amsterdam, Netherlands, May 2001.
[Virtanen 01b] Virtanen, T. and Klapuri, A. “Separation of Harmonic Sounds Using Multipitch Analysis and Iterative Parameter Estimation”, WASPAA 2001.
[Virtanen 02] Virtanen, T. and Klapuri, A. “Separation of Harmonic Sounds Using
Linear Models for the Overtone Series”, ICASSP 2002
[Virtanen 03] Virtanen, T. “Sound Source Separation Using Sparse Coding with Temporal Continuity Objective”, Proc. of International Computer Music Conference (ICMC2003), Singapore, 2003.
[Virtanen 03a] Personal communication with the author.
[Viste 02] Viste, H. and Evangelista, G. “An extension for source separation techniques avoiding beats”, Proceedings of the 5th International Conference on Digital Audio Effects, Hamburg, Germany, 2002, pp. 71-75.
[Walmsley 99] Walmsley, P., Godsill, S. and Rayner, P. “Bayesian Modelling of Harmonic Signals for Polyphonic Music Tracking”, Cambridge Music Processing Colloquium, September 1999.
[Walmsley 99a] Walmsley, P., Godsill, S. and Rayner, P. “Polyphonic Pitch Tracking Using Joint Bayesian Estimation of Multiple Frame Parameters”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 1999.
[Walmsley 99b] Walmsley, P., Godsill, S. and Rayner, P. “Bayesian Graphical Models for Polyphonic Pitch Tracking”, Diderot Forum, Vienna, Dec. 1999.