
Piece Identification in Classical Piano Music Without Reference Scores

This paper presents a method for identifying classical piano music pieces from short audio excerpts without requiring reference scores. The approach involves compiling a reference database of performances obtained from online sources, which are then transcribed and processed using a symbolic fingerprinting algorithm to facilitate audio matching. The system automates the creation of the reference database and improves identification accuracy by increasing redundancy and employing a preprocessing step to select suitable performances.


PIECE IDENTIFICATION IN CLASSICAL PIANO MUSIC WITHOUT REFERENCE SCORES

Andreas Arzt, Gerhard Widmer


Department of Computational Perception, Johannes Kepler University, Linz, Austria
Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria
[email protected]

© Andreas Arzt, Gerhard Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Andreas Arzt, Gerhard Widmer. “Piece Identification in Classical Piano Music Without Reference Scores”, 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

ABSTRACT

In this paper we describe an approach to identifying the name of a piece of piano music, based on a short audio excerpt of a performance. Given only a description of the pieces in text format (i.e. no score information is provided), a reference database is automatically compiled by acquiring a number of audio representations (performances of the pieces) from internet sources. These are transcribed, preprocessed, and used to build a reference database via a robust symbolic fingerprinting algorithm, which in turn is used to identify new, incoming queries. The main challenge is the amount of noise that is introduced into the identification process by the music transcription algorithm and the automatic (but possibly suboptimal) choice of performances to represent a piece in the reference database. In a number of experiments we show how to improve the identification performance by increasing redundancy in the reference database and by using a preprocessing step to rate the reference performances regarding their suitability as a representation of the pieces in question. As the results show, this approach leads to a robust system that is able to identify piano music with high accuracy – without any need for data annotation or manual data preparation.

1. INTRODUCTION

Efficient algorithms for content-based audio retrieval enable systems that allow users to browse and explore music collections (see e.g. [10] for an overview). In this context, audio fingerprinting algorithms, which permit the fast identification of an unknown recording (as long as an almost exact replica is contained in the reference database), play an important role. For this task there exist highly efficient algorithms that are in everyday commercial use (see e.g. [3, 6, 13, 15–17]).

However, these algorithms are not able to identify different performances of the same piece of music, as they are not designed to work in the face of musical variations such as different tempi, expressive timing, differences in instrumentation, ornamentation and other performance aspects. Regarding classical music, the identification of performances that derive from a common musical score is of special interest, as in general there exists a large number of performances of the same piece (and new renditions are performed every day).

This task is generally called audio matching (or, mostly in the context of popular music, cover version identification, see e.g. [14]). A common approach to solving this problem is to use an audio alignment algorithm. This is computationally expensive, as it basically involves aligning the query snippet with every position within every audio file in the database (see [12], and [11] for an indexing method that makes the problem more tractable). Furthermore, due to the coarse feature resolution of these algorithms, relatively large query sizes are needed.

As there exist efficient fingerprinting algorithms, it seems natural to try to adapt them to the problem of cover version identification. A first study towards this is presented in [9], where the authors focused on the suitability of different low-level features as a basis for fingerprinting algorithms, but neglected the problem of tempo differences between performances. In [1] an extension to a well-known fingerprinting algorithm [17] is proposed that makes it invariant to the global tempo. With the help of an audio transcription algorithm for piano music (see [5]) a system was built that, given a short audio query, almost instantly returns the corresponding (symbolic) score from a reference database – despite the fact that audio transcription is a very hard problem and thus introduces a lot of noise in the process.

In this paper we show how to use this algorithm in the absence of symbolic scores to identify unknown performances, using a reference database based on other performances of the pieces in question. As symbolic scores are often not readily available, this increases the applicability of this algorithm in real-life systems. The downside of this approach is that audio transcription is now used both for the data contained in the reference database and for the queries, which introduces even more noise. Furthermore, the transcription algorithm we are using is optimised for piano sounds, which for now limits the proposed system to piano music only.

We are going to describe this approach in the context of a system geared towards fully automatic identification of classical piano music, in the sense that even the creation
of the collection of audio recordings, which is needed to perform the identification task, is automated. The motivation for this is to reduce the amount of costly manual annotation to a minimum, and instead make use of available, albeit noisy, web sources like YouTube¹ or Soundcloud². The main challenge in this setting is the noise introduced into the identification process via multiple processes (automatic retrieval of reference performances, audio transcription of reference performances, and audio transcription of the query). In the paper we will show how to deal with this amount of noise by increasing redundancy in the reference database and by an automatic selection strategy for the reference performances.

¹ https://www.youtube.com
² https://soundcloud.com

The paper is structured as follows. Section 2 gives an overview of the proposed system. Then, in Section 3 the data we are using for our experiments is described. Sections 4, 5, 6 and 7 describe the core experiments of the paper, showing that our approach is robust enough to cope with the multiple sources of noise and performs well in our experiments. A brief outlook on possible improvements and applications is given in Section 8.

2. SYSTEM OVERVIEW

In this section we are going to describe the piece identification system that will be used throughout the paper. The main goals of the system are 1) to automate the process of compiling a reference database, thus making manual annotations obsolete, and 2) based on this reference database, to allow for robust and fast piece identification. Figure 1 depicts how the components interact with each other.

The system is based on a Database Definition file, which is a list of pieces that are to be included in the database. On this list each piece is represented by an ID, the name of the composer and the name of the piece, including identifiers like the opus number (see Figure 2 for an excerpt of the list). We would like to emphasise once more that this is the only input our system needs (in addition to a source from which the recordings can be retrieved). All the data necessary to perform the identification task is then prepared automatically. This also means that extending the database is as easy as adding a new line to the text file, describing the new piece. The data in this file also defines the granularity of the database. For example, movements of a sonata could be represented as individual pieces or combined as a single piece – for our experiments we took the latter approach. For our proof-of-concept implementation we settled for 339 piano pieces by well-known composers (Mozart, Beethoven, Chopin, Scriabin, and Debussy), which already represents a substantial share of the classical piano music repertoire.

A Web Crawler takes this list of pieces and retrieves audio recordings of performances of the pieces. In our case we use a simple crawler for YouTube (an alternative would be to use Soundcloud, amongst others). The queries are constructed by concatenating the name of the composer and the piece, and adding the word “piano”, to ensure that mainly piano performances are returned.

Next, the collected recordings are fed into a Music Transcription Algorithm that takes the audio files and transcribes them into series of symbolic events. For this step we rely on a well-known neural-network-based method presented in [5], more specifically the version that is available as part of the Madmom library [4]. As input it takes a series of preprocessed and filtered STFT frames with two different window lengths. The neural network consists of a linear input layer with 324 units, three bidirectional fully connected recurrent hidden layers with 88 units, and a regression output layer with 88 units, which directly represent the MIDI pitches. The output of the transcription algorithm is a list of detected musical events, represented by their pitches and start times. For details we refer the reader to [5]. This algorithm exhibits state-of-the-art results for the task of piano transcription, as was demonstrated at MIREX 2014³. Still, polyphonic music transcription is a very hard problem, and thus the output of this transcription algorithm contains a relatively large amount of noise, to which the following components need to be robust.

³ http://www.music-ir.org/mirex/wiki/2014:MIREX2014_Results

The Automatic Preprocessing step is concerned with the question of which of the downloaded recordings for each piece should be used in our fingerprint database. In this paper we discuss three setups: take the top match returned by the web crawler (see Section 4), take the top five or fifteen matches returned by the web crawler (see Section 5), and download 30 recordings for each piece, rank them automatically by comparing them to each other and use the top recordings identified via this approach (see Section 6). This means that in the latter two experiments a single piece is represented by multiple recordings, adding redundancy to the reference database.

The transcribed sequences of symbolic event information, i.e. sequences of pairs (pitch, onset time), are fed to the Tempo-invariant Symbolic Fingerprinter, to build a database of fingerprints that later on can be used to identify queries. The algorithm is used as described in [1], thus it will be summarised here only very briefly. The principal idea of the fingerprinting algorithm is to represent an instance (in this case a transcribed performance, representing a piece) via a large number of local, tempo-invariant fingerprint tokens. These tokens are created based on the pitches of three temporally local note events, together with the ratio of their distances in time. Due to the way they are created, the tokens are invariant to the global tempo, and can be stored in a hash table and efficiently queried for.

An incoming Query is processed in the same way as above by the Music Transcription Algorithm. The resulting sequence of symbolic events is used to query the Tempo-invariant Symbolic Fingerprinter for matches. To do so, the same kind of fingerprint tokens are computed from the query, and matching tokens are retrieved from the fingerprint database. Finally, in this result set continuous sequences of matching tokens, which are a strong indication that the query matches a specific part of a piece stored in the fingerprint database, are identified (via a fast, histogram-based approach).

Figure 1. System Overview. [Diagram. Reference path: Database Definition (list of pieces, text); Web Crawler (crawl web source for audio recordings, e.g. YouTube); Music Transcription Algorithm (transcribe recordings, i.e. performances of pieces); Automatic Preprocessing (automatically identify suitable performances); Tempo-invariant Symbolic Fingerprinter. Query path: Query (audio snippet of an unseen performance of a piece); Music Transcription Algorithm (transcribe query); Tempo-invariant Symbolic Fingerprinter; Query Results (name of the piece corresponding to the query).]

    ID; Composer; Piece
    ...
    17; Mozart; Piano Sonata No. 17 in B-flat major K 570
    18; Mozart; Piano Sonata No. 18 in D major K 576
    19; Mozart; Fantasy No. 1 with Fugue in C major K 394
    20; Mozart; Fantasy No. 2 in C minor, K 396
    ...
    41; Beethoven; Piano Sonata No. 14, Op. 27, No. 2 "Moonlight"
    42; Beethoven; Piano Sonata No. 15, Op. 28 "Pastoral"
    ...
    168; Chopin; Mazurka Op. 7 No. 5 in C major
    169; Chopin; Nocturne Op. 15 No. 1 in F major
    170; Chopin; Nocturne Op. 15 No. 2 in F-sharp major
    171; Chopin; Nocturne Op. 15 No. 3 in G minor
    ...
    281; Debussy; L 113, Children's Corner, Doctor Gradus ad Parnassum
    282; Debussy; L 113, Children's Corner, Jimbo's Lullaby
    ...
    332; Scriabin; Piano Sonata No. 3, Op. 23
    333; Scriabin; Piano Sonata No. 4, Op. 30
    ...

Figure 2. An excerpt of the file used for collecting the database.

The Query Result is a list of positions within the reference performances that were inserted into the database (see Table 1). The positions in the result set are ordered by their number of tokens matching the query. As can be seen, the result set is actually more detailed than necessary for our application scenario, as we are only interested in identifying the respective piece, and not a specific reference performance (or even a position within a reference performance). Thus for the experiments in this paper we summarise all occurrences of a piece into one score by summing up the matching scores of all its occurrences in the result set.

    Piece ID    Performance ID    Time in Ref.    Score
    1           0                 99              351
    1           0                 21              292
    1           4                 16              109
    1           4                 15              36
    1           4                 148             36
    1           4                 150             32
    10          48                368             7
    1           0                 239             7

Table 1. An example of a result returned by the fingerprinting algorithm. This query was performed on a database in which multiple reference performances represent a piece of music, hence for the piece with ID 1 results for two performances are returned. The score is the number of matching fingerprint tokens for the given query at the specific time in the reference recording. For our purposes we summarise the results per piece, i.e. the matching score for the piece with ID 1 is 863, and for the piece with ID 10 it is 7.

3. GROUND TRUTH DATA AND EXPERIMENTAL SETUP

For the experiments presented in this paper, ground truth data, i.e. performances for which the composer and the name of the piece are known, is needed. We are using commercial recordings of a large part of the pieces contained in our database. This includes e.g. Uchida’s recordings of the Mozart Sonatas, Brendel’s recordings of the Beethoven Sonatas, Chopin recordings by Arrau, Pires and Pollini, and Debussy recordings by Pollini, Thibaudet and Zimerman. We would like to emphasise that, to get realistic results, in our experiments we made sure manually that no exact replicas of these performances are contained in the automatically downloaded data that is used to build the reference database later on. In total 370 tracks were selected and assigned manually to the respective pieces (roughly 30 hours of music, or 665 000 transcribed events). Some of the tracks were assigned to the same piece, as e.g. the movements of the sonatas are typically represented as different audio tracks, but are represented as a single piece in our database.

The experimental setup is as follows. We are going to use the same set of randomly extracted queries for each experiment. We are using three query lengths of 2, 5 and 10 seconds (we only took queries which had at least 10 transcribed notes, to avoid e.g. querying for silence), and extract for each length ten queries for each ground truth performance (giving a total of 3 700 queries for each query length). The experiments are based on different strategies to automatically compile the reference database. We start with a simple baseline approach (Section 4) and then gradually improve on it by introducing redundancy and a selection strategy (Sections 5 to 7).

As evaluation measure we use the Recall at Rank k⁴.

⁴ We would like to note that the related measure Precision at Rank k is not useful in our experimental setup, as there will only be at most one correct result in the result set.
This is the proportion of queries which have the correct corresponding piece in the first k retrieval results. In our experiments we look at the recall at ranks 1, 5 and 10. In addition, we also report the Mean Reciprocal Rank (MRR).

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \tag{1}$$

Here, $\mathrm{rank}_i$ refers to the rank position of the correct result for the $i$th query.

The mean query times (i.e. the mean time it takes to process a single query) given in the tables are based on a desktop computer, using a single core⁵. If needed, the computation could easily be sped up by multi-threading the query process.

⁵ Intel Core i7 6700K 4 GHz with 32 GB RAM.

4. BASELINE APPROACH

The baseline approach is very straightforward. The web crawler is used to download the top result from the web source for each piece on the list. The downloaded audio files are transcribed and then processed by the fingerprinting algorithm to build the reference database, i.e. in the reference database each piece is represented by one performance. Note that due to the automatic process the database can be quite noisy, as some of the pieces might be incomplete (e.g. only a single movement of a piece), represented by more than the actual piece (if e.g. the performance downloaded for the piece also contains other pieces, like a recording of a full concert), or represented wrongly (if the top result of the web crawler is actually a performance of some other piece).

The generated fingerprint database is queried via the prepared excerpts of the collected ground truth data (see Section 3). The results of this first experiment can be seen in Table 2. As can be seen, already in this scenario and despite the small query sizes the method gives reasonable results. For queries of length ten seconds the algorithm returns the correct name of the piece in close to 50% of the cases. A closer look at the results, though, showed that the main problem with this simplistic approach is that, as expected, for many pieces the representation in the database is incorrect or incomplete. This problem is tackled in the following sections.

                             Query Length
                             2 s       5 s       10 s
    Recall at Rank 1         0.28      0.38      0.46
    Recall at Rank 5         0.34      0.45      0.54
    Recall at Rank 10        0.35      0.47      0.55
    Mean Reciprocal Rank     0.30      0.41      0.48
    Mean Query Time          0.13 s    0.41 s    0.92 s

Table 2. Results of the baseline approach. The results are based on 3 700 queries for each query length.

5. USING MULTIPLE INSTANCES PER PIECE

A simple way to improve the performance of the system is to increase the redundancy within the reference database. Instead of relying on a single instance (recording) for each piece in the reference database, each piece is represented by multiple recordings. For the first experiment five performances per piece were downloaded using the web crawler. The performances were processed in the same way as for the baseline approach in Section 4 above and inserted into the fingerprint database. Then, on this database the same set of queries was performed. As described in Section 2, the match score of a piece is computed by summing up the scores of the performances representing the piece in question (also see Table 1).

Table 3 shows the results of this experiment. As can be seen, the increased redundancy leads to a substantial improvement in identification results, compared to the baseline (see Table 2). The added redundancy increases the chances that for each piece at least one “good” performance (in the sense of corresponding to the piece and being relatively easy to transcribe) is contained in the reference database, and thus mitigates the problems caused by noise, at least to some extent.

                             Query Length
                             2 s       5 s       10 s
    Recall at Rank 1         0.58      0.69      0.74
    Recall at Rank 5         0.72      0.84      0.90
    Recall at Rank 10        0.74      0.86      0.92
    Mean Reciprocal Rank     0.64      0.77      0.84
    Mean Query Time          0.34 s    0.81 s    2.49 s

Table 3. Results on the reference database based on multiple recordings (the top five results according to the web source) to represent each piece. The results are based on 3 700 queries for each query length.

For an additional experiment we increased the number of performances to fifteen per piece. These results are shown in Table 4. This improved the results even further. The downside of adding more instances to the fingerprint database is a significant increase in computation time.

                             Query Length
                             2 s       5 s       10 s
    Recall at Rank 1         0.76      0.87      0.91
    Recall at Rank 5         0.84      0.94      0.97
    Recall at Rank 10        0.86      0.95      0.98
    Mean Reciprocal Rank     0.80      0.90      0.94
    Mean Query Time          0.82 s    2.85 s    6.08 s

Table 4. Results on the reference database based on multiple recordings (the top fifteen results according to the web source) to represent each piece. The results are based on 3 700 queries for each query length.
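As an illustration of the per-piece scoring described in Section 2 and the evaluation measures from Section 3, the following minimal Python sketch sums the match scores of Table 1 per piece and computes Recall at Rank k and the MRR of Equation (1). The function names and the tuple layout are hypothetical and not part of the described system; treating a query whose correct piece is missing from the result set as contributing 0 to the MRR is an assumption of this sketch.

```python
from collections import defaultdict

# Rows of Table 1: (piece_id, performance_id, time_in_reference, score).
TABLE_1 = [
    (1, 0, 99, 351),
    (1, 0, 21, 292),
    (1, 4, 16, 109),
    (1, 4, 15, 36),
    (1, 4, 148, 36),
    (1, 4, 150, 32),
    (10, 48, 368, 7),
    (1, 0, 239, 7),
]


def scores_per_piece(matches):
    """Sum the match scores of all positions and performances of each piece.

    Returns (piece_id, total_score) pairs, best match first.
    """
    totals = defaultdict(int)
    for piece_id, _performance_id, _time, score in matches:
        totals[piece_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def rank_of(correct_piece, ranked_pieces):
    """1-based rank of the correct piece, or None if it was not retrieved."""
    for position, (piece_id, _score) in enumerate(ranked_pieces, start=1):
        if piece_id == correct_piece:
            return position
    return None


def recall_at_k(ranks, k):
    """Fraction of queries whose correct piece is among the first k results."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Equation (1); a missing result contributes 0 (assumption of this sketch)."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


ranking = scores_per_piece(TABLE_1)
print(ranking)              # [(1, 863), (10, 7)], as stated in the caption of Table 1
print(rank_of(1, ranking))  # 1
```

Collecting one rank per query (via `rank_of`) and passing the list to `recall_at_k` and `mean_reciprocal_rank` would yield the kind of figures reported in the result tables.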
6. AUTOMATICALLY SELECTING SUITABLE REPRESENTATIONS

A closer look at the results so far shows that increasing the redundancy in the reference database indeed leads to better results, but also increases the computation time. The main problem with our approach is that in addition to useful data, the process also adds a lot of extra noise to the fingerprint database. The web crawler returns a considerable number of performances of the wrong piece, performances played on a different instrument, and performances recorded in very bad quality. This kind of data increases the runtime and decreases the identification accuracy. In this section we present a method for identifying performances in a given set of candidates for a piece that most probably are related to the piece in question, which also enables us to discard performances that most probably are noise. In this way we try to reduce the number of stored fingerprint tokens, which generally decreases the computation time, while still achieving good identification performance.

Thus, for each piece we perform the following process to select appropriate representations. First, 30 recordings are downloaded via the web crawler. With a high probability at least some of these are actually piano performances of the piece we are looking for, while the others might have nothing in common. The idea now is to find a homogeneous group within this set of candidates. To identify performances which are part of this group, we again employ the symbolic fingerprinting process, but limited to the set of candidate performances. To do so, the performances are transcribed and inserted into a new fingerprint database. The intuition is that for a query extracted from the same set of candidate performances (that actually matches the piece), the fingerprinter will likely return three kinds of results. Firstly, the top result will be the performance the query was taken from. This is a perfect fit for all tokens, which results in the maximum score. Secondly, a number of other performances will probably also have a high score, identifying them as being based on the same piece and as being transcribed in sufficient quality. Thirdly, performances that actually belong to a different piece, or which are transcribed poorly, will score very low.

Based on these observations, we designed the process of ranking the performances regarding their suitability to represent the piece in question as follows. For each of the performances ten queries are randomly extracted (for our experiments we used a query length of ten seconds) and processed by the fingerprinting algorithm. As in all other experiments, the results are summarised on the performance level (i.e. match scores of positions within the same performance are summed up). Then, for each result the score of the top match (i.e. of the performance the query stems from) is stored, this performance is removed from the result set, and the remaining matching scores are normalised by dividing by the top match score. The reasoning behind this is that the absolute scores depend on the particulars of the query (foremost the length in the sense of the number of notes, but also e.g. whether the part in question is normally played in a steady tempo or is subject to expressive tempo changes, which makes it harder to detect and leads to a lower score).

This results in 300 preprocessed and normalised result sets (ten queries for each of the 30 candidate performances). The suitability of a performance to represent the piece in question is computed by summing up all the scores of all its occurrences in the result sets. The higher this value is for a performance, the more it has in common with the other performances assigned to the piece in question.

Based on this ranking we repeat the experiments from Sections 4 and 5, but this time for each piece we select the top one or top five performances, respectively, according to the computed rank within the candidate set for each piece. The results are shown in Tables 5 and 6, which should be compared to Tables 2 and 3, respectively. As can be seen, the selection strategy increases the identification performance for both scenarios and for all query lengths.

                             Query Length
                             2 s       5 s       10 s
    Recall at Rank 1         0.54      0.68      0.74
    Recall at Rank 5         0.63      0.76      0.83
    Recall at Rank 10        0.64      0.78      0.85
    Mean Reciprocal Rank     0.58      0.72      0.78
    Mean Query Time          0.14 s    0.47 s    0.97 s

Table 5. Results on the reference database based on the top recording selected via the proposed strategy to represent each piece. The results are based on 3 700 queries for each query length.

                             Query Length
                             2 s       5 s       10 s
    Recall at Rank 1         0.72      0.85      0.89
    Recall at Rank 5         0.82      0.92      0.96
    Recall at Rank 10        0.84      0.93      0.97
    Mean Reciprocal Rank     0.77      0.88      0.92
    Mean Query Time          0.49 s    1.71 s    3.83 s

Table 6. Results on the reference database based on multiple recordings (top five recordings selected via the proposed strategy) to represent each piece. The results are based on 3 700 queries for each query length.

A comparison of Tables 6 and 4 shows that by using the proposed selection strategy a lower number of performances (5 versus 15) is sufficient to achieve comparable identification accuracy. The decreased number of tokens also results in roughly half the computation time.

The runtime actually depends on a number of factors, most importantly the size of the fingerprint database. But of similar influence is the actual number of tokens that are returned by the fingerprint database for a specific query. The reason is that each of these tokens has to be processed individually to come up with the matching score. This also means that queries for pieces which are represented in the
database by a large number of performances will actually take longer to compute – a further argument in favour of the selection strategy presented in this section.

7. USING MULTIPLE QUERIES PER PERFORMANCE

So far the assumption was that we only have access to a single short query of two to ten seconds. If instead we have access to a full recording, just querying for one short excerpt would be a suboptimal approach. Thus, we tried an additional query strategy on the reference database based on the performance selection strategy from Section 6 above.

A standard approach for processing long queries (in this case a whole performance) would be to apply shingling [2, 7, 8], i.e. splitting longer queries into shorter, overlapping ones and tracking the results of these sub-queries over time. Here, as proof of concept we use an even simpler method: we select ten random queries from the piece we want to identify, process them individually and sum up the results. This can be seen as adding redundancy (relying on multiple queries instead of a single one) on the query side. We perform this experiment on the reference database based on the top five recordings selected via the proposed strategy. The results are shown in Table 7. As can be seen, this again considerably improves the results, and we are getting very close to 100%. The main cause for this is that the retrieval precision heavily depends on the quality of the transcription. Some parts of a performance are much harder to transcribe than others (e.g. heavily polyphonic parts with a lot of sustain pedal, which are difficult to transcribe correctly). Using multiple queries, randomly distributed over the whole performance, increases the chances that at least some parts are transcribed in good quality, and that together these queries enable high retrieval accuracy.

                             Query Length
                             2 s       5 s       10 s
    Recall at Rank 1         0.92      0.95      0.95
    Recall at Rank 5         0.98      0.99      0.99
    Recall at Rank 10        0.99      1         1
    Mean Reciprocal Rank     0.94      0.97      0.97
    Mean Query Time          0.49 s    1.71 s    3.83 s

Table 7. Results for querying for a whole performance via ten random small queries with ten seconds each. The results are based on 3 700 queries for each query length.

Finally, we had a closer look at the few performances that were still misclassified and identified two problems. Our approach does not take care of the problem of recordings of full concerts. If included in the reference database for multiple pieces, these will lead to misclassifications. Furthermore, for some pieces only a small number of performances exists, which causes the crawler to return “similar” but wrong performances (e.g. performances of other pieces by the same composer). We sketch a possible solution to these problems in Section 8 below.

8. CONCLUSIONS AND FUTURE WORK

In this paper we presented an approach towards piece identification for performances of piano music, based on an automatically compiled reference database using web sources. It is shown that the symbolic fingerprinting [...]

the automatic compilation of the reference database. Additionally, this increases the robustness of the identification process via the fingerprinting algorithm, as “problematic” sections (e.g. regarding the transcription process) are represented multiple times, thus increasing the chances that the parts in question are well covered by the reference database.

There exist a number of possible improvements regarding the automatic selection of performances for a piece. In our implementation the focus is on increasing the homogeneity within the group of performances for a piece by comparing them to each other. An additional option is to analyse matches on the full reference database and try to find out which performances match well to multiple pieces and exclude them (as they cover multiple pieces or were mistakenly assigned to multiple pieces by the crawler).

We are currently in the process of collecting a much larger collection of classical piano music. This dataset will contain a few thousand pieces, covering a large part of the classical piano repertoire⁶. On this dataset we are going to conduct experiments regarding the scalability of our approach in terms of runtime and retrieval accuracy.

In the future, we will also investigate the usefulness of the presented approach for non-classical piano music. Preliminary experiments have shown that this is a much harder task, as compared to classical piano music the pieces are not as strictly defined via a detailed score (e.g. popular songs and jazz standards are mostly described via lead sheets). Thus, performances of the same piece differ more heavily than in classical music. Of course we would also like to lift the restriction to piano music and try our method on other genres, but thus far general music transcription is not robust enough to be used with our approach. Hopefully this will change in the future.

Finally, regarding real-world applications, an automatic method to determine which pieces are well covered by the database, and which ones would benefit from manual intervention, would be desirable. This would help to quickly build a reference database which already covers
method is robust enough to deal with the noise introduced most pieces well, and then to manually add additional ref-
by the transcription algorithms and allows for fast query- erences (based on performances, or even on symbolic score
ing in the symbolic domain. Furthermore, increasing the data) for pieces the identification algorithm struggles with.
redundancy by using multiple performances to represent a 6 The reference database is of course compiled automatically (based
single piece, especially using the proposed selection strat- on the list of pieces), but the preparation of the ground truth for the ex-
egy, largely alleviates the problem of noise introduced by periments is a time consuming, manual process.
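
As an illustration of the query-side redundancy used in Section 7 (summing the results of several random queries taken from one performance), a minimal Python sketch is given here. It is not the implementation used in the experiments. The `match` callable is a hypothetical stand-in for the symbolic fingerprint lookup, and the names `random_excerpts` and `identify_by_summing`, the note representation, and the choice to measure `query_len` in notes rather than seconds are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed note representation: (onset time in seconds, MIDI pitch).
Note = Tuple[float, int]
# Hypothetical matcher: maps a query excerpt to a score per piece id.
Matcher = Callable[[Sequence[Note]], Dict[str, float]]


def random_excerpts(notes: Sequence[Note], n_queries: int,
                    query_len: int, rng: random.Random) -> List[List[Note]]:
    """Draw n_queries contiguous excerpts of query_len notes at random offsets."""
    if len(notes) <= query_len:
        # Performance is too short to sample from: use it as a single query.
        return [list(notes)]
    last_start = len(notes) - query_len
    starts = [rng.randint(0, last_start) for _ in range(n_queries)]
    return [list(notes[s:s + query_len]) for s in starts]


def identify_by_summing(notes: Sequence[Note], match: Matcher,
                        n_queries: int = 10, query_len: int = 50,
                        seed: int = 0) -> List[Tuple[str, float]]:
    """Process each random query individually and sum the scores per piece.

    Returns (piece_id, summed_score) pairs, best candidate first.
    """
    rng = random.Random(seed)
    totals: Counter = Counter()
    for excerpt in random_excerpts(notes, n_queries, query_len, rng):
        for piece_id, score in match(excerpt).items():
            totals[piece_id] += score
    return totals.most_common()
```

The first entry of the returned list is the identified piece. A query that falls on a badly transcribed passage contributes little to any piece, so it is outvoted by queries from passages that were transcribed well, which is the effect described in Section 7.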
9. ACKNOWLEDGEMENTS

This work is supported by the European Research Council (ERC Grant Agreement 670035, project CON ESPRESSIONE).

10. REFERENCES

[1] Andreas Arzt, Sebastian Böck, and Gerhard Widmer. Fast identification of piece and score position via symbolic fingerprinting. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 433–438, Porto, Portugal, 2012.

[2] Andreas Arzt, Gerhard Widmer, and Reinhard Sonnleitner. Tempo- and transposition-invariant identification of piece and score position. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 549–554, Taipei, Taiwan, 2014.

[3] Shumeet Baluja and Michele Covell. Waveprint: Efficient wavelet-based audio fingerprinting. Pattern Recognition, 41(11):3467–3480, 2008.

[4] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. madmom: a new Python Audio and Music Signal Processing Library. In Proceedings of the 24th ACM International Conference on Multimedia, pages 1174–1178, Amsterdam, The Netherlands, 2016.

[5] Sebastian Böck and Markus Schedl. Polyphonic piano note transcription with recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 121–124, Kyoto, Japan, 2012.

[6] Pedro Cano, Eloi Batlle, Ton Kalker, and Jaap Haitsma. A review of algorithms for audio fingerprinting. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing (MMSP), pages 169–173, St. Thomas, Virgin Islands, USA, 2002.

[7] Michael A. Casey and Malcolm Slaney. Song intersection by approximate nearest neighbor search. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 144–149, Victoria, Canada, 2006.

[8] Peter Grosche and Meinard Müller. Toward characteristic audio shingles for efficient cross-version music retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, 2012.

[9] Peter Grosche and Meinard Müller. Toward musically-motivated audio fingerprints. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 93–96, Kyoto, Japan, 2012.

[10] Peter Grosche, Meinard Müller, and Joan Serrà. Audio content-based music retrieval. In Meinard Müller, Masataka Goto, and Markus Schedl, editors, Multimodal Music Processing, volume 3 of Dagstuhl Follow-Ups, pages 157–174. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2012.

[11] Frank Kurth and Meinard Müller. Efficient index-based audio matching. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):382–395, 2008.

[12] Meinard Müller, Frank Kurth, and Michael Clausen. Audio matching via chroma-based statistical features. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 288–295, London, UK, 2005.

[13] Mathieu Ramona and Geoffroy Peeters. Audioprint: an efficient audio fingerprint system based on a novel cost-less synchronization scheme. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 818–822, Vancouver, Canada, 2013.

[14] Joan Serrà, Emilia Gómez, and Perfecto Herrera. Audio cover song identification and similarity: background, approaches, evaluation and beyond. In Z. W. Ras and A. A. Wieczorkowska, editors, Advances in Music Information Retrieval, volume 274 of Studies in Computational Intelligence, chapter 14, pages 307–332. Springer, Berlin, Germany, 2010.

[15] Joren Six and Marc Leman. Panako - a scalable acoustic fingerprinting system handling time-scale and pitch modification. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 259–264, Taipei, Taiwan, 2014.

[16] Reinhard Sonnleitner and Gerhard Widmer. Robust quad-based audio fingerprinting. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(3):409–421, 2016.

[17] Avery Wang. An industrial strength audio search algorithm. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 7–13, Baltimore, Maryland, USA, 2003.
