
Modelling Emotional Expression in Music Using
Interpretable and Transferable Perceptual Features

Doctoral Thesis
to obtain the academic degree of
Doktor der technischen Wissenschaften
in the Doctoral Program
Technische Wissenschaften

Submitted by
Shreyan Chowdhury

Submitted at
Institute of Computational Perception

Supervisor and First Evaluator
Gerhard Widmer

Second Evaluator
Peter Flach

September 2022

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696
Shreyan Chowdhury: Modelling Emotional Expression in Music Using Interpretable
and Transferable Perceptual Features, © September 2022
ABSTRACT

Emotional expression is one of the most important elements underlying humans’


intimate relationship with music, and yet, it has remained one of the trickiest
attributes of music to model computationally. Its inherent subjectivity and context
dependence render most machine learning methods unreliable outside a very
narrow domain. Practitioners find it hard to gain confidence in the models they
train, which makes deploying these models to user-facing applications (such as
recommendations that drive modern digital streaming platforms) problematic.
One approach to improving trust in models is through the path of explainability.
Looking specifically at deep end-to-end music emotion models, a fundamental
challenge that one faces is that it is not clear how the explanations for such
models might make sense to humans – are they even musically meaningful in
any way? We know that humans perceive music across multiple semantic levels
– from individual sonic events and sound texture to overall musical structure.
Therein lies the motivation for making explanations meaningful using features
that represent an intermediate level of musical perception.
This thesis focuses on mid-level perceptual features and their use in modelling
and explaining musical emotion. We propose an explainable bottleneck model
architecture and show that mid-level features provide an intuitive and effective
feature space for predicting perceived emotion in music, as well as explaining
music emotion predictions (“Perceive”). We further demonstrate how we can
extend these explanations by using interpretable components from the audio input
to explain the mid-level feature values themselves, thereby tracing the predictions
of a model back to the input (“Trace”). Next, we use mid-level features to tackle
the elusive problem of modelling subtle expressive variations between different
interpretations/performances of a set of piano pieces. However, given that the
original dataset for learning mid-level features contains few solo piano music clips,
a model trained on it cannot be transferred to piano music directly. To bridge
this gap, we propose an unsupervised domain adaptation pipeline to adapt our model
for solo piano pieces (“Transfer”). Compared to other feature sets, we find that
mid-level features are better suited to model performance-specific variations in
emotional expression (“Disentangle”). Finally, we provide a direction for future
research in mid-level feature learning by augmenting the feature space with
algorithmic analogues of perceptual speed and dynamics, two features that are
missing in the present formulation and datasets, and use a model incorporating
these new features to demonstrate emotion prediction on a recording of a well-
known musician playing and modifying a melody according to specific intended
emotions (“Communicate”).

ACKNOWLEDGMENTS

This thesis would not have been possible without the support of several people.
First of all, I am deeply grateful to my supervisor Gerhard Widmer for his
mentorship and guidance throughout my time as a PhD candidate at the Institute
of Computational Perception. Thank you for teaching me how to ask the right
research questions, how to translate vague ideas into concrete steps, and how to
properly communicate scientific research.
I would also like to thank my second evaluator, Prof. Peter Flach, for taking the
time and effort to review this thesis.
My PhD journey at the institute was exciting and enjoyable, and I have my
friends and colleagues to thank for this. Thanks to Verena Praher for the many
discussions and collaborations, and for all the fun we had during conferences;
Andreu Vall for helping me with my very first machine learning experiments,
and for all the conversations about life and work; Florian Henkel for helping me
plan my defence, and for all the conversations about music and guitar; Khaled
Koutini for helping me with my numerous machine learning questions; Carlos
Cancino-Chacón, Silvan Peter, and Hamid Eghbal-zadeh for helping me with
my research and allowing me to brainstorm ideas; Lukáš Martak for keeping
the music alive and for the various jam sessions; Rainer Kelz for the many
philosophical mini-discussions over lunch; Luís Carvalho for being an amazing
office mate; Alessandro Melchiorre for the Easter lunches, for the puzzles, and
for always having something fun to do; and Charles Brazier for being the life of
the party, in all parties. Thanks also to Andreas Arzt, Matthias Dorfer, Harald
Frostel, and Filip Korzeniowski, who helped me immensely when I began my
PhD and made me feel at home.
I am also grateful to Claudia Kindermann for all the administrative help and
support she provided me that made my life in Linz easier.
I feel lucky to be able to call a bunch of amazing people my best friends –
Ankit, Reha, Ritika, Vaishnavi, and Zain. I could write an entire book about
you folks, but for now, I will let a simple “thank you” convey my gratitude.
I also feel lucky to have met Vishnupriya in Linz – thank you for all the jam
sessions, conversations, lunches, dinners, and hikes, and for being one of my
closest friends. Thanks also to Venkat, who I feel fortunate to have known since
my undergraduate days, and who has always offered his friendship and support
in the sincerest ways. I would also like to thank Ashis Pati, who I have always
looked up to as a musician, as a researcher, and as a human being.
I am eternally grateful for the unconditional love and support of my family –
my brother Ryan, my parents, my grandparents, my aunt, and my cousin Ritwika.
They have been my anchor always. I also thank the Sakharwade family for their
support and encouragement.

Lastly, I would like to thank my partner Nitica. Without your support, this PhD
would surely not have been possible. Thank you for your unwavering belief in
me, and for your never-ending encouragement and motivation that has kept me
going. A special thanks for your thorough proofreading of this thesis – nothing
evades your sharp eye.

The research reported in this thesis has been carried out at the Institute of
Computational Perception (Johannes Kepler University Linz, Austria) and has
been funded by the European Research Council (ERC) under the European
Union’s Horizon 2020 research and innovation programme, grant agreements No.
670035 (project “Con Espressione”) and 101019375 (“Whither Music?”).

Artwork created by the author of this thesis, with a little help from AI.

The text-to-image model DALL-E [154] was used to generate reference images, which
were then used as inspiration by the author for creating this digital painting.

DALL-E was given the following prompts:


Detailed digital art of a person playing music and communicating emotions
Detailed digital art of a robot listening to music and discovering emotions

CONTENTS

1 this thesis in a nutshell 1


1.1 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

i background
2 a primer on music emotion recognition 10
2.1 Perceived, Induced, and Intended Emotions . . . . . . . . . . . . . 11
2.2 Approaches to Music Emotion Recognition . . . . . . . . . . . . . . 12
2.3 Emotion Taxonomies . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Challenges in Music Emotion Recognition . . . . . . . . . . . . . . 19
3 a primer on explainability in machine learning 21
3.1 Defining Explainability/Interpretability . . . . . . . . . . . . . . . . 22
3.2 Interpreting a Linear Regression Model . . . . . . . . . . . . . . . . 24
3.3 Interpreting Black-box Models Using LIME . . . . . . . . . . . . . . 26
3.4 Evaluation of Feature-based Explanations . . . . . . . . . . . . . . . 27
3.5 Explainability in Music Information Retrieval . . . . . . . . . . . . 29

ii main work of this thesis


4 perceive: predicting and explaining music emotion using
mid-level perceptual features 32
4.1 A Hierarchical View of Music Perception . . . . . . . . . . . . . . . 34
4.2 The Mid-level Bottleneck Architecture . . . . . . . . . . . . . . . . . 39
4.3 Data: Mid-level and Emotion Ratings . . . . . . . . . . . . . . . . . 42
4.4 Model Training and Evaluation . . . . . . . . . . . . . . . . . . . . . 46
4.5 Obtaining Explanations . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . 54
5 trace: two-level explanations using interpretable input
decomposition 57
5.1 The Unexplained Part of a Bottleneck Model . . . . . . . . . . . . . 58
5.2 Going Deeper using Two-Level Explanations . . . . . . . . . . . . . 59
5.3 Local Interpretable Model-agnostic Explanations (LIME) . . . . . . 61
5.4 Explanations via Spectrogram Segmentation . . . . . . . . . . . . . 62
5.5 Explanations using Sound Sources . . . . . . . . . . . . . . . . . . . 68
5.6 Model Debugging: Tracing Back Model Bias to Sound Sources . . . 71
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 transfer: mid-level features for piano music via domain
adaptation 78
6.1 The Domain Mismatch Problem . . . . . . . . . . . . . . . . . . . . 79
6.2 Domain Adaptation: What is it? . . . . . . . . . . . . . . . . . . . . 81
6.3 Visualising the Domain Shift . . . . . . . . . . . . . . . . . . . . . . 85


6.4 Bridging the Domain Gap . . . . . . . . . . . . . . . . . . . . . . . . 86


6.5 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7 disentangle: emotion in expressive piano performance 101
7.1 The Data: Bach’s Well-Tempered Clavier . . . . . . . . . . . . . . . . . 102
7.2 Feature Sets for Emotion Modelling . . . . . . . . . . . . . . . . . . 104
7.3 Feature Evaluation Experiments . . . . . . . . . . . . . . . . . . . . 107
7.4 Probing Further . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.5 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . 119
8 communicate: decoding intended emotion via an augmented
mid-level feature set 120
8.1 Augmenting the Mid-level Feature Space . . . . . . . . . . . . . . . 121
8.2 Decoding and Visualising Intended Emotion . . . . . . . . . . . . . 125
9 conclusion and future work 132
9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

iii appendix
a datasets used in this thesis 136
a.1 The Mid-level Features Dataset . . . . . . . . . . . . . . . . . . . . . 136
a.2 The Soundtracks Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 137
a.3 The PMEmo Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
a.4 The MAESTRO Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 139
a.5 The DEAM Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
a.6 The Con Espressione Dataset . . . . . . . . . . . . . . . . . . . . . . 140

bibliography 141
1 THIS THESIS IN A NUTSHELL

1.1 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Musicality is a uniquely human trait. It is theorised that the emergence of music


in humans preceded the development of syntactically guided language. Music
has enabled humans to form larger and closer-knit social circles, aiding in social
cohesion and cooperation. The social and psychological reward of production
of, participation in, and perception of music has fundamentally shaped human
evolution: not only cultural evolution but even biological evolution, through feedback effects
of cultural inventions [157]. One of the deepest ways music impacts humans
psychologically is by its capacity to communicate and influence human emotion
[13, 31, 97].
This characteristic of music – to communicate emotion – is universal and is
said to be the primary reason humans engage with it. It is no surprise that music
has been regarded as the “language of emotions” [46]. In a modern-day setting,
while the experience of musical engagement has been augmented enormously
through technology, the emotional component has not been lost. For instance,
music streaming platforms have started moving in the direction of emotion-based
playlist recommendation as this approach is shown to benefit personalisation
even in very specific scenarios like recommending motivating songs for a running
playlist [11]. We have also seen in recent times the growth of professions like
music therapy, which uses music’s mood-regulating capability to reduce stress
and improve mood and self-expression. Yet, despite perception and expression
of emotion through music being such a fundamental and intuitive aspect of
human existence, music technology systems still utilise only rudimentary forms of
perceptual or emotion-based algorithms. There is growing recognition of the fact
that present algorithms that analyse, curate, or generate music (or facilitate any
sort of musical interaction with or among humans) are hitting a glass ceiling due
to this gap. In order to develop systems that are more in tune with human musical
intuition, one way forward is to equip computers with a deeper ‘understanding’
of music and its perceptual impact on humans [177].
In Western tonal music, expressed emotion is typically attributed to several
musical features like modality (major/minor), tempo (slow/fast), timbre (the
quality or characteristic of a sound), and, of course, song lyrics. Songs based on


the major mode tend to sound joyous while those based on the minor mode tend
to sound melancholic. Slow music is often sad or calming, while fast music may
express excitement or anger. Musicians and music arrangers will often strategi-
cally choose musical instruments to convey precisely what they intend to and will
often structure their musical piece to generate tension and release as a means to
convey movement and control musical expectancy. While skilled musicians learn
these “rules” through years of practice and study, non-musicians (presumably a
large percentage of the audience) can also intuitively perceive and understand
what the musician is trying to convey. Not only this, but many listeners can per-
ceive even subtle changes in emotion and expression that arise between different
versions of a song, or between different performers performing the same piece
of music. In fact, performers often use such differences in expressive quality to
render a unique interpretation of a piece of music and inject some aspect of their
own “personality” and emotion into the music [8, 68]. However, it has proven to
be a challenge to program algorithms to capture this subtle yet rich information
present in musical audio.
In this thesis, we attempt to computationally disentangle some of the possible
factors underlying music emotion through the use of mid-level perceptual features,
which are intuitive musical qualities that fall in the middle of a hierarchy of
increasingly “semantic” audio-based features [39, 63]. Our goal is twofold. First,
we would like to introduce the notion of explainability into music emotion models
that allow us to investigate the relationships between the mid-level perceptual
features and emotion predictions (Chapter 4 and Chapter 5). Second, we would
like to focus on Western tonal solo piano music and explore the utility of these
features in modelling the subtle variation of emotion across performances by
different pianists of a set of classical piano pieces (Chapter 6 and Chapter 7).
Additionally, we also investigate augmenting the mid-level feature space using
approximations of two additional perceptual features, and use the improved
model to demonstrate real-time prediction of intended emotions (Chapter 8). A
key message of this thesis is that high-level music information retrieval tasks
(such as emotion recognition) can be improved in terms of accuracy, robustness,
and interpretability through the use of mid-level perceptual features. The work
done in this thesis builds on existing research in music emotion recognition and
explainable machine learning.

Music Emotion Recognition

Music Emotion Recognition (MER) is a relatively recent research area under


the broader Music Information Retrieval (MIR) field that aims to extract and
predict attributes related to perceived emotional content in, or induced emotions
from music.1 Although the relation between music and emotion has intrigued
researchers since the 1930s [81], it is only in the past couple of decades that

1 There is an important distinction between perceived and induced emotions [66]. Perceived emotion
refers to the emotion expressed or communicated by music, while induced emotion is felt by the
listener, in their body, in response to music.

it has been seen as a technological challenge suitable to be addressed using


computational methods. The impetus to this cause came from the sub-field of
music search and retrieval. In the words of Huron [95], “the most useful retrieval
indexes are those that facilitate searching in conformity with such social and
psychological functions. Typically, such indexes will focus on stylistic, mood, and
similarity information.”
Over the years, several approaches have been tried for music emotion recogni-
tion, and most of these can be generally described as machine learning problems,
consisting of four distinct parts: emotion taxonomy definition, collection of data,
feature extraction, and regression or classification [142]. The suitability for ma-
chine learning in music emotion recognition lies in the fact that it is hard to
define the exact relation between musical features and perceived emotions and
design algorithms that would generalise to a large variety of musical styles and
audiences. The best hope is to learn such a relation from real-world data.
Data related to music emotion is typically captured through direct annotation
processes: asking human raters to listen and rate selected pieces of music on some
kind of an emotion scale. The scale can be discrete (e.g. emotion tags or labels)
or continuous (e.g. valence and arousal2). Sometimes, metadata associated with
music, such as emotion tags, may also be used as training data for music emotion
models [116].
There are some unique factors that make music emotion recognition a chal-
lenging task: the inherent subjectivity of the task, the difficulty of annotating
a large number of songs with emotion attributes, and the so-called “semantic
gap” between the measurable low-level features in audio/lyrics/symbolic music,
and the high-level emotional attributes that we aim to predict [144]. Recent ef-
forts [74] have brought several important and traditionally overlooked aspects of
music emotion into focus, such as making the process more user-centric, context-
dependent and culturally sensitive. Gómez-Cañón et al. [74] also emphasise the
importance of explainability and interpretability in data-driven emotion recog-
nition models. A detailed description of concepts of music emotion recognition
and relevant past work is provided in Chapter 2.

Explainable Machine Learning

Several kinds of machine learning models, especially deep learning models


(networks consisting of multiple interconnected layers), are considered “black-
box”, referring to the opaqueness of their inner transformations on the input
data that result in the final output or decision. Even though the principle of
their operation can be understood, the basis of a particular decision, or what a
model has actually learned cannot be described in terms easily understandable
by humans. This can result in a wide range of undesirable consequences. The
lack of transparency causes these systems to be very difficult to debug, and leaves
room for bias to creep in with no proper way to analyse the model. As a result, it

2 Valence refers to the general positive or negative quality of an emotion, and arousal refers to the
intensity or degree to which the emotion is perceived. These emotion scales will be discussed in
detail in Chapter 2.

becomes difficult to trust the predictions of a model, especially if they are being
used in critical systems such as health or finance. Non-explainable models also
do not provide “actionable insights” about the predictions, since there are no
“if/then” connections between the inputs and the outputs. This can make the
overall system less user-friendly.
Explainable Machine Learning, or Explainable AI (XAI) is a field of artificial
intelligence (AI) that aims at making models and model predictions understand-
able by humans. Some models are interpretable by construction, such as linear
models and decision trees. In a linear model, the learned weights of the input
features can be interpreted as importance values. A decision tree produces
an output based on learned if/then/else rules which could be used to trace an
input to its prediction, thus providing full transparency. In other complex models,
we need to either introduce additional structural changes in a model, or analyse
a particular input-output pair using extrinsic algorithms.
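To make the contrast concrete, here is a minimal sketch (not taken from this thesis) of an "interpretable by construction" model: a linear regression fit on a few hand-crafted features, whose learned weights can be read directly as importance values. The feature names and synthetic data are purely illustrative assumptions.

```python
# Minimal sketch of an intrinsically interpretable model: a linear regression
# whose learned weights serve as feature importance values.
# Feature names and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # e.g. tempo, minorness, RMS energy
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)  # toy "arousal" target

model = LinearRegression().fit(X, y)
for name, w in zip(["tempo", "minorness", "rms_energy"], model.coef_):
    print(f"{name:12s} weight = {w:+.2f}")       # sign = direction, magnitude = importance
```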
There are several properties of explanations and explanation methods to con-
sider when developing useful interpretable machine learning systems. Relevant
to us in this thesis are: the expressive power of a method, and comprehensibility
of an explanation. Expressive power of a method refers to the ‘language’ or
structure of the explanations the method is able to generate. It could generate
if/then rules, decision trees, a weighted sum, natural language or something else
[137]. Comprehensibility of an explanation refers to how easily the explanations
themselves are understood by the target audience. When dealing with music
emotion models, it makes sense to explain predictions on the basis of features
that are “musically meaningful” and are informative for a human analysing the
model. This motivates the use of mid-level perceptual features (musically relevant
features that can be understood by most humans) in our work.
The basic principles and methods of explainability in machine learning are
described in more detail in Chapter 3, which serves as a primer to this topic for
the interested reader.

1.1 thesis outline


We now provide an outline of the structure of this thesis, with each subsection
corresponding to a chapter in the thesis. Chapter 2 and Chapter 3 are introduc-
tory chapters for music emotion recognition and explainable machine learning,
respectively, and serve as supplementary pre-requisites for the core work done in
this thesis.

1.1.1 Mid-level Features as Explanatory Variables for Music Emotion


Prediction (Chapter 4)

We begin by asking if music emotion, a high-level concept, can be modelled as


a function of mid-level perceptual features within a deep-learning architecture
framework (“Perceive”). The aim is to retain the performance accuracy of deep
end-to-end models, while also providing a means to explain predictions using
a handful of musically meaningful features, or explanatory variables. In this
endeavour, we turn to the mid-level perceptual features introduced by Friberg


et al. [63] and Aljanaki and Soleymani [9].
A set of seven mid-level features, namely: melodiousness, articulation, rhythmic
complexity, rhythmic stability, dissonance, tonal stability, and mode (or minorness), are
introduced in Aljanaki and Soleymani [9]. In addition, they propose learning
these features in a data-driven way and provide a dataset of audio clips annotated
with the mid-level feature values associated with each clip, collected through
crowd-sourcing.
In this chapter, we propose a bottleneck architecture with an intermediate layer
trained to predict mid-level features followed by a single linear feed-forward
layer that predicts the final emotion values from the mid-level features. This
architecture enables explainability of emotion predictions in the form of linear
weights learnt by the model between the mid-level and final layers.
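To fix ideas, the following is a minimal PyTorch sketch of such a bottleneck: a convolutional encoder predicts the seven mid-level features, and a single linear layer maps them to arousal and valence. This is an illustrative reconstruction under assumed layer sizes, not the exact architecture or hyperparameters used in Chapter 4.

```python
# Illustrative PyTorch sketch of the mid-level bottleneck idea (assumed sizes,
# not the exact Chapter 4 architecture): CNN -> 7 mid-level features -> single
# linear layer -> 2 emotion dimensions (arousal, valence).
import torch
import torch.nn as nn

class MidLevelBottleneck(nn.Module):
    def __init__(self, n_midlevel: int = 7, n_emotion: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(              # operates on mel-spectrogram "images"
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_midlevel = nn.Linear(32, n_midlevel)        # bottleneck layer
        self.to_emotion = nn.Linear(n_midlevel, n_emotion)  # interpretable linear map

    def forward(self, spec: torch.Tensor):
        # spec: (batch, 1, n_mels, n_frames)
        midlevel = self.to_midlevel(self.encoder(spec))
        emotion = self.to_emotion(midlevel)
        return midlevel, emotion   # weights of to_emotion explain the emotion outputs

model = MidLevelBottleneck()
mid, emo = model(torch.randn(4, 1, 128, 256))   # dummy batch of spectrograms
```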
We test three variants of this bottleneck network: independent, sequential and
joint (differentiated by whether the intermediate and final layers are trained
independently from separate datasets, one after the other with the outputs of
the mid-level model constituting the inputs of the linear layer, or jointly in a
multi-task learning framework), and compare them to a non-bottleneck model
trained end-to-end with emotion labels. To investigate whether the bottleneck
impairs performance, we compute a measure called cost of explainability. We find
that there is negligible impairment of performance, while we gain interpretability.
Interpretations can be derived for the model itself (by interpreting the learned
weights), or for single inputs (by looking at the effects of each mid-level feature
on the final prediction), or for an entire set of inputs (by looking at the range of
effects of each mid-level feature).

1.1.2 Two-Level Explanations: Mid-level Features and Input Components


(Chapter 5)

Explanations based on intermediate features still leave the black-box between the
actual inputs and the intermediate layer unexplained. In this chapter, we address
this by proposing a two-level explanation approach aimed at explaining the mid-
level predictions using components from the input sample (“Trace”). We explore
two approaches to decompose the input into components: 1) using spectrogram
segments, and 2) using sound sources (individual instrument tracks) of the input
music. To explain the positive and negative effects of the components on mid-level
predictions, we use LIME (Local Interpretable Model-agnostic Explanations)
[155] and a variant of LIME for audio components, audioLIME [84]. We also
demonstrate the utility of this method in debugging a biased emotion model
that overestimates the valence for hip-hop songs. (This is joint work with my
colleague Verena Praher (née Haunschmid).)
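The core LIME idea used here can be sketched in a few lines: toggle interpretable components of the input (spectrogram segments or separated source tracks) on and off, query the black-box model, and fit a weighted linear surrogate whose coefficients give each component's effect. The sketch below is a simplified, self-contained illustration with assumed interfaces, not the actual LIME or audioLIME implementation.

```python
# Self-contained toy sketch of the LIME-style procedure (assumed interfaces,
# not the actual LIME/audioLIME code): perturb interpretable components,
# query the black-box model, and fit a weighted linear surrogate.
import numpy as np
from sklearn.linear_model import Ridge

def lime_like_effects(components, compose, predict, n_samples=500, seed=0):
    """components: list of input parts (e.g. source tracks);
    compose(mask, components): rebuilds an input keeping only the masked parts;
    predict(x): black-box scalar output (e.g. one mid-level feature)."""
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_samples, len(components)))
    preds = np.array([predict(compose(m, components)) for m in masks])
    # weight perturbed samples by how close they are to the full, unmasked input
    weights = np.exp(-(len(components) - masks.sum(axis=1)) / len(components))
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return surrogate.coef_   # per-component positive/negative effect
```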

1.1.3 Transferring Mid-level Features to Solo Piano Music (Chapter 6)

Next, we narrow in on one of our goals described earlier – to be able to capture the
subtle variations in expressive character between recordings of different pianists
playing the same piece of music. To this end, it is necessary to ensure that any
model we train on available datasets also works well on solo piano music.
Following evidence from Cancino-Chacón et al. [37], which showed that mid-level
features are effective in modelling expressive character of piano performances,
we choose to transfer our mid-level model to solo piano performances. However,
due to relatively few solo piano clips being present in the Mid-level Dataset (on
which the mid-level model was trained), it is hard to justify using the model on
data consisting entirely of that genre of music. Thus, we use an adaptive training
strategy for the mid-level feature extractor using unsupervised domain adaptation
[71], improving its performance for the target domain, and in turn, giving a better
modelling of expressive character. We also propose a novel ensemble-based self-
training method for further improving the performance of the final adapted
model. This chapter details our methods and approach for this domain-adaptive
transfer (“Transfer”).
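As a point of reference, a common building block in unsupervised domain adaptation of the kind cited above [71] is a gradient reversal layer, which lets a domain classifier push the feature extractor towards domain-invariant representations. The PyTorch sketch below shows only that building block; it is an illustrative assumption, not the full adaptation pipeline described in Chapter 6.

```python
# Minimal PyTorch sketch of a gradient reversal layer, a standard ingredient of
# the unsupervised domain adaptation approach referenced above [71].
# Illustrative only; Chapter 6 describes the actual training pipeline.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reverse (and scale) gradients flowing into the feature extractor,
        # pushing features to be indistinguishable across source/target domains
        return -ctx.lambd * grad_output, None

def grad_reverse(features: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(features, lambd)
```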

1.1.4 Modeling Emotion in Bach’s Well-Tempered Clavier (Chapter 7)

Delving deeper into the effectiveness of mid-level features for capturing and
explaining expressivity and emotion in piano performance, in this chapter, we
take a focused look at modelling the perceived arousal and valence for Bach’s
Well-Tempered Clavier Book 1 performed by six different famous pianists. We
compare mid-level features with three other feature sets – low-level audio features,
score-based features (derived from the musical score), and features derived from
a deep neural network trained end-to-end on music emotion data. We specifically
quantify how well these feature sets explain emotion variation 1) between
pieces, and 2) between different performances of the same piece. We find that
in addition to an overall effective modelling of emotion, mid-level features also
capture performance-wise variation better than the other features. This indicates
the usefulness of these features in our overarching goal of modelling subtle
variations in emotional expression between different performances of the same
piece (“Disentangle”). We also test the features on their generalisation capacity
for outlier performances – those performances (one for each piece) that are most
distant from the rest on the arousal-valence plane, and are held-out during
training. We find that mid-level features outperform the other feature sets in this
test, thereby indicating their robustness and generalisation capacity.

1.1.5 Augmenting the Mid-level Feature Space and Demonstrating Model


Predictions vis-à-vis Intended Emotion (Chapter 8)

To wrap up the thesis, we prototype possible routes to augment the mid-level


feature space using two features conspicuously missing from Aljanaki and Soley-
mani [9]: perceptual speed and dynamics. We approximate these perceptual features
by computing them in a more direct way, based on our musical intuition rather
than on empirical user perception data, and analyse their impact on emotion
prediction together with the original seven mid-level features. As a final demon-
stration of our emotion prediction model using the augmented set of mid-level
features, we predict continuously varying emotions in a recording of a musician


playing a tune in different ways to encode and communicate (to the listener)
specific intended emotions (“Communicate”).
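For a flavour of what such direct approximations might look like, the sketch below computes two simple audio-based proxies with librosa: frame-wise RMS level for dynamics and onset density for perceptual speed. These are illustrative assumptions, not the exact features introduced in Chapter 8, and the file name is hypothetical.

```python
# Illustrative proxies (assumptions, not the exact Chapter 8 features) for the
# two added perceptual dimensions: dynamics via RMS level, and perceptual
# speed via onset density, both computed with librosa.
import librosa

y, sr = librosa.load("performance.wav", sr=22050)        # hypothetical recording

rms_db = librosa.amplitude_to_db(librosa.feature.rms(y=y))        # dynamics proxy
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
onset_density = len(onset_times) / (len(y) / sr)                   # events per second

print(f"mean level: {rms_db.mean():.1f} dB, onset density: {onset_density:.2f}/s")
```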

1.2 contributions
This thesis makes several novel advances in the area of explainability in music
information retrieval. The main contributions are summarised as follows:

1. Introducing the notion of mid-level feature-based explanations to deep


music emotion models (Chapter 4). The proposed bottleneck architecture
is easy to understand and implement, and any model could potentially be
made explainable with minimal change given a relevant dataset.

2. Demonstrating the application of the well-known LIME (Local Interpretable


Model-agnostic Explanations) algorithm in the music domain. We propose
two-level explanations and use LIME to generate explanations for mid-level
feature predictions on the input domain – spectrograms or sound sources
(Chapter 5). We also show how the two-level explanations could potentially
be used in a model-debugging setting.

3. Identifying that different genres/styles of music may introduce domain shift


for mid-level feature models, and proposing a domain adaptation approach
to address this issue (Chapter 6). We also propose a new ensemble-based
self-training method to refine the adaptation process that may be applicable
in other domain adaptation tasks as well.

4. Comparing mid-level features with traditionally used low-level features for


emotion modelling and demonstrating their effectiveness in reducing the so-
called semantic gap in music information retrieval. The effectiveness of these
features is exhibited in a series of experiments to model and disentangle the
variation of emotion between different performances of pieces of classical
piano music (Chapter 7).

5. Proposing possible routes to augment the mid-level feature space using


analogues of perceptual speed and dynamics. Using the augmented set of
mid-level features, we also demonstrate real-time visualisation of decoded
emotion and compare it to a musician’s intended emotions (Chapter 8).

1.3 publications
The main chapters of this thesis build on the following publications (in order of
appearance in the thesis chapters):

• S. Chowdhury, A. Vall, V. Haunschmid, G. Widmer


Towards Explainable Music Emotion Recognition: The Route via Mid-level
Features, In Proc. of the 20th International Society for Music Information Retrieval
Conference (ISMIR 2019), Delft, The Netherlands

• S. Chowdhury, V. Praher, G. Widmer


Tracing Back Music Emotion Predictions to Sound Sources and Intuitive
Perceptual Qualities, In Proc. of the Sound and Music Computing Conference,
(SMC 2021), Virtual

• V. Haunschmid, S. Chowdhury, G. Widmer


Two-level Explanations in Music Emotion Recognition, Machine Learning
for Music Discovery Workshop, International Conference on Machine Learning
(ICML 2019), Long Beach, USA

• S. Chowdhury and G. Widmer


Towards Explaining Expressive Qualities in Piano Recordings: Transfer
of Explanatory Features via Acoustic Domain Adaptation, In Proc. of the
International Conference on Acoustics, Speech and Signal Processing (ICASSP
2021), Toronto, Canada

• S. Chowdhury and G. Widmer


On Perceived Emotion in Expressive Piano Performance: Further Experi-
mental Evidence for the Relevance of Mid-level Features, In Proc. of the 22nd
International Society for Music Information Retrieval Conference (ISMIR 2021),
Virtual

• S. Chowdhury and G. Widmer


Decoding and Visualising Intended Emotion in an Expressive Piano Perfor-
mance, Late-breaking Demo Session, 23rd International Society for Music Infor-
mation Retrieval Conference (ISMIR 2022), Bengaluru, India. (Under review)

In addition to the above, the following publications contain contributions resulting


from my work:
• K. Koutini, S. Chowdhury, V. Haunschmid, H. Eghbal-zadeh, G. Widmer
Emotion and Theme Recognition in Music with Frequency-Aware RF-
Regularized CNNs, In Proc. of MediaEval Multimedia Benchmark 2019, Sophia
Antipolis, France

• C. Cancino-Chacón, S. Peter, S. Chowdhury, A. Aljanaki, G. Widmer


On the Characterization of Expressive Performance in Classical Music: First
Results of the Con Espressione Game, In Proc. of the 21st International Society
for Music Information Retrieval Conference (ISMIR 2020), Montreal, Canada
Part I

BACKGROUND
2 A PRIMER ON MUSIC EMOTION RECOGNITION

2.1 Perceived, Induced, and Intended Emotions . . . . . . . . . . . . . 11


2.2 Approaches to Music Emotion Recognition . . . . . . . . . . . . . . 12
2.3 Emotion Taxonomies . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Challenges in Music Emotion Recognition . . . . . . . . . . . . . . 19

Music Emotion Recognition (MER) is a task – under the broader field of Music
Information Retrieval (MIR) – that aims at developing computer systems capable
of recognising the emotional content in music, or the emotional impact of music
on a listener. MER is an interdisciplinary area that combines research from music
psychology, audio signal processing, machine learning, and natural language
processing. Research on emotional analysis of music has a long history, dating
back to the 1930s [89], but has gained newfound interest in recent decades due
to the development of technologies that have enabled direct application of emotion
recognition systems. The availability of large volumes of good quality music
recordings in digital format, the development of search-and-retrieval systems
for music, streaming platforms, and the progress in digital signal processing and
machine learning have all enabled interest in, and development of, automatic
music emotion recognition systems [104].
The aim of this chapter is to provide a brief overview of current and past
approaches in music emotion recognition, while also covering some aspects of
the psychology of music emotion that are relevant for this thesis. We begin by
noting the different types of emotion – perceived, induced, and intended – that are
important for setting the scope of emotion datasets and recognition approaches.
Next, we describe past works in music emotion recognition, followed by an
in-depth look at the typical pipeline for a MER system that involves dataset
collection, model training, and model evaluation. We then explore some of the
models for naming and representing emotions from studies in psychology that
are relevant to music emotion recognition. Here we will also discuss Russell’s
two-dimensional model of representing emotions [159], which is used extensively
in this thesis.


2.1 perceived, induced, and intended emotions


Humans are capable not only of perceiving (recognising) what emotion a piece
of music is trying to convey; emotions are often also elicited within (felt
by) listeners. These two types of emotions may not be the same. For example, a
person may recognise that a piece of music is supposed to express anger, but they may not
feel angry themselves while listening to it. Similarly, a listener might recognise
that a song is supposed to express happiness, but they may feel a bittersweet
nostalgia from a memory recalled by that song.
Before beginning to build MER systems, it is useful to identify what we mean by
“emotion” in a specific context, as it defines the scope of the system and dictates
what kind of data to use. In psychology research, an important distinction is
made between perceived and induced emotions [165]. A third type of emotion,
called intended emotion, is sometimes also identified in a musical context [142],
which relates to what emotion a musician intends to encode into the music. Thus,
the three kinds of emotions associated with music are:

• Perceived emotion: concerns the emotion the listener identifies when listen-
ing to a song, which may be different from what the composer attempted
to express and what the listener feels in response to it.

• Induced emotion: relates to the emotion that is felt by (evoked in) the
listener in response to the song. Also referred to as elicited emotion.

• Intended emotion: pertains to the emotion that the composer or performer


aimed to transmit with the musical piece or performance.

The relation between perceived and induced emotions has been a subject
of discussion among researchers, highlighting the complex nature of music
emotion and its manifestations. An illustrative example is the so-called ‘paradox
of negative emotion’, where music generally characterised as conveying negative
emotions (e.g., sadness, depression, anger) is often judged as enjoyable [146].
Most MER systems aim at recognising perceived emotion. This is because per-
ceived emotion is a “sonic-based phenomenon, tightly linked to auditory percep-
tion, and consisting in the listener’s attribution of emotional quality to music”
[146]. It tends to have a high inter-rater agreement when compiling emotion
data from listeners (different listeners are more likely to agree on the perceived
emotion independently of musical training or culture) [142]. On the other hand,
induced emotion is an individual phenomenon, influenced largely by personal
experiences, memory, context, and pre-existing mood.
We note here that the datasets used and experiments conducted in this thesis
all concern perceived emotion, as we are interested in analysing the emotion
decoded by listeners from the music content. Most of the datasets available in
the literature on music emotion also describe perceived emotion, which makes
it possible for our models (presented later in this thesis) to learn from various
different emotion datasets. However, in a demonstration of real-time emotion
recognition in Chapter 8, we have access to a musician’s intended emotions, and
we visualise the recognised emotions alongside the intended emotions.

2.2 approaches to music emotion recognition


In this section, we first take a look at some of the past research that has shaped
music emotion recognition research. Next, we describe the four components of a
typical music emotion recognition pipeline.

2.2.1 Past Work in MER

Some of the early works on automatic music emotion recognition were done in
the 2000s. Huron [95] explored methods to characterise musical mood using
emotion representation models, and Liu et al. [118] used Gaussian Mixture Models
(GMMs) to predict musical mood from low-level features extracted from audio
content (such as autocorrelation-based tempo, RMS or Root Mean Squared energy
from the time domain signal, and spectral features such as spectral centroid
and bandwidth). Yang and Lee [180] framed music emotion intensity prediction
as a regression problem and used Support Vector Regression (SVR) to predict
emotional intensity from low-level acoustic features. Li and Ogihara [117] used
a Support Vector Machine (SVM) based multi-label classification approach to
classify music emotion into thirteen adjective groups, and six supergroups. They
used a small dataset (499 audio clips) annotated by one person, from which
30 low-level acoustic features were extracted to train and evaluate the model.
Yang et al. [182] cast the goal as a regression problem, and used Multiple Linear
Regression (MLR), SVR, and AdaBoost to train regressors for predicting arousal
and valence values1 . They used a dataset annotated by several human listeners
who were educated on the purpose of the experiment and the essence of the
emotion annotation procedure and the emotion scales.
Over the years, researchers expanded the feature space using features such
as MFCCs (Mel-Frequency Cepstral Coefficients), periodicity histograms, and
fluctuation patterns. Dataset sizes and experiments also grew, and techniques
such as dimensionality reduction (using Principal Component Analysis) started
being applied for better emotion modelling [125, 134]. More recently (post
2010), there have been several works using deep learning models such as Long
Short Term Memory (LSTM), deep belief networks, and Boltzmann machines [93,
163]. Weninger et al. [175] used segmented feature extraction and deep recur-
rent networks to predict continuous emotion across time. Chaki et al. [40] used
attention-based LSTMs for the same goal. Several modern methods use Convo-
lutional Neural Networks (CNNs, or ConvNets) for automatic feature learning
from mel-spectrogram inputs. Delbouys et al. [47] used ConvNets for feature
learning from a large music collection of around 18,000 songs, and feed-forward
dense layers attached to the ConvNet for predicting arousal and valence.
The interested reader may refer to Kim et al. [104] for an extensive survey of
early MER methods, and to Han et al. [81] for an extensive survey of modern
MER methods.

1 Arousal and valence scales are described in Section 2.3.



2.2.2 A Typical MER Pipeline

Gómez-Cañón et al. [74] provide a breakdown of a traditional MER pipeline,


bringing out its various components or stages: emotion taxonomy definition,
dataset creation, feature extraction, and training emotion models. A visual depic-
tion of these four components, adapted from Gómez-Cañón et al. [74], is shown in
Figure 2.1. Let us draw on this view of the pipeline to look deeper into the MER
components, because such a structured pipeline is helpful for understanding the
experiments in this thesis.

1. Taxonomy Definition: Emotion taxonomy refers to the scheme chosen for


representing or naming emotions. All further steps depend on this, so it is
crucial to choose one best suited for the application at hand. These schemes
are derived from psychology studies of emotion. A discussion of selected
emotion representation models (schemes) is given in Section 2.3.

2. Dataset Creation: Based on the annotation scheme chosen in the previous


step, a labelled dataset is constructed, with songs and associated labels (or
ratings) gathered through crowd-sourced interfaces or controlled listening
environments.

3. Feature Extraction: Features, usually based on signal processing method-


ologies, and their statistical properties are extracted from the audio and
pre-processed to be ingested by a machine learning algorithm during train-
ing and evaluation. In modern deep learning based systems, the feature
extraction stage is a part of the machine learning model itself, which learns
to extract relevant features by learning end-to-end from input data (in the
form of audio spectrograms) and annotations.

4. Training and Evaluation: A machine learning model is trained on the


annotated dataset, with a hold-out set used for evaluation of the model.

An important factor to consider during the four stages of the pipeline is whether
the music emotion recognition system should output static or dynamic emotion
predictions. Static emotion prediction for a song means that the model takes
into account the entire song and outputs a single instance of predicted values
of emotion descriptors for the song. In dynamic emotion prediction, the model
predicts emotions for several different points along the song – often the prediction
windows are as small as a few seconds. In this thesis, we will principally deal
with static emotion recognition (however, we will also demonstrate dynamic
emotion recognition in Chapter 8).
Among the four components described above, we will describe taxonomy
definition and feature extraction in detail in this chapter. Dataset creation is
beyond the scope of this thesis, but we elaborate on some existing datasets that are
used in this thesis in Appendix a. Model training and evaluation is dependent on
the particular context in which the MER is to be used, and we describe this for
our context in the main chapters of this thesis. First we look at feature engineering
and extraction; emotion taxonomies are described in Section 2.3.

2.2.3 Feature Engineering and Extraction

In traditional machine learning, feature extraction algorithms are handcrafted


by domain experts who take into account the task and data at hand and design
specific features for those particular conditions. In the acoustic domain, these
features are typically based around signal processing methods. For example, a
common scheme is to first convert the audio to a time-frequency representation
(called a spectrogram) using short-time Fourier transform, and then extract
spectral features, such as spectral centroid, spectral bandwidth, etc. Because
these features are more proximal to the physical data representation, they are
usually referred to as low-level features (as opposed to high-level features like genre,
instruments, tags, rhythm patterns, etc.).
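As a concrete illustration of this scheme, the snippet below extracts a magnitude spectrogram and a few such low-level descriptors with librosa; the library choice, file name, and parameter values are assumptions for demonstration, not a prescription from the works discussed here.

```python
# Illustrative low-level feature extraction with librosa (library and settings
# are assumptions): STFT magnitude spectrogram plus a few spectral and energy
# descriptors, summarised by their means.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=22050)                 # hypothetical audio clip
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))    # magnitude spectrogram

low_level = {
    "rms_energy": float(librosa.feature.rms(y=y).mean()),
    "spectral_centroid": float(librosa.feature.spectral_centroid(S=S, sr=sr).mean()),
    "spectral_bandwidth": float(librosa.feature.spectral_bandwidth(S=S, sr=sr).mean()),
    "mfcc_mean": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),
}
```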
Panda et al. [144] recently reviewed several features specifically for music
emotion recognition drawn from music-theoretic concepts. For example, rhythm
features like onset time, duration of events, and event density are related to
tempo, which is in turn related to emotion (a high tempo is associated with several
emotions, such as happy, active, and angry). Similarly, features such as RMS energy,
sound level, timbral width, and low energy rate are related to dynamic levels,
which in turn mediate emotion (low/soft dynamic levels tend to be associated
with melancholic, peaceful, and sad emotions).
Modern music emotion recognition systems are progressively being built
around the end-to-end learning framework, in which CNNs (Convolutional
Neural Networks) hold centre-stage. Liu et al. [120] provide an exemplary analy-
sis of such methods. CNNs are comprised of a series of convolutional layers that
learn weights to extract features automatically from the input representation. The
input representation is most commonly magnitude spectrograms presented in
the form of an image. This CNN-based feature learning framework is the method
that is used in this thesis.

2.3 emotion taxonomies


Research in psychology has provided us with various ways to represent emotions.
Since emotion is a rather abstract concept, and given that the ways in which
people express emotion varies with cultural background and language, it is
useful to encode emotion in a systematic way for the purposes of analysis and
study. In the context of music emotion recognition, a consistent system of naming
and representing musical emotions is essential to formulate a dataset collection
strategy, and eventual training and evaluation strategies. Over the decades, several
different models of representing emotions have been proposed by psychologists,
each suitable for certain applications. Discrete models of emotion were some of
the first such models. These models represent emotions using words or categories
of words, similar to how humans would usually describe emotions – such as
happiness, surprise, sadness, anger, disgust, contempt, and fear. Hevner’s model
(1936) is perhaps the most well-known model for music emotion [89], while
Ekman’s model of basic emotions (1971) explored some universal emotions from
facial expressions [55]. Later on, dimensional models of emotion were proposed

[Figure: four blocks showing the pipeline stages: Taxonomy Definition (dimensional: arousal/valence; categorical: e.g. happy, sad, serene), Dataset Creation (annotations), Feature Extraction (from music content and context), and Model Training and Evaluation (classification/regression).]

Figure 2.1: Traditional MER Pipeline, adapted from Gómez-Cañón et al. [74]

to capture the continuous spectrum of emotions that humans experience, the


most well-known of which is Russell’s two-dimensional model (1980) of arousal
and valence [159]. This model is used in this thesis extensively.
We describe some of these models in detail below. We will look at both discrete
models of emotion (Ekman’s emotions, Hevner’s model, and the Geneva Emo-
tional Music Scale), and dimensional models of emotion (Russell’s Circumplex,
and the Schimmack and Grob model). Other models based on slightly different
dimensions or categories have also been proposed in the literature (such as the
Vector model of Bradley et al. [33], Plutchik’s model [148], and the PAD emotional
state model [132]), but we do not discuss these here.

2.3.1 Discrete Emotion Models

The most natural ways people express emotions are through facial expressions and
words. Discrete emotion models are based around the mode of expression that
uses words, or more generally groups of words. For emotions transmitted using
language, Ekman’s set of emotion words constitutes a semantically distinct set
that spans a wide range of emotions. In the context of music emotion, certain
words can convey slightly different or nuanced emotions, which is what makes
Hevner’s model and the Geneva scale relevant.

1) Ekman’s Basic Emotions


Based on evolutionary considerations and the semantic distinction, Ekman
[55] proposed six basic emotions: happiness, sadness, fear, disgust, anger,
and surprise. This model is one of the most widely used models in human
emotion expression studies, due to its simplicity and range of emotions,
and its relevance to semantic clustering of language [18]. This model has
also been adapted for music emotion recognition. Mohn et al. [136] showed

[Figure: Hevner’s adjective circle, with 67 adjectives (e.g. merry, joyous, exciting, vigorous, solemn, sad, dreamy, serene) grouped into eight clusters arranged around a circle.]

Figure 2.2: Hevner’s adjective circle, redrawn from Hevner [89] (colours added to the
clusters by the present author). Each cluster contains adjectives with similar
meaning in terms of music emotion, and neighbouring clusters represent close
emotions. One adjective in each cluster, which describes the cluster, is marked
in bold.

that all of the six basic emotions are detectable in music. One drawback
of this model is that it is not easy to define other nuanced emotions in
terms of these basic emotions, which may lead one to question the notion
of “basic-ness” of these emotions [140].

2) Hevner’s Adjective Circle (Figure 2.2)


Music is capable of communicating a wide spectrum of emotions, and
categorising music emotion into only six basic emotions misses a lot of that
variation. Hevner’s system of representing emotions pertains specifically to
music emotions, and consists of grouped adjectives arranged in the form of
a circle. Each group contains adjectives that are close in meaning and form a
distinct emotion category. Neighbouring groups are emotionally close while
groups on opposite ends of the circle represent contrasting emotions. This
model has stayed relevant in modern music emotion recognition research
[19, 37], and provides a foundation for other discrete, and even dimensional,
models of music emotion.

3) Geneva Emotional Music Scale


The GEMS (Geneva Emotional Music Scale) [183] comprises nine categories
of musical emotions (wonder, transcendence, tenderness, nostalgia, peace-
fulness, energy, joyful activation, tension, sadness). These nine categories
are further divided, at a lower level, into a total of 45 terms, while at the
higher level they condense into 3 “superfactors” (sublimity, vitality,
and unease). Some issues with this model, as mentioned in [54], are the
small size of the experiment and over-representation of classical music.

2.3.2 Dimensional Emotion Models

A major drawback of discrete emotion models is that they do not represent


the complex spectrum of emotions that humans are capable of experiencing
[161]. People often report feeling multiple emotions at the same time, leading
to the hypothesis that an emotional space that lies in a continuum and allows
for interactions may be a more accurate representation of human emotional
experience [151]. This has resulted in development of various dimensional models
of emotion, most often consisting of two dimensions. Dimensional models are
based on the suggestion that emotional states arise from the combination of two
distinct neurophysiological systems (activity regions of the brain) relating to
valence (positive or negative emotion) and arousal (intensity of emotion) [150,
160]. Variations of the arousal-valence model have also been proposed with
slightly different axes, for instance splitting the arousal axis into tension and
energy axes, which, along with valence, gives a three dimensional model.
We describe these two dimensional models here. The first one is the two-
dimensional model with arousal and valence axes (Russell’s circumplex), and the
second one is the three-dimensional model with tension, energy, and valence axes
(Schimmack’s model).

1) Russell’s Circumplex Model of Emotion (Figure 2.3)


The original circumplex model proposed by Russell is based on the notion
that affective states (discrete emotions) are not independent of one another,
but are related to each other, and that these can be represented as a circle on
a two-dimensional bipolar plane [159]. He placed eight emotion categories
on this plane, starting from vertical, and moving clockwise in 45◦ steps:
arousal, excitement, pleasure, contentment, sleepiness, depression, misery,
and distress. Through several experiments with participants involving cat-
egory sorting and circular ordering tasks of verbal emotion descriptors
(based on semantic similarity of the words), 28 terms were placed on the
circle. The participants also provided similarity scores between the terms.
Through multidimensional scaling analyses of the similarity scores, the
two-dimensional nature of these terms (rather than any other number of
dimensions) was validated. The positions of the terms as presented in Figure
2 of [159] are shown in Figure 2.3 and used in this thesis.

The vertical and horizontal axes of Russell's circumplex can be described
using arousal and valence, which have a neurophysiological basis, as later validated
by [150].
• Arousal (or intensity) is the level of autonomic activation that an event
creates, and ranges from calm (or low) to excited (or high) [26].
• Valence is the level of pleasantness that an event generates and is
defined along a continuum from negative to positive [26].
The two-dimensional plane can be divided into four regions or quadrants:
1) excitement, representing happy and energetic emotions (Q1); 2) anxiety
or distress, representing frantic and energetic emotions (Q2); 3) depres-
sion, referring to melancholic and sad emotions (Q3); and 4) contentment,
representing calm and positive emotions (Q4) [142].
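As a small, purely illustrative sketch (hypothetical code, not part of this thesis; it assumes valence and arousal values centred at zero), this quadrant assignment can be written as:

    def av_quadrant(valence: float, arousal: float) -> str:
        # Map a (valence, arousal) pair to one of the four quadrants described above.
        # Values are assumed to be centred at 0 (negative = low, positive = high).
        if arousal >= 0:
            return "Q1: excitement" if valence >= 0 else "Q2: anxiety/distress"
        return "Q4: contentment" if valence >= 0 else "Q3: depression"

    # e.g. a happy, energetic piece lands in Q1:
    print(av_quadrant(valence=0.7, arousal=0.6))  # -> Q1: excitement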

Figure 2.3: Russell’s circumplex, adapted from Figure 2 of Russell [159]

2) Schimmack and Grob model of emotion


Based on Russell’s circumplex, alternative dimensional models using differ-
ent labels in each axis have also been proposed in the literature [170, 173].
One of them is Thayer’s model, which suggests that the two underlying
dimensions of affect are two separate arousal dimensions: energetic arousal
and tense arousal. According to this, valence could be explained as varying
combinations of energetic arousal and tense arousal [170].
The Schimmack and Grob model of emotion [162] merges Russell’s and
Thayer’s models, obtaining valence, tense arousal, and energetic arousal
dimensions (illustrated in Figure 2.4).
Figure 2.4: The Schimmack and Grob model, with valence, energy arousal, and tension
arousal axes.

A brief summary of the discussed emotion models is presented in Table 2.1.

Emotion model                   Type         Summary
Ekman's basic emotions          Categorical  6 words, considered the basic emotions
Hevner's adjective circle       Categorical  8 clusters, 67 adjectives
Geneva Emotional Music Scales   Categorical  45 terms, 9 categories, 3 superfactors
Russell's circumplex            Dimensional  2 dimensions: arousal and valence
Thayer                          Dimensional  2 dimensions: energetic arousal and tense arousal
Schimmack and Grob              Dimensional  3 dimensions: valence, energetic arousal, tense arousal

Table 2.1: Summary of emotion representation models.

2.4 challenges in music emotion recognition


Once a complex system like music emotion recognition is brought out from a
purely research setting to usable (and potentially industrial) applications, several
challenges become apparent. Gómez-Cañón et al. [74] emphasise the need for
more user-centric systems to be developed that also have components of repro-
ducibility and contextuality built in. They also stress the importance of a
feedback loop – a link between the final evaluation stage and the initial taxonomy
definition to refine these concepts. An important component of future MER
systems that they point out is interpretability. They argue that the explainability
and interpretability of models are critical for evaluating the subjective constructions
of emotions, as these data-driven decisions might be used downstream for emotion
regulation applications, in addition to recommendation and search-and-retrieval.
Another area where current MER technologies fall short, even in the pure research
realm, is in developing methods for computer systems to recognise subtle variations
in emotional and expressive cues between different versions and renditions of a
song – something that humans are sensitive to. Widmer [177] argues for computer
systems capable of ‘understanding’ music to the degree that allows them to
be sensitive to such variations. In addition to that, current emotion recognition
systems lack the capacity to be sensitive to musical redundancy and structure, to
be aware of a listener’s expectations from a piece of music, and respond to the
affective needs of listeners and artists – goals that are also deemed important by
Widmer [177].
The research undertaken for the present thesis aims at addressing the ques-
tions of explainability in music emotion recognition, and of recognising subtle
emotional variations between renditions of the same piece of music. The next
chapter introduces the notion of explainability, and after that the main chapters
of the thesis follow.
3 A PRIMER ON EXPLAINABILITY IN MACHINE LEARNING

3.1 Defining Explainability/Interpretability   22
3.2 Interpreting a Linear Regression Model   24
3.3 Interpreting Black-box Models Using LIME   26
3.4 Evaluation of Feature-based Explanations   27
3.5 Explainability in Music Information Retrieval   29

An Achilles’ heel of machine learning is that even though a machine learning


system may perform remarkably well for a task, it is often impossible to under-
stand or delineate how exactly a model arrived at a particular prediction from
the input data. It is also not always clear what the model has actually learnt after
the training procedure, especially in end-to-end models. In deep learning, an
end-to-end model refers to models that learn to extract relevant features from
the raw input data, instead of ingesting data transformed by human-designed
feature-extraction algorithms. In practice, end-to-end models almost always out-
perform feature-based models. However, in addition to the non-interpretable
model weights that deep neural networks possess, automatic feature learning
closes the blinds on what the model actually “sees” in the input data. This lack
of transparency can hinder trustworthiness of the system, increase the difficulty
of debugging models, enable harmful bias to creep into real-world applications,
or even lead to potentially life-threatening decisions, for example in medical
settings.
The need to understand what a model is learning and how a model arrives at
its predictions has led to the evolution of a recent field of AI (Artificial Intelli-
gence), known as Explainable AI (XAI). Explainable artificial intelligence aims at
introducing methods and techniques to interpret AI decisions and predictions
such that they are understandable by human stakeholders. In this chapter, we take
a brief look at some such techniques to improve the interpretability of so-called
black-box models, as a foundation for the rest of the thesis. We first start with
defining this elusive concept of explainability and describing useful properties
in Section 3.1. Next, we look at interpreting a simple linear regression model
in Section 3.2. Then we describe how to use a popular algorithm, LIME (Local
Interpretable Model-agnostic Explanations), to explain a black-box model in Sec-
tion 3.3. In Section 3.4, we look at metrics to evaluate explanations. Finally, in

Section 3.5, we lay out the scope of explainability in music information retrieval
and some of the questions that research in this area is addressing.

Figure 3.1: A black-box model mapping an input to an output.

3.1 defining explainability/interpretability


It is difficult to have a precise definition for something like explainability that is
diverse in terms of origin, goals, and methods. Thus, the literature often resorts to
defining it in terms of what it allows us to do. Molnar [137] uses "explainability"
and "interpretability" interchangeably, and we shall do the same in this thesis,
unless a specific contextual distinction in meaning is specified¹. We borrow the
two definitions given in Molnar [137], which are taken from the referred sources:

“Interpretability is the degree to which a human can understand the


cause of a decision” [133].

“Interpretability is the degree to which a human can consistently


predict the model’s result” [103].

The umbrella term “explainable machine learning” captures the methods


and techniques to extract relevant knowledge from a machine learning model
concerning relationships contained in the data and learned by machine learning
models. Explainable methods are most commonly categorised based on the scope
of application (global, where explanations are derived pertaining to the overall
behaviour of the model, vs. local, where explanations for a specific input are
derived) or mode of application (model agnostic, methods that could be applied
on any model without investigating model parameters, vs. model specific, methods
that depend on the model type). Without going into detail for each of the several
explainability methods currently in practice, which would be out of scope for this
chapter, we describe methods pertinent for understanding this thesis.
Interpretability comes by design in algorithms where there is a clear and
intuitive understanding of the decision making process. Linear regression is an
example of such an algorithm, where there is a simple linear relationship between
the input and the output. We describe explainability for linear regression models
in Section 3.2.

1 Sometimes, a distinction between explainability and interpretability is drawn, where explainability


refers largely to a posteriori methods that make a model’s nature and behaviour understandable in
human terms, while interpretability refers to understanding the exact inner mechanics of a model
through a priori knowledge of the model’s architecture.

Figure 3.2: Two ways to describe explainable AI (XAI) approaches are shown here. XAI
Methods can be described based on the scope of application (global, where
explanations are derived pertaining to the overall behaviour of the model, vs.
local, where explanations for a specific input are derived) or mode of appli-
cation (model agnostic, methods that could be applied on any model without
investigating model parameters, vs. model specific, methods that depend on
the model type).

For black-box models, where interpretability is not built-in, various algorithms


have been devised. Examples include attribution methods like LIME [155], Grad-
CAM [164], and Shapley Values [166], prototype-based methods [43], influence
functions [106], contrastive explanations [49], and classical methods like partial
dependence plots and permutation feature importance (more relevant for struc-
tured data). In this thesis, LIME (Local Interpretable Model-agnostic Explanations)
is used, and we describe it in detail in Section 3.3.
Given so many different methods, how do we judge which method is suitable
for the application in question? Looking at some properties of explanations may
guide us towards evaluating their quality.

3.1.1 Properties of Explanations

A useful direction towards formalising explainability is to define some tangible


properties of explanations, which can in principle be measured. We look at the
properties of individual explanations here, as described in Molnar [137].

• Accuracy: A measure of prediction performance when the prediction is


obtained from the explanations in place of the machine learning model.

• Fidelity: A measure of how closely the explanations approximate the predic-


tions of a black box model. A low-accuracy explanation may be high-fidelity
if the black box model also has low accuracy.

• Consistency: A measure of how similar the explanations of models trained


on the same task and producing similar predictions are to each other. A
caveat is that while the model algorithms/architectures can be different (e.g.
linear regression vs. support vector machine), if the models use different
features to produce similar predictions, then a high consistency may not be
desirable.

• Stability: A measure of the robustness of explanations to similar instances.


Explanations for two similar instances are expected to be similar for a given
model. High stability is always desirable.

• Comprehensibility: This refers to how well humans understand the expla-


nations. Comprehensibility is difficult to define and measure but is one of
the most important properties. It depends not only on how the explanation
is presented, but also on the audience. Ideas for measuring comprehensi-
bility include measuring the size of the explanation (number of features
with a non-zero weight in a linear model, number of decision rules, etc.) or
testing how well people can predict the behaviour of the machine learning
model from the explanations.

• Certainty: The certainty, in terms of confidence of a prediction, is provided


as an output from some models. Explanations that take into account this
measure of certainty are useful in many scenarios.

• Degree of Importance: Reporting which features are the most important
contributors to a prediction is an important property to consider for expla-
nations. Methods that better delineate the relative importance of features in
this fashion may be considered to provide better explanations.

• Novelty: Does the explanation take into account and/or report whether a
data instance to be explained comes from a region far removed from the
distribution of the training data? In such cases, the model may be inaccurate
and the explanation may be useless. The concept of novelty is related to the
concept of certainty. The higher the novelty, the more likely it is that the
model will have low certainty due to lack of data.

• Representativeness: How many instances does an explanation cover? Ex-


planations can cover the entire model (e.g. interpretation of weights in a
linear regression model) or represent only an individual prediction (e.g.
Shapley Values).

While it may not always be possible in practice to quantify all of these proper-
ties, they form a strong foundation to aid the design of interpretability methods.
In addition to these properties for individual explanations, it is also useful to
consider properties of explanation methods. The reader is referred to Section
3.5 of Molnar [137] (Properties of Explanations) for further reading on this topic.
Doshi-Velez and Kim [52] also provide good descriptions of taxonomies of inter-
pretability evaluation, considering the application, human users, and functional
tasks, which may provide valuable insights into the philosophy of interpretability
for the interested reader.

3.2 interpreting a linear regression model


Some models are interpretable by construction. Linear regression, logistic regres-
sion, and decision trees are considered interpretable by construction because of

the simple relationship between their inputs and outputs. The easiest way to
achieve interpretability in a system is to use one of these as the predictive model.
Let us understand how by deep-diving into interpreting a linear regression model.
Linear regression is a linear approach for modelling the relationship between a
scalar response (also known as a dependent variable) and one or more explanatory
variables (also known as independent variables). A linear regression model
predicts the target as a weighted sum of the feature inputs. The linearity of the
learned relationship makes the interpretation easy [137].
Stating this mathematically, given that we have an input vector x = (x_1, x_2, . . . , x_p)
and want to predict a real-valued output y, the linear regression model has the
form

    f(\mathbf{x}) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j    (3.1)

Here, the β_j's are unknown parameters (that are to be learned/estimated from
training data) or coefficients, and the variables x_j (also called input features) can
come from different sources: quantitative inputs, transformations of quantitative
inputs (such as log, square-root, or square), basis expansions such as x_2 = x_1^2, nu-
meric coding of qualitative inputs, or interactions between variables, for example
x_3 = x_1 · x_2. The first weight in the sum (β_0) is called the intercept and is not
multiplied with a variable. No matter the source of x_j, the model is linear in the
parameters [83]. In this thesis, learned/estimated parameters are denoted by a
hat (such as β̂_j), while unknown parameters are denoted without a hat (such as
β_j).
Due to the linear relationship between the input features and the output
in a linear regression model, the weights β provide a straightforward way to
interpret the model (for visualisation, weights may be plotted, see Figure 3.3a).
The interpretation of a particular weight in the model depends on the type of
corresponding feature.

• Numerical feature: Proportional increase – the estimated outcome increases


proportionally to the increase in the feature value with the weight being the
proportionality factor.

• Binary feature: A feature that takes one of two possible values for each
instance. It is represented by the numerical value 1 if the feature is present and
by 0 otherwise. It changes the estimated outcome by the feature's weight when its
value is 1.

• Categorical feature with multiple values: This type of feature is typically


encoded as multiple binary features, with each category being one binary
feature. The interpretation for each category is the same as the interpretation
of binary features.

• Intercept β 0 : The interpretation of the intercept is meaningful when the


features have been standardised (zero mean and unit standard deviation).
Then the intercept reflects the predicted outcome of an instance where all
features are at their mean value.
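As a brief illustration of this kind of weight-based interpretation (a sketch using scikit-learn on synthetic data, not connected to the experiments in this thesis):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    # Synthetic, standardised features with known true weights 2.0, -1.0, and 0.1.
    X = rng.normal(size=(200, 3))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.5, size=200)

    model = LinearRegression().fit(X, y)

    # Each learned coefficient is the change in the prediction per unit increase of the
    # corresponding feature; the intercept is the prediction when all (standardised)
    # features are at their mean value.
    for name, beta in zip(["feature A", "feature B", "feature C"], model.coef_):
        print(f"{name}: weight = {beta:+.3f}")
    print(f"intercept: {model.intercept_:+.3f}")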

3.2.1 Feature Importance

A useful metric to determine feature importance is the absolute value of the


t-statistic. The t-statistic is the estimated weight scaled with its standard error.

    t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}    (3.2)

Standard error is the estimate of the standard deviation of the estimated
weight β̂_j. It can be thought of as a measure of the precision with which the
regression coefficient is estimated, with a smaller standard error meaning a more
precise estimate. Intuitively, the higher an estimated weight β̂_j is, the more
important it is; however, if the estimate of β̂_j has high variability, this dilutes its
importance.
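For example (a sketch using statsmodels on the same kind of synthetic data as above; not code from this thesis), the t-statistics of Equation (3.2) can be read directly from an ordinary least squares fit:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(scale=0.5, size=200)

    results = sm.OLS(y, sm.add_constant(X)).fit()
    # results.params are the estimated weights, results.bse their standard errors,
    # and results.tvalues their ratio, i.e. the t-statistics of Equation (3.2).
    print(results.tvalues)  # larger |t| -> more important feature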

3.2.2 Effect Plot

The estimated weights depend on the scale of the features and thus analysing the
weights alone may not always be a meaningful approach for feature importance.
To get around the issue of differing scales, we can calculate the effect a feature has
on the output, which is simply the value of the feature multiplied by its weight:

    \mathrm{effect}_j^{(i)} = \hat{\beta}_j \, x_j^{(i)}    (3.3)

where x_j^{(i)} is the value of feature j for instance i and β̂_j is the weight estimate
of the feature. The feature effects for an entire dataset are plotted as boxplots to
depict the amount and range of effect each feature has on the output, as shown
in Figure 3.3b. Note that weight and effect may have opposite signs.
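A minimal sketch of this computation (assuming the fitted scikit-learn `model` and feature matrix `X` from the earlier sketch):

    import matplotlib.pyplot as plt

    # Effect of feature j on instance i = weight_j * x_j^(i), as in Equation (3.3).
    effects = X * model.coef_              # shape: (n_instances, n_features)

    plt.boxplot(effects)                   # one box per feature
    plt.xticks([1, 2, 3], ["feature A", "feature B", "feature C"])
    plt.ylabel("effect on prediction")
    plt.show()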

3.3 interpreting black-box models using lime


LIME (Local Interpretable Model-agnostic Explanations), first introduced in [155],
is a technique that produces local explanations for any model. It does this by
fitting a surrogate interpretable model (such as a linear regression model) on
perturbed versions of the sample under test. Given an input sample x ∈ R^d,
the method first produces an interpretable representation x′ ∈ R^{d′} of the input,
where d is the number of features in the original input and d′ is the number of
interpretable components derived from the original (non-interpretable) input features.
The perturbations are then generated by switching on and off the interpretable
components. This allows us to assign feature importance values to these inter-
pretable components using the surrogate model. If the surrogate model is a
sufficiently close local approximation to the original model, then the generated
explanations are assumed to be valid. We will describe more evaluation metrics
for explanations in Section 3.4. For an image as an input, these interpretable

(a) Weight plot    (b) Effects plot

Figure 3.3: Visualising weights of a linear model corresponding to three features and
their effects on the output. Effects are calculated by multiplying feature values
by weight for all instances in a dataset. Note that weight and effect may have
opposite signs.

components are typically superpixels, aggregated by image segmentation algo-


rithms. The idea is that while the original input features (individual pixels) are
not interpretable, because they are far too low-level, entire image segments may
carry an intuitive meaning for a human user. This method is explained in detail
in Chapter 5 (Section 5.3).
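To make the procedure concrete, the following is a schematic sketch of LIME's core loop for a generic black-box regressor (all names here are hypothetical; the predictor f is assumed to map a batch of inputs to a vector of predictions, the proximity kernel and the zero baseline are simplifications, and the variant actually used in this thesis is detailed in Chapter 5):

    import numpy as np
    from sklearn.linear_model import Ridge

    def lime_explain(f, x, components, n_samples=1000, kernel_width=0.25):
        # Local surrogate explanation of f's prediction at x (schematic LIME loop).
        # `components` is a list of index arrays: components[k] holds the indices of
        # the raw input features that form interpretable component k (e.g. a segment).
        d_prime = len(components)
        # 1) Perturb: random binary masks over the interpretable components.
        Z = np.random.randint(0, 2, size=(n_samples, d_prime))
        Z[0, :] = 1                                    # keep the unperturbed instance
        # 2) Map each mask back to input space: switched-off components -> 0 baseline.
        X_pert = np.tile(x, (n_samples, 1)).astype(float)
        for k, idx in enumerate(components):
            off_rows = np.where(Z[:, k] == 0)[0]
            X_pert[np.ix_(off_rows, idx)] = 0.0
        # 3) Query the black-box model on the perturbed samples.
        y_pert = f(X_pert)
        # 4) Weight samples by proximity of their mask to the all-ones mask.
        dist = np.sqrt((1 - Z).sum(axis=1) / d_prime)
        proximity = np.exp(-(dist ** 2) / kernel_width ** 2)
        # 5) Fit a weighted linear surrogate on the binary masks.
        surrogate = Ridge(alpha=1.0).fit(Z, y_pert, sample_weight=proximity)
        return surrogate.coef_                         # importance per component

The surrogate's coefficients then serve as importance scores for the interpretable components.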

3.4 evaluation of feature-based explanations


The explainability tasks relevant for this thesis are feature-based explanations:
methods that aim to delineate how much each input feature contributes to a
model’s output for a given data point. In this section, we discuss three quantitative
desiderata proposed by Bhatt et al. [27] for evaluating the quality of feature-
based explanations. These desiderata – low sensitivity, high faithfulness, and low
complexity – can be thought of as ways to quantify some of the properties of
explanations described in Section 3.1.1 above.

3.4.1 Low Sensitivity

This is analogous to the high stability property. If inputs are close to each other
and their model outputs are similar, then their explanations should be close

to each other as well. An explanation function g for a model f is expected to


have low sensitivity in the region around a point of interest x, implying local
smoothness of g. We can calculate average sensitivity as follows. (In the following,
the model inputs and the explanations are vectors in R^d.)
Given a predictor model f, an explanation function g, a distance metric over
explanations D : R^d × R^d → R^+, a distance metric over inputs ρ : R^d × R^d → R^+,
a radius r, and a distribution P_x(·) over the inputs centred at point x, the average
sensitivity μ_S of g at x is defined as:

    \mu_S(f, g, r, x) = \int_{z \in N_r} D\big(g(f, x), g(f, z)\big) \, P_x(z) \, dz    (3.4)

where N_r is the neighbourhood with radius r around x.

In practice, this quantity can be estimated from a given training set by sampling
a neighbourhood of samples from the dataset around the sample that we want
to explain and computing explanations for each sample. The average pairwise
distance between the explanations in the vector space R^d then gives us the sensitivity.
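A rough sketch of such an estimate (hypothetical helper functions; `explain(f, x)` is assumed to return an explanation vector in R^d, and `neighbours` to be samples drawn from the neighbourhood N_r of x):

    import numpy as np

    def avg_sensitivity(f, explain, x, neighbours):
        # Estimate of Eq. (3.4): average distance between the explanation of x and
        # the explanations of nearby samples, with Euclidean distance as D.
        e_x = explain(f, x)
        dists = [np.linalg.norm(e_x - explain(f, z)) for z in neighbours]
        return float(np.mean(dists))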

3.4.2 High Faithfulness

Faithfulness (or fidelity) measures how closely the explanation function g reflects
the model being explained. One way to do this, as given in Bhatt et al. [27], is by
measuring the correlation between the sum of attributions (or importance values
assigned to features) of a feature subset x_T of an input sample x and the difference in
the output of f when these features have been set to a reference baseline. For a
subset of indices T ⊆ {1, 2, . . . , d}, x_T = {x_i : i ∈ T} denotes a sub-vector of input
features that partitions the input, x = x_T ∪ x_C.
x_{[x_T = x̄_T]} denotes an input where x_T is set to a reference baseline while x_C remains
unchanged:

    \mu_F(f, g, x) = \mathrm{corr}\left( \sum_{i \in T} g(f, x)_i \,,\; f(x) - f(x_{[x_T = \bar{x}_T]}) \right)    (3.5)

Another way, as described in [155], is to compute the coefficient of determina-


tion (denoted by R2 ) between the predictions of local model g (which in this case
is typically a simpler, interpretable surrogate model that approximates locally the
global model f ) and the global model f on N samples:

    \mu_F(f, g, x) = R^2 = 1 - \frac{\sum_{j=1}^{N} \left( f(x)_j - g(x)_j \right)^2}{\sum_{j=1}^{N} \left( f(x)_j - \bar{f}(x) \right)^2}    (3.6)
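As a quick sketch (assuming `f_preds` and `g_preds` hold the global model's and the surrogate's predictions on the same N samples), the R² form of faithfulness in Equation (3.6) is simply:

    import numpy as np

    def faithfulness_r2(f_preds, g_preds):
        # Coefficient of determination between black-box and surrogate predictions.
        f_preds, g_preds = np.asarray(f_preds), np.asarray(g_preds)
        ss_res = np.sum((f_preds - g_preds) ** 2)
        ss_tot = np.sum((f_preds - f_preds.mean()) ** 2)
        return 1.0 - ss_res / ss_tot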

3.4.3 Low Complexity

An explanation that uses too many features to explain a model’s output may be
difficult for a human user to understand. It is thus desirable to obtain an “efficient”

explanation – one that provides maximal information about the prediction using
a minimal number of features. This often leads to a trade-off with fidelity, as using
all features to explain a prediction may be faithful to the model, but it may be too
complex for a user to understand. Bhatt et al. [27] define complexity using the
fractional contribution distribution:

    P_g(i) = \frac{|g(f, x)_i|}{\sum_{j \in [d]} |g(f, x)_j|} \,; \quad P_g = \{P_g(1), \ldots, P_g(d)\}    (3.7)

where P_g(i) is the fractional contribution of feature i to the output of f on an
input x as computed by g, and |g(f, x)_i| is the absolute value of the attribution
score (weight) given by g to feature i for the input x to model f.
If every feature had equal attribution, the explanation would be complex. The
simplest explanation would be concentrated on one feature. The complexity is
then defined as the entropy of P_g:

    \mu_C(f, g, x) = \mathbb{E}_i\left[ -\ln(P_g) \right] = -\sum_{i=1}^{d} P_g(i) \ln\!\left(P_g(i)\right)    (3.8)
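A direct translation of Equations (3.7) and (3.8) into code (a sketch; `attributions` is assumed to be the attribution vector g(f, x)):

    import numpy as np

    def explanation_complexity(attributions, eps=1e-12):
        # Entropy of the fractional-contribution distribution, Eqs. (3.7) and (3.8).
        a = np.abs(np.asarray(attributions, dtype=float))
        p = a / (a.sum() + eps)               # fractional contributions P_g(i)
        p = np.clip(p, eps, 1.0)              # avoid log(0) for zero attributions
        return float(-(p * np.log(p)).sum())  # lower entropy -> simpler explanation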

The criteria discussed above will play a role in Chapter 5 of this thesis, where
we will use these to evaluate explanations derived from LIME.

3.5 explainability in music information retrieval


Explainability has become an important aspect of the modern artificial intelligence
landscape, which encompasses real-world performance metrics, fairness, and
trustworthiness as additional desiderata. Artificial intelligence applied to musical
tasks is no different. While the immediate need for explainability in music may not
be as pressing as it is in more critical applications such as medicine or automotive
driving, there are potential applications and interesting research directions of
explainability worth pursuing in the field of music information retrieval. For
instance, one of the major current successes of this field is music and playlist
recommendations, powering the largest music streaming platforms handling
tens of millions of songs. There is a strong case to be made for explainability in
recommender systems as laid out in Afchar et al. [6], which includes addressing
unexpected or inappropriate recommendations ultimately leading to a positive
impact on user trust and forgiveness. An example of what such an explainable
recommender system may look like is visualised in Figure 3.4.
An issue in the music industry that often goes unnoticed is ethics. Mitigating
the harmful effects of algorithmic bias and fostering an ethically sound and
culturally sensitive atmosphere for people involved in production, distribution,
management, and ultimately consumption of music with the help of technol-
ogy is an important objective. Explainability is an important tool in the toolbox
of methods attempting to make progress in this direction [91]. Other fascinat-
ing applications of explainability include controllable AI music generation and
human-computer co-creation [38, 181], and detecting adversarial examples [153].

"We have built this playlist of recommendations just for you, because ..."
Feature-based explanation: "... it's based on 70's psychedelic rock music"
Example-based explanation: "... it's based on Pink Floyd, The Doors, and Tangerine Dream"
Knowledge-graph-based explanation: "... you love the band 13th Floor Elevators that pioneered
psychedelic rock in the 60's and we thought its continuation in the 70's may interest you"
Figure 3.4: Examples of possible explanations for a music recommender system. Image
recreated from Afchar et al. [6].

The task of modelling musical emotion using machine learning has been on the
horizon of music information retrieval since its early days [176] but has received
major renewed interest in recent years owing to the development of end-to-end
deep learning models that ingest unstructured data such as audio spectrograms
and waveforms and learn to predict the given high-level musical attribute directly
[58, 79, 139]. This leads to an open problem of understanding what relations such
a model is learning between such unstructured input data and abstract output
concepts. In addition to the epistemic motivation of understanding these relations
in a musical sense, the need to ensure that such predictions can be trusted when
used in a downstream task prompts one to look within the black-box model using
techniques of explainability.
Some of the most interesting scientific questions lie at the intersection of disci-
plines. One such question, straddling the fields of perception, psychology,
musicology, and computer science, relates to the disentanglement of the effects
of the performer from the effects of the underlying musical composition on per-
ceived emotion and musical expression [8, 68]. Can explainable machine learning
shed some light on the pursuit of this elusive idea that defies traditional analytic
methods and machine learning models? Let us take the first step in this direction
by modelling musical audio data through the lens of perceptually motivated
features in the following chapter.
Part II

MAIN WORK OF THIS THESIS


4 PERCEIVE: PREDICTING AND EXPLAINING MUSIC EMOTION USING MID-LEVEL PERCEPTUAL FEATURES

4.1 A Hierarchical View of Music Perception   34
4.2 The Mid-level Bottleneck Architecture   39
4.3 Data: Mid-level and Emotion Ratings   42
4.4 Model Training and Evaluation   46
4.5 Obtaining Explanations   51
4.6 Discussion and Conclusion   54

How are emotions communicated through music?


This tantalising question has intrigued researchers whenever they have attempted
to describe music and its connection to the human psyche. Juslin and Västfjäll
[100] proposed six mechanisms through which music listening may induce emo-
tions, which include factors such as evaluative conditioning (association of music
with emotion due to previous repeated pairing of the music with other stim-
uli), visual imagery and episodic memory, in addition to musical expectancy.
While some of these factors are highly personal (memories) or culturally learned
(evaluative conditioning), there is significant evidence of the efficacy of emotion
communication regardless of cultural differences, mediated through features in
the music itself. Balkwill et al. [17] conducted studies of cross-cultural emotion
recognition where Japanese listeners rated the expression of joy, anger and sad-
ness in Japanese, Western, and Hindustani music. They also collected ratings
of acoustic cues that were then associated with the recognised emotion. The
listeners were found to be sensitive to the intended emotion in music from all
three cultures and the ratings of the cues were found to be consistent for each
emotion. High ratings of joy were associated with music judged to be fast in
tempo and melodically simple. High ratings of sadness were associated with
music judged to be slow in tempo and melodically complex. High ratings of
anger were associated with music judged to be louder and more complex. Thus,
the perception of acoustic cues seems to transcend cultural boundaries.
Responses from people when asked about what they believed to have ‘caused’
their emotions associated with music can reveal the importance of various factors
in mediating emotion. Juslin et al. [99] reported that a majority of such responses
(45%) can be grouped under musical factors, e.g. ‘a good song’, ‘the singing


voice’, ‘a fast tempo’, ’the melody’, ’the excellent performance’. Other factors
included situational factors (27%) e.g. ‘the weather’, memory factors (24%) e.g.
‘nostalgic recognition’, lyrics (10%), and pre-existing mood (9%). Since factors
such as a listener’s pre-existing mood and memory are not accessible to computer
systems, music emotion recognition systems typically use acoustic features that
are extracted from audio recordings using signal processing methods. Looking
at the listener responses for musical factors again, we note that the responses
relate to intuitive musical features such as ’the singing voice’, and ’the melody’,
which may not have obvious analytical definitions and thus are not accurately
represented (or extracted) using traditional signal processing methods. How, then,
could such human descriptions of musical elements be approximated using a
computer?
In this chapter, we approach music emotion modelling using mid-level perceptual
features. We propose a method to model perceived music emotion from audio
recordings using these features in a way that also provides explanations for the
emotion predictions. Mid-level features are qualities (such as rhythmic complexity,
or perceived major/minor harmonic character) that are musically meaningful
and intuitively recognisable by most listeners, without requiring music-theoretic
knowledge. It has been shown previously that there is considerable consistency in
human perception of these features, that they can be predicted relatively well from
audio recordings, and that they also relate to the perceived emotional qualities of
the music [9]. To incorporate interpretability into a deep learning music emotion
model using these features, we propose a bottleneck architecture, which first
predicts the mid-level features from audio, and consequently predicts the emotion
from these mid-level features using a linear regression model. Interpretability
is introduced in this scheme due to two factors: 1) a small number of musically
meaningful, perceptually motivated features as explanatory variables, and 2) the
linear regression part of the model, which is by construction an interpretable
model (see Section 3.2).
This chapter is organised as follows. We first look at music from a perceptual
standpoint in Section 4.1, where we discuss some of the previous research that
has gone into perceptual features for music. Next, in Section 4.2, we describe our
proposed bottleneck architecture and the three different schemes of training such
a model. Following this, in Section 4.3 we explore the datasets that we would
be using to train the explainable emotion model. We use the Mid-level Features
Dataset [9], and the Soundtracks dataset [54]. In Section 4.4, we describe the
training process in detail including performance metrics and our results. Finally,
in Section 4.5, we look at generating model-level and song-level explanations of
emotion predictions using weight plots and effects plots.
This chapter is broadly based on the following publication:

• S. Chowdhury, A. Vall, V. Haunschmid, G. Widmer


Towards Explainable Music Emotion Recognition: The Route via Mid-level
Features, In Proc. of the 20th International Society for Music Information Retrieval
Conference (ISMIR 2019), Delft, The Netherlands

4.1 a hierarchical view of music perception


Human auditory perception is multi-level, in the sense that signals from our ears
are transformed and processed in the brain across multiple levels of abstractions.
When we are listening to music, we hear the harmonics and recognise the timbre
of a sound, we hear notes, we hear how those notes are arranged to form a melody,
how that melody fits into the harmonic scheme, and how that harmony shifts over
time. We hear and contextualise events as they unfold through time, perceiving
some sort of rhythm [82]. The same applies to speech perception – the smallest
auditory units, like formants and phonemes, are transformed and processed, resulting
in the perception of pitch, inflection, and structure, which ultimately leads to
the recognition of words and meaning [94]. As briefly suggested in the introduction
of this chapter, musical emotion is best understood as being mediated through
musical factors which lie at a middle-level in the hierarchy, such as features
relating to melody or rhythm (as opposed to, say, the harmonics of a note).
Motivated by the aim of explaining emotion predictions, we thus investigate these
mid-level features further in this section.

4.1.1 Low-level Features to Mid-level Features

Auditory perception is not only the passive reception of sensory signals, but it
is also shaped by learning, memory, expectation, and attention. Physical sonic
events are sensed by the ears, and these response signals from the inner ear
are transformed in the brain, thus manifesting as perception. In a musical con-
text, acoustic events are sensed and perceived as musical factors; for instance a
regularly-timed pulse train is perceived as having a certain rhythm [82].
The separation between the physical world of auditory events and the per-
ceptual experience of musical factors, as noted above, has been recognised in
music information retrieval (MIR) research as well. Traditionally, MIR applica-
tions have relied on extracting information from audio (usually in digital format)
using signal processing methods to describe the audio and music content at
the lowest (close-to-signal) level of representation. Features capturing this infor-
mation include time-domain features like amplitude envelope, energy content,
zero-crossing rate, frequency-domain features like spectral flux, spectral centroid,
mel-frequency cepstral coefficients (MFCCs), and the statistical properties of
these features across time. These features were typically engineered through
trial-and-error or intuition and have little relevance to how sound and music are
perceived by humans. This discrepancy has been referred to in the literature as
the semantic gap: the gap between these low-level descriptors and the auditory
concepts that listeners use to relate and interact with music [39]. Figure 4.1 depicts
the different levels of features starting from low-level features at the bottom to
semantic descriptors on the top.
This led to attempts to come up with better features more relevant in the
musical context. Some examples include using spectral envelopes to identify
timbral similarities [16, 121] and capturing useful features from the rhythm of a
song by using periodicity histograms [141] and temporal sequences [179]. More

Semantic description: emotional qualities, emotion, genre, preference, understanding
(shaped by context, memory, and experience)
Perceptual features: rhythm, mode, speed, harmony, meter, melody, accentuation, pulse, articulation
Low-level features: intensity, loudness, pitch, note onsets, frequency spectrum, percussive events
Figure 4.1: A hierarchy of features, roughly depicting the experience of musical audition
in humans, adapted from Fig. 1 of Friberg et al. [64]. From an auditory signal,
we sense low-level features like pitch and intensity. These are then organ-
ised and interpreted by the brain into what we may refer to as “perceptual
features”, and subsequently processed into higher level aspects like emotion
and understanding, where context, memory, and experience also play a role.
This idea of multi-step processing of higher level aspects has been outlined
previously in Gabrielsson and Lindström [70].

recently, Panda et al. [144] published a comprehensive survey of features relevant


for music emotion recognition, based on relations between musical dimensions
and emotions. They looked at a total of eighty-five features ranging from pitch
range, key and key strength, to tempo, rhythmic fluctuation, onset and onset
density, loudness, and attack/decay times. These features were categorised under
melody, harmony, rhythm, dynamics, tone colour, expressivity, texture, and form.
The authors emphasised the relevance of higher-level features, and in fact some
of the novel features they proposed are computed with the aim of representing
higher-level concepts [145].
However, their methodology for extracting these features is analytical, and
based on clearly stated definitions of the features. They first convert the audio
signal into MIDI representation using automatic transcription algorithms, and
then the MIDI information (such as note length and pitch) are used to compute
features like melodic direction, note duration, articulation, and glissando. As a
result, the accuracy of this method relies on the performance of the transcription
algorithm, which restricts this approach to the kinds of music that can actually
be transcribed and that fit into the Western tonal system of music (unlike genres
like experimental noise music, hip-hop, or punk rock, which are not particularly
suited for transcription).
Some researchers have taken a different approach to modelling such musically
relevant features. Friberg et al. [64] proposed using features that describe the
overall properties of music instead of using concepts from music theory such as
tones, pitches, and chords. These perceptual features (see Figure 4.1) were analysed
by obtaining ratings of them from human raters in response to music they listened
to, and validated by measuring their predictive power to explain variation in
music emotion dimensions of energy and valence. Let us take a deeper look at
this work.

4.1.2 Friberg’s Perceptual Features

Friberg et al. [64] introduced a set of perceptually-motivated features, selected due


to their relevance in emotion research and by using the ecological perspective (the
hypothesis that humans try to understand the world from sound, i.e. understand
the source properties of the sounds rather than only considering the specific
timbre quality [65]). The features represent the most important and general
aspects in each of four main variables in music: time (tempo, rhythm, articulation),
amplitude (dynamics), pitch (melodic and harmonic aspects), and timbre. The
features studied by Friberg are1 :

1. Speed (slow – fast): Indicates the general speed of the music disregarding
any deeper analysis such as the tempo, and is easy for both musicians and
non-musicians to relate to [126].

2. Rhythmic Clarity (flowing – firm): Indicates how well the rhythm is accentu-
ated disregarding the actual rhythmic pattern. This would presumably be
similar to pulse clarity as modelled by Lartillot et al. [113].
1 Further details about these features can also be found in Friberg et al. [63].

3. Rhythmic Complexity (simple – complex): Relates to differences in rhythmic


patterns. For example, a rhythm pattern consisting of even eight notes would
have a low rhythmic complexity, while a syncopated eight-note pattern (as
in the salsa clave) might have a relatively high rhythmic complexity.

4. Articulation (staccato – legato). The overall articulation related to the duration


of tones and how transitions between them are connected. Articulation has
been verified in a number of studies as relevant for emotion communication
[70].

5. Dynamics (soft – loud): Indicates the played dynamic level disregarding


listening volume. It is presumably related to the estimated effort of the
player.

6. Modality (minor – major): Contrary to music theory, modality is rated here on


a continuous scale ranging from minor to major. This is to allow for modal
ambiguity and also to capture the assumption that not only the key but also
the prevalent chords (major or minor) influence the overall perception of
modality.

7. Overall Pitch (low – high): The simplest possible representation of melody,


i.e., its general pitch height.

8. Harmonic Complexity (simple – complex): A measure of how complex the


harmonic progression is. It might reflect, for example, the amount of chord
changes and deviations from a certain key scale structure. This is pre-
sumably a comparatively difficult feature to rate, that may demand some
musical training and knowledge of music theory.

9. Brightness or Timbre (dark – bright): A general timbre parameter, also known


as tone colour or tone quality. It is the perceived sound quality of a musical
note, sound, or tone. Brightness gives a rough indication of the amount of
high-frequency content in a sound.

Friberg conducted a listening experiment with musical stimuli consisting of a


set of ringtones and film music, with the participants rating each of the perceptual
features along a 9-step Likert scale. In a separate experiment, participants were
asked to rate emotions. For all the ratings, inter-rater agreement and correlations
were calculated. The inter-rater correlations were lower for the features that a
priori would be more difficult to rate, like harmonic complexity and rhythmic
complexity. Additionally, and somewhat unexpectedly, most of the feature ratings
showed modest correlation, and some, such as pitch and timbre showed a strong
correlation (correlation coefficient r = 0.90). The authors mention two probable
causes: covariation in the music examples, or listeners not being able to isolate
each feature as intended. Nevertheless, an interesting finding from these ratings
is that they appear to hold good predictive power for the emotion dimensions
of energy and valence. The perceptual features dynamics, speed, articulation, and
modality together explained 91% variance in energy, while modality, dynamics, and
harmonic complexity explained 78% variance in valence.

The fact that a handful of features are able to explain a significant amount of
variation in emotion is important for us from an interpretability perspective (recall
the property of comprehensibility from Section 3.1.1 and the complexity metric
from Section 3.4). But how accurately can these perceptual features be predicted
from audio content? Friberg experimented predicting the perceptual feature
ratings with several low-level features extracted from the audio using tools such
as MIRToolbox [114]. These features included some of the features mentioned
earlier – zero crossing rate, MFCCs (Mel-Frequency Cepstral Coefficients), spectral
centroid, spectral flux, RMS, event density, silence ratio, pulse clarity, etc. However,
it was found that these features are not able to model the perceptual features
nearly as well as desired. In the best cases, about 70% of the variation in a perceptual feature
was explained by low-level features. This motivates the development of models
targeted specifically toward each perceptual feature.
The concept of perceptual features for emotional expression in music had been
discussed prior to Friberg's work. Juslin's lens model [96] considers proximal cues as a
medium of emotion transmission, which are similar to the perceptual features
of Friberg. Gabrielsson and Lindström [70] also described the idea of multi-level
processing of emotional expression in music. However, Friberg's work is one of
the first to use a direct determination of these perceptual features using ratings
obtained from human participants.

4.1.3 Models for Perceptual Features

Noting that perceptual features are not predicted sufficiently well from audio
using low-level features, we turn to deep end-to-end models. These models are
known to learn relevant features from data. The current state of the art in audio
machine learning involves time-frequency representations of audio (spectrograms)
as image-like inputs to convolutional neural networks. We choose to use models

based on CNNs (convolutional neural networks) to learn to extract perceptual


features from audio using data. In Aljanaki and Soleymani [9], a large convolu-
tional neural network, Inception v3 [169], was used to learn an audio to mid-level
perceptual feature model using a dataset also released by the authors of that
paper. We discuss the dataset (the Mid-level Features Dataset) and results from
that paper later on in Section 4.3 and Section 4.4. A summary of all datasets used
in this thesis is provided in Appendix a.
Recall that our main goal is to predict emotion from audio while using percep-
tual features as explanatory variables. In order to do this, we propose a bottleneck
model that learns to extract perceptual features first, and then uses these features
to predict the final emotion values. We describe this architecture in the following
section.

4.2 the mid-level bottleneck architecture


Our basic requirements for constructing an explainable model of emotion are 1)
the independent variables should be musically meaningful and 2) their effects
on the emotion prediction should be interpretable. We saw in the previous
section that perceptual features form a good basis for modelling emotion. Thus,
it makes sense to use the perceptual features (or “mid-level features”2 ) as the
set of independent explanatory variables. Since these features are not predicted
sufficiently well from low-level analytical features, we propose3 an architecture
that allows us to learn mid-level features using a deep CNN and then use the
output of this CNN as the basis for predicting emotion. This takes the form of
a bottleneck model, with the output of the CNN being the bottleneck layer that
outputs mid-level feature values. See Figure 4.2 for a visual representation.

4.2.1 Architecture Definition

Consider a two-part model f̂ ∘ ĝ, where ĝ : R^d → R^k maps an input x ∈ R^d to the
k-dimensional mid-level space, and f̂ : R^k → R^p maps the mid-level features to
the final p emotion outputs. In our case, d is the total number of pixels in an input
spectrogram image, k = 7 as seven mid-level features are available in our training
dataset obtained from [9], and p = 8 as labels for eight emotions are available in
our training dataset obtained from [54]. As is evident, the final emotion outputs

2 The terms “mid-level features” and “perceptual features” have been used interchangeably here.
Going forward, we will just use the term “mid-level features”. In and after Section 4.3, where seven
specific mid-level features are introduced, “mid-level features” will refer particularly to those seven
features, unless mentioned otherwise.
3 Our model architecture along with experimental results were first published in Chowdhury et al.
[45]. The following year, Koh et al. [107] published a paper formally defining “Concept Bottleneck
Models” in the computer vision domain. While the basic idea of the architectures and the training
schemes are essentially identical in both, in this thesis we adopt the naming convention for the
training schemes (Section 4.2.2) from Koh et al. [107], since their names for the training schemes
are more general. One important difference between our architecture and theirs is that in our case,
we assert for the mapping between the mid-level (or concept) layer and the final output layer to be
linear, in order to have this mapping fully interpretable.

Figure 4.2: The mid-level bottleneck model learns to map inputs to mid-level features
on an intermediate layer, and subsequently predicts the final emotion values
using these mid-level feature values. The connection between the mid-level
layer and the emotion output is linear, lending interpretability in terms of
learnt weights.

are derived entirely from the mid-level feature layer, thus relying completely on
the information passing through this bottleneck. In our implementations, the
first part of the model, ĝ, consists of a convolutional (feature-extractor) part, and
an adaptive pooling and linear mapping part denoted as φ̂. The purpose of φ̂
is to map the features extracted by the convolutional model (which may have
a dimensionality not equal to k) to the k-dimensional mid-level space. Further,
we choose f to be a linear model as well, allowing the mid-level-to-emotion
part of the model to be completely transparent. We shall see in Section 4.5 how
explanations can be derived using this linear architecture.
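A minimal PyTorch sketch of this architecture (hypothetical layer sizes; the convolutional part is only a stub here, and the backbones actually used are described in Section 4.2.3):

    import torch.nn as nn

    class MidLevelBottleneckModel(nn.Module):
        # Audio (spectrogram) -> mid-level features -> emotions, with a linear
        # mid-level-to-emotion head so that the final mapping stays interpretable.

        def __init__(self, n_midlevel: int = 7, n_emotions: int = 8):
            super().__init__()
            # g: convolutional feature extractor (stub) + pooling + linear map (phi)
            self.backbone = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2),
                nn.BatchNorm2d(64), nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, padding=1),
                nn.BatchNorm2d(128), nn.ReLU(),
            )
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.to_midlevel = nn.Linear(128, n_midlevel)        # phi: features -> R^k
            # f: linear (fully transparent) map from mid-level features to emotions
            self.to_emotion = nn.Linear(n_midlevel, n_emotions)  # f: R^k -> R^p

        def forward(self, spectrogram):                          # (batch, 1, freq, time)
            h = self.pool(self.backbone(spectrogram)).flatten(1)
            midlevel = self.to_midlevel(h)                       # bottleneck outputs
            emotion = self.to_emotion(midlevel)
            return midlevel, emotion

The weight matrix of `to_emotion` is what later allows emotion predictions to be read as linear combinations of the mid-level features (Section 4.5).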

4.2.2 Three Distinct Operating Schemes

Assuming that we have labelled data for the mid-level and the emotion task, and
that both of these are regression problems, the bottleneck architecture described
above can be trained in three different ways. We have the training data points
{(x(i) , y(i) , m(i) )}in=1 , where m ∈ Rk is a vector of k mid-level features. Let LM :
Rk × Rk 7→ R+ be a loss function that measures the discrepancy between the
predicted and true mid-level feature vectors, and let L E : R p × R p 7→ R+ measure
the discrepancy between predicted and true emotion vectors. Then, we have the
following ways to learn the bottleneck model4 ( fˆ, ĝ):

1. The Independent Scheme: The model components f and g are trained inde-
pendently using the respective ground-truth data: f̂ = arg min_f L_E(f(m), y),
and ĝ = arg min_g L_M(g(x), m). Note that during training time, f is trained
with the ground truth values of mid-level features, while during test time,
it takes ĝ(x) as the input.

2. The Sequential Scheme: The model components f and g are trained se-
quentially. While ĝ is learned in the same way as above, fˆ is learned by

4 A model symbol with a hat, such as fˆ, denotes a trained model



Figure 4.3: Three schemes for training the bottleneck model. In the independent scheme,
the two model parts are trained independently from their respective datasets.
In the sequential scheme, f takes the outputs of trained ĝ as its inputs. In the
joint scheme, the entire model is trained as a whole by combining the loss
signals from both outputs.

training on ĝ(x), the mid-level predictions: f̂ = arg min_f L_E(f(ĝ(x)), y).
During test time, f̂ takes ĝ(x) as the input, as before.

3. The Joint Scheme: The model components f and g are trained jointly,
similar to a multi-task optimisation process. The losses of the two tasks
are added (with their relative weights controlled by a parameter λ) and
the entire model is optimised together: (f̂, ĝ) = arg min_{f,g} [λ L_M(g(x), m) +
L_E(f(ĝ(x)), y)].

These training schemes are visualised in Figure 4.3. We refer to a bottleneck


model with the notation “A2Mid2E”, signifying that the direction of flow of infor-
mation is Audio → Mid-level → Emotion. The training scheme is specified with
a subscript: A2Mid2E_ind, A2Mid2E_seq, and A2Mid2E_joint. We also train a baseline
standard model, which ignores the mid-level features and learns to predict the
emotions directly from the spectrograms: (f̂, ĝ) = arg min_{f,g} L_E(f(g(x)), y). This
model is referred to as “A2E”, since the emotions are predicted directly from the
audio input: Audio → Emotion.
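A sketch of how the joint scheme's combined objective could be computed in a single training step (assuming the `MidLevelBottleneckModel` sketch from Section 4.2.1 above and mean squared error for both L_M and L_E; not the exact training code of this thesis):

    import torch.nn.functional as F

    def joint_loss(model, spectrogram, midlevel_target, emotion_target, lam=1.0):
        # Combined loss of the joint scheme: lam * L_M(g(x), m) + L_E(f(g(x)), y).
        midlevel_pred, emotion_pred = model(spectrogram)
        loss_m = F.mse_loss(midlevel_pred, midlevel_target)   # mid-level regression
        loss_e = F.mse_loss(emotion_pred, emotion_target)     # emotion regression
        return lam * loss_m + loss_e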

4.2.3 Constructing a Bottleneck Model

Any end-to-end neural network could be converted to a bottleneck model by


resizing the bottleneck layer to match the number of mid-level features. We start
off by using a VGG-ish model similar to the one in [51] as the backbone, which
we modify to improve performance and resize the penultimate layer to match
the number of mid-level features available to us (in our experiments, we use the
Mid-level Features Dataset [9], described later in Section 4.3).
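The following PyTorch sketch illustrates this construction: an arbitrary convolutional feature extractor is wrapped with the adaptive pooling and linear mapping φ̂ into the 7-dimensional mid-level space, followed by the linear emotion layer. The toy backbone at the end is only a stand-in for the VGG-ish and RF-ResNet feature extractors described below; class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class MidLevelBottleneck(nn.Module):
    """Sketch of an A2Mid2E bottleneck head on top of an arbitrary conv backbone."""
    def __init__(self, backbone: nn.Module, backbone_channels: int,
                 n_midlevel: int = 7, n_emotions: int = 8):
        super().__init__()
        self.backbone = backbone                              # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d((1, 1))              # phi-hat: pool conv features ...
        self.to_mid = nn.Linear(backbone_channels, n_midlevel)  # ... and map to the k-d mid-level space
        self.to_emotion = nn.Linear(n_midlevel, n_emotions)     # f: linear, hence interpretable

    def forward(self, spec: torch.Tensor):
        h = self.pool(self.backbone(spec)).flatten(1)
        mid = self.to_mid(h)                # mid-level predictions (the bottleneck)
        emo = self.to_emotion(mid)          # emotion predictions, derived only from the bottleneck
        return mid, emo

# Toy usage (the backbone here is a stand-in, not the actual VGG-ish / RF-ResNet):
toy_backbone = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())
model = MidLevelBottleneck(toy_backbone, backbone_channels=64)
mid, emo = model(torch.randn(2, 1, 469, 149))   # a batch of two spectrograms
```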
More recently, a receptive-field regularised residual neural network was shown
to work well for audio applications [108], which we also incorporate into our
experiments, and we refer to it as the RF-ResNet architecture.

Backbone Model: VGG-ish

While the Inception v3 model used by Aljanaki and Soleymani [9] gives decent
performance in modelling mid-level features, it relies on large-scale pre-training
with additional data followed by fine-tuning on the mid-level data. We chose to
instead use a smaller architecture as our backbone to avoid the pre-training step.
We start with the architecture used by Dorfer and Widmer [51] and modify the
layers until we see no further improvement in validation set performance. Effec-
tively, we find that making the model shallower but wider improves performance.
This architecture is what we refer to in this thesis as “VGG-ish”. The layers of this
architecture are shown in Table 4.1a.

Backbone Model: RF-ResNet

Residual neural networks [86] have been the architecture of choice in the computer
vision domain ever since their state-of-the-art performance in the Imagenet [48]
challenge. However, their use in the audio domain has been limited.
Koutini et al. [108] identified that if we reduce the receptive field5 of a typical
ResNet, it improves performance on audio tasks, even beating the VGG-ish
models. We use this architecture as our second backbone model and refer to
it as the Receptive-Field Regularised ResNet, or “RF-ResNet”. The layers of this
architecture are shown in Table 4.1b.

4.3 data: mid-level and emotion ratings

4.3.1 The Mid-level Features Dataset

Motivated by Friberg’s work, Aljanaki and Soleymani [9] proposed a data-driven


approach for mid-level feature modelling, where they approximated the percep-
tual concepts using large-scale crowd-sourced listener ratings. They released their
data in the Mid-level Perceptual Features Dataset6 , which we use extensively in this

5 The receptive field of a convolutional network is defined as the size of the region in the input that
produces the feature [12].
6 https://osf.io/5aupt/

(a) VGG-ish

Input (469 × 149)
Conv2D (k=5, s=2, p=2) [64], BatchNorm2D [64] + ReLU
Conv2D (k=3, s=1, p=1) [64], BatchNorm2D [64] + ReLU
MaxPool2D (k=2) + DropOut (0.2)
Conv2D (k=3, s=1, p=1) [128], BatchNorm2D [128] + ReLU
Conv2D (k=3, s=1, p=1) [128], BatchNorm2D [128] + ReLU
MaxPool2D (k=2) + DropOut (0.2)
Conv2D (k=3, s=1, p=1) [256], BatchNorm2D [256] + ReLU
Conv2D (k=3, s=1, p=1) [256], BatchNorm2D [256] + ReLU
Conv2D (k=3, s=1, p=1) [384], BatchNorm2D [384] + ReLU
Conv2D (k=3, s=1, p=1) [512], BatchNorm2D [512] + ReLU
Conv2D (k=3, s=1, p=1) [256], BatchNorm2D [256] + ReLU
Adaptive Avg Pool 2D (1,1)
Linear (256, 7) −→ (Mid-level Out)
Linear (7, 8) −→ (Emotion Out)

(b) RF-ResNet

Input (469 × 149)
Conv2D (k=5, s=2, p=1) [128], BatchNorm2D [128] + ReLU
Conv2D (k=3, s=1, p=1) [128], BatchNorm2D [128] + ReLU
Conv2D (k=1, s=1) [128], BatchNorm2D [128] + ReLU
MaxPool2D (k=2, s=2)
Conv2D (k=3, s=1, p=1) [128], BatchNorm2D [128] + ReLU
Conv2D (k=3, s=1) [128] ×2, BatchNorm2D [128] + ReLU
MaxPool2D (k=2, s=2)
Conv2D (k=3, s=1, p=1) [256], BatchNorm2D [256] + ReLU
Conv2D (k=3, s=1) [256], BatchNorm2D [256] + ReLU
Conv2D (k=1, s=1, p=1) [512], BatchNorm2D [512] + ReLU
Conv2D (k=1, s=1) [512], BatchNorm2D [512] + ReLU
Adaptive Avg Pool 2D (1,1)
Linear (512, 7) −→ (Mid-level Out)
Linear (7, 8) −→ (Emotion Out)

Table 4.1: Our two backbone model architectures. (a) shows the VGG-ish model and
(b) shows the receptive-field regularised residual neural network model (RF-
ResNet). The numbers in square brackets represent the number of channels
at the output of the corresponding layer. The residual blocks of the RF-ResNet
have direct identity connections between their input and output (not shown).
k: kernel size, s: stride, p: padding.

thesis. Thus, now would be a good opportunity to take a deeper look into this
dataset.

Definitions of the Mid-level Concepts

While Friberg et al. studied nine perceptual features, Aljanaki et al. use a reduced
set of seven (described in Table 4.2). They select their feature set from concepts
found recurring in literature [62, 69, 174].

perceptual feature — question asked to human raters

1. Melodiousness: To which excerpt do you feel like singing along?
2. Articulation: Which has more sounds with staccato articulation?
3. Rhythmic Stability: Imagine marching along with the music. Which is easier to march along with?
4. Rhythmic Complexity: Is it difficult to repeat by tapping? Is it difficult to find the meter? Does the rhythm have many layers?
5. Dissonance: Which excerpt has noisier timbre? Has more dissonant intervals (tritones, seconds, etc.)?
6. Tonal Stability: Where is it easier to determine the tonic and key? In which excerpt are there more modulations?
7. Modality (‘Minorness’): Imagine accompanying this song with chords. Which song would have more minor chords?

Table 4.2: Perceptual mid-level features as defined in [9], along with questions that were
provided to human raters to help them interpret the concepts. (The ratings
were collected in a pairwise comparison scenario.) In the following, we will
refer to the last one (Modality) as ‘Minorness’, to make the core of the concept
clearer.

Although the names of these concepts are derived from musicology, their use
in the current context is not restricted to their formal musicological definitions.
For instance, the concept of articulation is defined formally for a
single note (and can be extended to a group of notes); applying it to a
real-life recording with possibly several instruments and voices, however, is not an easy task.
To ensure a common understanding between the raters, a pairwise-comparison-based
strategy was adopted (described in Section 4.3.1), where participants were
asked to listen to two audio clips and rank them according to the questions listed
in Table 4.2. These questions also make up the “definitions” of the concepts as
used in the current context. The general principle is to consider the recording as
a whole.

Music Selection

The music (in the form of audio files) in the dataset comes from five sources:
Jamendo (www.jamendo.com), Magnatune (magnatune.com), the Soundtracks
dataset [54], the Bi-modal Music Emotion dataset [128], and the Multi-modal
Music Emotion dataset [143]. Overall, there was a restriction of no more than five
songs from the same artist. From each selected song, a 15-second clip from the


middle of the song was extracted to construct the set of audio samples to be rated
by the human participants.

Obtaining Ratings through Crowd-Sourcing

Comparing two items using a certain criterion is easier for humans than giving
a rating on an absolute scale [127]. Based on this, Aljanaki and Soleymani [9]
used pairwise comparisons (according to the seven mid-level descriptors listed
in Table 4.2) to get rankings for a small subset of the dataset, which was then
used to create an absolute scale on which the whole dataset was then annotated.
The annotators were required to have some musical education and were selected
based on passing a musical test. The ratings range from 1 to 10 and were scaled
by a factor of 0.1 before being used for our experiments. The distributions of the
annotations for the seven mid-level features are plotted in Figure 4.4.

Figure 4.4: Distribution of annotations in the Mid-level Features Dataset [9].

4.3.2 The Soundtracks Dataset

The Soundtracks7 (Stimulus Set 1) dataset, published by Eerola and Vuoskoski [54],
consists of 360 excerpts from 110 movie soundtracks. The excerpts come with
expert ratings for five emotions following the discrete emotion model (happy,
sad, tender, fearful, angry) and three emotions following the dimensional model
(valence, energy, tension). This makes it a suitable dataset for musically conveyed
emotions [54]. The ratings in the dataset range from 1 to 7.83 and were scaled
by a factor of 0.1 before being used for our experiments (see Figure 4.5 for the
distributions). All the songs in this set are also contained in the Mid-level Features
Dataset, and are annotated with the mid-level features, giving us both emotion
and mid-level ratings for these songs.

7 https://www.jyu.fi/hytk/fi/laitokset/mutku/en/research/projects2/past-projects/coe/
materials/emotion/soundtracks

(a) Dimensional emotions: valence, energy, tension.

(b) Categorical emotions: anger, fear, happy, sad, tender. The magnitudes reflect how intensely an emotion is perceived for a song.

Figure 4.5: Distribution of ratings in the Soundtracks dataset [54].

4.4 model training and evaluation


We are now ready to train our models. Here, we will first define our performance
metrics and goals, followed by a description of our audio preprocessing step.
Before we proceed to train the explainable bottleneck models, we will train
and evaluate our baseline models, which predict only the mid-level features,
or only the emotions. We compare our baselines to those obtained by Aljanaki
and Soleymani [9]. We then describe the training process of the three schemes –
independent, sequential, and joint.

4.4.1 Metrics and Performance Goals

In order to enable comparisons with Aljanaki and Soleymani [9], our main
performance metric is going to be the same as the one used in that paper: Pearson
correlation coefficient (r) between the predicted and actual values of mid-level
features or emotions, as may be the case. Additionally, we also measure the
coefficient of determination (R2 -score), and the root mean squared errors (RMSE).
The performance metrics are averaged over 8 runs and the mean values are
reported. We would like our models to maximise the average Pearson correlation
coefficient across all mid-level features and across all emotions. An important
requirement for our bottleneck models is that they must not hamper
the performance of the main task (emotion prediction), compared to a non-
bottleneck model. Moreover, the performance of mid-level feature prediction
by the bottleneck model should be comparable to that of a non-bottleneck
model, for the explanations to be valid. Thus, we require that the cost of explainability be
reasonably low for both these tasks. The Pearson correlation coefficient and the
Cost of Explainability (CoE) are defined below.

Pearson correlation coefficient

The Pearson correlation coefficient r measures the linear relationship between two
datasets. Given paired data {(x_1, y_1), . . . , (x_n, y_n)} consisting of n pairs, r(x, y) is
defined as:

r(x, y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( √(Σ_{i=1}^{n} (x_i − x̄)²) √(Σ_{i=1}^{n} (y_i − ȳ)²) )    (4.1)

where n is the sample size, x_i and y_i are the individual sample points indexed by i,
and x̄ = (1/n) Σ_{i=1}^{n} x_i (the sample mean); and analogously for ȳ.
It varies between −1 and +1 with 0 implying no correlation. Correlations of
−1 or +1 imply an exact linear relationship. Positive correlations imply that as x
increases, so does y. Negative correlations imply that as x increases, y decreases.
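Equation 4.1 corresponds directly to standard library routines such as scipy.stats.pearsonr; below is a small sketch with random stand-in data that computes per-feature r values and their average.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy stand-ins: ground truth and predictions for 7 mid-level features on 50 clips
rng = np.random.default_rng(0)
Y_true = rng.random((50, 7))
Y_pred = Y_true + 0.2 * rng.standard_normal((50, 7))

# Per-feature r (as in Equation 4.1), then averaged across features
r_per_feature = [pearsonr(Y_true[:, j], Y_pred[:, j])[0] for j in range(7)]
avg_r = float(np.mean(r_per_feature))
```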

Cost of Explainability (CoE)

Because we introduce a bottleneck (viz. the 7 mid-level predictions) as the input
to the subsequent linear layer predicting emotions, our hypothesis is that
doing so should result in a decrease in the performance of emotion prediction,
relative to an A2E model that predicts emotions directly. We calculate this cost as
the difference in performance metrics between the two models for each emotion.
More precisely, we subtract the metric for the bottleneck model from that of the
end-to-end A2E model:

CoE = µ_A2E − µ_A2Mid2E    (4.2)

where µ_A2E is the performance metric for an emotion as obtained using the A2E
model, and µ_A2Mid2E is the performance metric for the same emotion as obtained using
the A2Mid2E model. When the performance metric is the Pearson correlation
coefficient or the R²-score, a positive CoE indicates a reduction in performance
for that emotion caused by introducing the bottleneck, whereas a negative CoE
indicates an improvement.
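As a small worked example, the CoE of Equation 4.2 can be computed from per-emotion correlation values; the numbers below are taken from the A2E and jointly trained A2Mid2E rows of Table 4.5 (VGG-ish backbone), so small discrepancies with the CoE row reported there are due to rounding.

```python
import numpy as np

# Per-emotion Pearson r for the direct model (A2E) and a bottleneck model (A2Mid2E_joint),
# in the order: valence, energy, tension, anger, fear, happy, sad, tender
r_a2e     = np.array([0.81, 0.79, 0.84, 0.82, 0.81, 0.66, 0.60, 0.75])
r_a2mid2e = np.array([0.82, 0.78, 0.82, 0.76, 0.79, 0.65, 0.64, 0.72])

coe = r_a2e - r_a2mid2e          # Equation 4.2, per emotion
avg_coe = float(coe.mean())      # positive: performance lost; negative: performance gained
```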

4.4.2 Preprocessing

The Mid-level Dataset contains audio clips 15 seconds in length each. In order to
be fed into our convolution-based models, we convert these to magnitude spectro-
grams. Each audio snippet is resampled at 22.05 kHz, with a frame size of 2048
samples and a frame rate of 31.3 frames per second, and amplitude-normalised
before computing the logarithmic-scaled spectrogram. The spectrograms are com-
puted using the LogarithmicFilteredSpectrogramProcessor from the madmom
[29] package, with a frequency resolution of 24 bands per octave, resulting in 149
frequency bands. We experimented with different snippet lengths to be fed into
the model, and found that longer snippets perform better. Therefore, we use the
entire 15 seconds of available audio for each clip, resulting in spectrograms of
size 469 × 149.
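A minimal sketch of this preprocessing step is given below, assuming madmom's class-based interface; 'clip.wav' is a placeholder path, and the exact keyword handling may differ slightly between madmom versions.

```python
import numpy as np
from madmom.audio.spectrogram import LogarithmicFilteredSpectrogram

# 'clip.wav' is a placeholder path. Keyword arguments are forwarded through madmom's
# processing chain (Signal -> FramedSignal -> filtered log spectrogram).
spec = LogarithmicFilteredSpectrogram(
    'clip.wav',
    sample_rate=22050, num_channels=1, norm=True,   # resample, downmix, amplitude-normalise
    frame_size=2048, fps=31.3,                      # ~31.3 frames per second
    num_bands=24,                                   # 24 bands per octave -> 149 frequency bins
)
spec = np.asarray(spec, dtype=np.float32)           # approximately (469, 149) for a 15-second clip
```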


Figure 4.6: Training curve for mid-level baseline models. The RF-ResNet architecture
trains faster and gives a better performance than the VGG-ish model.

4.4.3 Model Training Parameters

We train our models using the Adam optimiser [105] with a learning rate of 0.001
and a batch size of 8. After hyperparameter optimisation, we arrive at optimal
values of β₁ = 0.73 and β₂ = 0.918 for the two moment parameters of the Adam
optimiser, which are different from the default values of 0.9 and 0.999, respectively,
typically recommended in the literature.
Using a learning rate scheduler on top of the Adam optimiser results in signifi-
cant performance gains. In our experiments, we found that cosine annealing with
warm restarts [122] gave the biggest performance improvement. We additionally
use early stopping on the R2 -score of the validation set to prevent over-fitting.
The Mid-level Dataset is split into train-validation-test sets as described in
Aljanaki and Soleymani [9], with 8% of the data held out as the test set, sharing no
artists with the train or validation sets. The validation set is constructed
from 2% of the data samples.
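A minimal PyTorch sketch of this training setup is shown below, with a toy stand-in model; the warm-restart period T_0 is an illustrative assumption (it is not specified here), and early stopping on the validation R²-score is omitted.

```python
import torch
import torch.nn as nn

# Toy stand-in model; in practice this is the A2Mid2E network.
model = nn.Linear(10, 8)

# Adam with the tuned moment parameters; T_0 below is an illustrative assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.73, 0.918))
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

for epoch in range(5):                                  # early stopping omitted in this sketch
    x, y = torch.randn(8, 10), torch.randn(8, 8)        # batch size 8, random stand-in data
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                    # cosine annealing with warm restarts
```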

4.4.4 Baselines

Aljanaki and Soleymani [9] demonstrated modelling mid-level features from


audio using an Inception v3 [168] model, which they pre-trained using a tag-
prediction task with data sourced from www.jamendo.com (no overlap with the
Mid-level dataset). After pre-training, they fine-tuned it on data from the Mid-
level dataset. The performance metrics of this experiment are used as a baseline
for mid-level feature modelling and are given in Table 4.3.
Aljanaki and Soleymani [9] also modelled the emotions in the Soundtracks
dataset with the annotated mid-level features for the songs in this dataset. This is
analogous to training the model f in our independent model training scheme. The
results of this experiment are given in Table 4.3.
We compute two baselines of our own. First, we use our backbone models
described in Section 4.2.3 to model mid-level features using the Mid-level Dataset.

Mid-level feature Aljanaki A2MidVGG-ish A2MidRF-ResNet

1 Melodiousness 0.70 0.70 0.72


2 Articulation 0.76 0.84 0.85
3 Rhythmic Stability 0.46 0.39 0.46
4 Rhythmic Complexity 0.59 0.66 0.66
5 Dissonance 0.74 0.75 0.77
6 Tonal Stability 0.45 0.47 0.51
7 Minorness 0.48 0.52 0.61

Avg. 0.60 0.62 0.66

Table 4.3: Pearson correlation coefficient between predictions and ground-truth values for
mid-level feature predictions using our models, compared with those reported
by Aljanaki and Soleymani [9].

Since our models are significantly smaller than the Inception v3 architecture, we
expect them to not overfit on the training data, which is a concern in the case of
the Inception v3 model. In our experiments we found that reducing the size and
complexity of the models improves the performance up to a point, and this
performance is already better than the baseline of Aljanaki and Soleymani [9].
Pre-training is therefore not necessary when using our smaller and simpler models. The
metrics obtained using our models are given in Table 4.3, and the training curves
for the two backbone variants are shown in Figure 4.6.
Second, we use our backbone models to model the emotion directly, without
going through a mid-level bottleneck. This will be the basis of the non-bottleneck
part of the baseline, using which we will compute the cost of explainability. We
train this model by using the audio and emotion annotations from the Soundtracks
dataset.

4.4.5 Training and Evaluating the Independent Scheme

As described earlier in Section 4.2.2, the independent scheme trains the two mod-
els f and g separately, each using its own relevant annotated data. Model g learns
to predict mid-level features using the model training parameters mentioned
above, while f is a simple linear regression model that is optimised analytically.
During test time however, the trained model fˆ ingests the outputs from the
trained model ĝ. The performance metrics of this scheme are detailed in Table 4.5
(using VGG-ish backbone) and Table 4.6 (using RF-ResNet backbone). We observe
that the cost of explanation is maximum in this scheme. The loss in performance
of emotion prediction could be attributed to the difference between the distribution
of the mid-level feature values in the training data (ground-truth values) and the
distribution of the mid-level feature values at test time (values predicted by ĝ).

4.4.6 Training and Evaluating the Sequential Scheme

In the sequential scheme, instead of training f using the true mid-level feature
annotations as inputs, we use the outputs of the trained model ĝ as the inputs
to f . We expect the performance of the final emotion prediction to be better in
this case because f is trained on the actual distribution of ĝ(x), resulting in a
lower cost of explanation. The performance metrics of this scheme are given in
Table 4.5 (using VGG-ish backbone) and Table 4.6 (using RF-ResNet backbone).
We see that indeed the average performance of emotion prediction improves and the cost of
explanation is reduced.

4.4.7 Training and Evaluating the Joint Scheme

Finally, in the joint scheme, we use a multi-task approach to train the entire
network that, ideally, could learn an internal representation useful for both
prediction tasks, while keeping the interpretability of the linear weights. This
network learns to predict mid-level features and emotion ratings jointly, but
still predicts the emotions directly from the mid-level via a linear layer. This is
achieved by the second last layer having exactly the same number of units as there
are mid-level features (7), followed by a linear output layer with 8 outputs. From
this network, we extract two outputs – one from the second last layer (”mid-level
layer”), and one from the last layer (”emotion layer”). We compute losses for both
the outputs and optimise the combined loss (the sum of the two losses). The
losses can be weighted differently using the parameter λ (in our experiments, we
found that λ = 2.0 gives the optimal performance for the two tasks). The results
of the joint training are presented in Table 4.5 (using VGG-ish backbone) and
Table 4.6 (using RF-ResNet backbone).
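A minimal sketch of one optimisation step of the joint scheme is shown below; a toy stand-in network replaces the actual A2Mid2E model, MSE is assumed for both loss terms, and λ = 2.0 is the weight mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBottleneck(nn.Module):
    """Tiny stand-in for the A2Mid2E network: returns (mid-level, emotion) predictions."""
    def __init__(self):
        super().__init__()
        self.to_mid = nn.Linear(16, 7)
        self.to_emo = nn.Linear(7, 8)
    def forward(self, x):
        mid = self.to_mid(x)
        return mid, self.to_emo(mid)

model = ToyBottleneck()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 2.0                                                    # relative weight of the mid-level loss

x = torch.randn(8, 16)                                       # stand-in inputs (batch of 8)
mid_true, emo_true = torch.randn(8, 7), torch.randn(8, 8)    # stand-in annotations

mid_pred, emo_pred = model(x)
loss = lam * F.mse_loss(mid_pred, mid_true) + F.mse_loss(emo_pred, emo_true)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```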

4.4.8 Results

First, from Table 4.4 we verify that our cross-validated performance metrics of
modelling emotion ratings using mid-level feature ratings match those of Aljanaki
and Soleymani [9]. Next, we look at the emotion prediction metrics using the
direct A2E model, and compare it to the bottleneck models (A2Mid2E) trained
using the three schemes: independent, sequential, and joint.
The results reflect the expected trends. In both Table 4.5 (VGG-ish backbone)
and Table 4.6 (RF-ResNet backbone), we see that the independent scheme results
in maximum cost of explanation, followed by the sequential and the joint, with
the joint training showing the minimum cost. A notable observation is that
the average correlation for emotion prediction using the mid-level annotations
(Table 4.4) is close to the average correlation for emotion prediction directly from
audio (A2E row in Table 4.5 and Table 4.6), which suggests that mid-level features
are able to capture as much information about emotion variation as a direct A2E
model with the full spectrograms as inputs. For both the VGG-ish backbone and
the RF-ResNet backbone models, the metrics for the jointly trained bottleneck
variant are very close to the corresponding direct A2E models.

Valence Energy Tension Anger Fear Happy Sad Tender Avg.

Mid2E (Aljanaki) 0.88 0.79 0.84 0.65 0.82 0.81 0.73 0.72 0.78
Mid2E (Ours) 0.88 0.80 0.85 0.68 0.83 0.82 0.75 0.73 0.79

Table 4.4: Modelling emotions in Soundtracks dataset using annotated mid-level feature
values. The numbers are Pearson correlation coefficient values obtained using
a linear regression model.

Valence Energy Tension Anger Fear Happy Sad Tender Avg.

A2E 0.81 0.79 0.84 0.82 0.81 0.66 0.60 0.75 0.76
A2Mid2Eind 0.66 0.69 0.65 0.66 0.67 0.57 0.43 0.52 0.61
A2Mid2Eseq 0.79 0.74 0.78 0.72 0.77 0.64 0.58 0.67 0.71
A2Mid2Ejoint 0.82 0.78 0.82 0.76 0.79 0.65 0.64 0.72 0.75

CoEind 0.15 0.10 0.19 0.18 0.14 0.09 0.17 0.23 0.15
CoEseq 0.02 0.05 0.06 0.10 0.03 0.02 0.02 0.08 0.05
CoEjoint −0.02 0.01 0.02 0.06 0.02 0.01 −0.04 0.03 0.01

Table 4.5: Summary of model performances on emotion prediction on the Soundtracks


dataset, using the VGG-ish backbone (see Section 4.2.3). The topmost row (A2E)
predicts emotions directly. The following three rows are the bottleneck models
trained independently (“ind”), sequentially (“seq”), and jointly (“joint”). The
three rows at the bottom give the corresponding “cost of explanation” (CoE)
incurred by the three bottleneck models.

Regarding the differences between the VGG-ish and the RF-ResNet models, we
see that the RF-ResNet performs slightly better in terms of A2E performance as
well as in terms of overall costs of explanation for all three variants of A2Mid2E.
Going forward, we will use the jointly trained version of the VGG-ish model, in
order to keep the results consistent with Chowdhury et al. [45]. The slight differ-
ence in performance for the two models does not greatly affect the explanation
process.

4.5 obtaining explanations


Since the mapping between mid-level features and emotions is linear in all three
proposed schemes (independent, sequential, and joint), it is now straightforward
to create human-understandable explanations. Linear models can be interpreted
by analysing their weights: increasing a numerical feature by one unit changes
the prediction by its weight. A more meaningful analysis is to look at the effects,
which are the weights multiplied by the actual feature values [137]. An effects
plot shows the distribution, over a set of examples, of the effects of each feature
on each target. Each dot in an effects plot can be seen as the amount this feature
contributes (in combination with its weight) to the prediction, for a specific
instance. Instances with effect values closer to 0 get a prediction closer to the

Valence Energy Tension Anger Fear Happy Sad Tender Avg.

A2E 0.83 0.88 0.82 0.89 0.82 0.65 0.71 0.72 0.79
A2Mid2Eind 0.75 0.76 0.71 0.66 0.71 0.68 0.56 0.62 0.68
A2Mid2Eseq 0.76 0.81 0.75 0.74 0.74 0.72 0.61 0.65 0.73
A2Mid2Ejoint 0.82 0.88 0.84 0.82 0.83 0.69 0.74 0.70 0.79

CoEind 0.08 0.12 0.11 0.23 0.11 −0.03 0.15 0.10 0.11
CoEseq 0.07 0.07 0.07 0.15 0.08 −0.07 0.10 0.07 0.04
CoEjoint 0.01 0.00 −0.02 0.06 0.00 −0.04 −0.03 0.02 0.00

Table 4.6: Summary of model performances on emotion prediction on the Soundtracks


dataset, using the RF-ResNet backbone (see Section 4.2.3). The topmost row
(A2E) predicts emotions directly. The following three rows are the bottleneck
models trained independently (“ind”), sequentially (“seq”), and jointly (“joint”).
The three rows at the bottom give the corresponding “cost of explanation” (CoE)
incurred by the three bottleneck models.

intercept (bias term). Figure 4.9 shows the effects for the joint model computed
over the held-out set.
The particular model, in this case, is the jointly trained variant of the VGG-ish
model from above. In the following, all the statistics and explanations will be
based on this model, in order for this presentation to be consistent with the results
published in Chowdhury et al. [45].
First we will show how this can be used to provide model-level explanations
and then we will explain a specific example at the song level.
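Computationally, the effects are simply an element-wise product of the learned weight matrix of fˆ with a clip's mid-level predictions; a minimal sketch with random stand-in values:

```python
import numpy as np

# W: learned weights of the linear mid-level-to-emotion layer, shape (8 emotions, 7 features);
# mid: mid-level predictions g_hat(x) for one clip, shape (7,). Values below are stand-ins.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 7))
mid = rng.random(7)

effects = W * mid                       # effect of feature j on emotion i is W[i, j] * mid[j]
emotion_pred = effects.sum(axis=1)      # adding the bias term gives the actual prediction
```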

4.5.1 Model-level Explanation

Before a model is trained, the relationship between features and response variables
can be analysed using correlation analysis. The pairwise correlations between
mid-level and emotion annotations in our data are shown in Figure 4.7a. When
we compare this to the effect plots in Figure 4.9, or the actual weights learned for
the final linear layer (Figure 4.7b) it can be seen that for some combinations (e.g.,
valence and melodiousness, happy and minorness) positive correlations go along
with positive effect values and negative correlations with negative effect values,
respectively. This is not a general rule, however, and there are several examples
(e.g., tension and dissonance, energy and melody) where it is the other way
around. The explanation for this is simple: correlations only consider one feature
in isolation, while learned feature weights (and thus effects) also depend on the
other features and must hence be interpreted in the overall context. Therefore
it is not sufficient to look at the data in order to understand what a model has
learned.
To get a better understanding, we look at each emotion separately, using the
effects plot given in Figure 4.9. In addition to the trend of the effect (positive
or negative) – which we can also read from the learned weights in Figure 4.7b
(but only because all of our features are positive) – we can also see the spread

                     valence  energy  tension  anger   fear   happy    sad  tender
minorness             -0.39   -0.26    0.34    0.29    0.42   -0.75   0.41   -0.21
tonal_stability        0.72   -0.24   -0.67   -0.46   -0.66    0.49   0.23    0.52
dissonance            -0.83    0.41    0.79    0.60    0.76   -0.44  -0.41   -0.64
rhythm_stability       0.34    0.13   -0.24   -0.14   -0.31    0.33  -0.02    0.13
rhythm_complexity     -0.39    0.44    0.42    0.30    0.34   -0.08  -0.38   -0.34
articulation          -0.36    0.75    0.46    0.41    0.26    0.11  -0.49   -0.50
melodiousness          0.78   -0.39   -0.73   -0.55   -0.74    0.37   0.46    0.61

(a) Pairwise correlation between mid-level and emotion annotations.
                     valence  energy  tension  anger   fear   happy    sad  tender
minorness             -0.01   -0.17    0.47    0.15    0.28   -0.41   0.27   -0.01
tonal_stability        0.19   -0.26    0.21   -0.19    0.24    0.42   0.10    0.37
dissonance            -0.31    0.03    0.05    0.18    0.57   -0.46  -0.14    0.01
rhythm_stability       0.06    0.04   -0.30   -0.21   -0.24   -0.16   0.36    0.08
rhythm_complexity      0.27    0.28    0.38    0.11    0.11    0.37   0.11   -0.03
articulation          -0.04    0.44    0.24    0.25    0.01    0.18  -0.54   -0.20
melodiousness          0.48    0.10   -0.36   -0.14   -0.39   -0.01   0.01    0.47

(b) Weights from the linear layer of the jointly trained A2Mid2E model (VGG-ish backbone).

Figure 4.7: Comparing the correlations in the Soundtracks dataset with the learned
weights of fˆ mapping the mid-level feature space to the emotion space.

of the effect which tells us more about the actual contribution the feature can
have on the prediction, or how different combinations of features may produce
a certain prediction. Notably, we find many intuitive relationships between the
mid-level features and emotions. For example, we can see that minorness has
a large positive effect on “sad”, “tension”, and “anger” emotions, and a large
negative effect on “happy”. Another intuitive relationship reflected in the effect
plots is that “tender” has a large positive effect from the “melodiousness” and
“tonal stability” features.

4.5.2 Song-level Explanations

Effect plots also permit us to create simple example-based explanations that can be
understood by a human. The feature effects of single examples can be highlighted
in the effects plot in order to analyse them in more detail, and in the context


Figure 4.8: Emotion prediction profiles for the two example songs #153 and #322. These
two examples were chosen as they have similar emotion profiles but different
mid-level profiles. The mid-level feature effects are shown on the next figure
(Figure 4.9) as red and blue points.

of all the other predictions. To show an interesting case, we picked two songs
with similar emotion profiles but different mid-level profiles. To do so, we computed
the pairwise Euclidean distances between all songs in emotion space (d_E) and in mid-level
space (d_Mid) separately, scaled both to the range [0, 1], and combined them as
d_comb = d_E − (1 − d_Mid). We then selected the two songs from the Soundtracks
dataset that maximised d_comb. The samples are shown in Figure 4.9 as a red
square (song #153) and a blue dot (song #322). The reader can listen to the
songs/snippets by downloading them from the Soundtracks dataset page.
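The selection step can be sketched as follows; random stand-in arrays take the place of the model's emotion and mid-level predictions, and the combination d_comb follows the definition above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
E = rng.random((360, 8))    # emotion predictions per song (stand-in values)
M = rng.random((360, 7))    # mid-level predictions per song (stand-in values)

def scaled_pairwise(X):
    """Pairwise Euclidean distances, scaled to [0, 1]."""
    D = squareform(pdist(X, metric='euclidean'))
    return (D - D.min()) / (D.max() - D.min())

d_e, d_mid = scaled_pairwise(E), scaled_pairwise(M)
d_comb = d_e - (1 - d_mid)                   # combination defined above
np.fill_diagonal(d_comb, -np.inf)            # exclude trivial self-pairs
i, j = np.unravel_index(np.argmax(d_comb), d_comb.shape)   # indices of the selected pair
```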
As can be seen from Figure 4.9 and from the emotion prediction profile of the
two songs (see Figure 4.8), both songs have relatively high predicted values for
tension and energy, but apparently for different reasons: song #322 more strongly
relies on “minorness” and “articulation” for achieving its “tense” character; on
the other hand, its rhythmic stability counteracts this more strongly than in the
case of song #153. The higher score on the “energy” emotion scale for #322 seems
to be primarily due to its much more articulated character (which can clearly be
heard: 153 is a saxophone playing a chromatic, harmonically complex line, 322 is
an orchestra playing a strict, staccato passage).

Figure 4.9: Effects of each mid-level feature on the prediction of each emotion. The boxplots show the distribution of feature effects of the model A2Mid2E_joint (VGG-ish backbone), helping us to understand the model globally. Additionally, two example songs (blue dots, red squares) are shown to provide song-level explanations (see Section 4.5.2 for a discussion).

4.6 discussion and conclusion


Our experiments in this chapter suggest that it is possible to gain usable and
musically meaningful insights about emotion prediction with a mid-level feature
bottleneck model without significant loss in model performance. The bottleneck
architecture we propose is simple and re-usable, and could be applied to other
tasks where both types of annotations (mid-level features and related high-level
labels) are available.
Model interpretability and the possibility to obtain explanations for a given
prediction are not ends in themselves. There are many scenarios where one may

need to understand why a piece of music was recommended or placed in a certain


category. Concise explanations in terms of mid-level features would be attractive,
for example, in recommender systems or search engines for ‘program music’ for
professional media producers, where mid-level qualities could also be used as
additional search or preference filters. As another example, think of scenarios
where we want a music playlist generator to produce a music program with a
certain prevalent mood, but still maintain musical variety within these limits. This
could be achieved by using the mid-level features underlying the mood/emotion
classifications to enforce a certain variability, by sorting or selecting the songs
accordingly.
Generally, the relation between the space of musical qualities (such as our mid-
level features) and the space of musically communicated emotions and affects
deserves more detailed study. A deeper understanding of this might even give
us means to control or modify emotional qualities in music by manipulating
mid-level musical properties. It also forms a strong foundation to analyse musical
performances from purely audio recordings, such as renditions of a piano piece
by different performers.
An observation about the models presented in this chapter is that while the
mid-level to emotion layer is transparent due to the linear connections, the part
of the model which connects the input to the mid-level layer is still a black box.
It is a deep convolutional feature extractor. In the next chapter, we will thus
investigate methods to explain this part of the model, and look at an interesting
use case of debugging biased predictions of an emotion model.
5 TRACE: TWO-LEVEL EXPLANATIONS USING INTERPRETABLE INPUT DECOMPOSITION

5.1 The Unexplained Part of a Bottleneck Model
5.2 Going Deeper using Two-Level Explanations
5.3 Local Interpretable Model-agnostic Explanations (LIME)
5.4 Explanations via Spectrogram Segmentation
5.5 Explanations using Sound Sources
5.6 Model Debugging: Tracing Back Model Bias to Sound Sources
5.7 Conclusion

In the previous chapter, we asked whether musical emotion can be modelled


through an intermediate set of features that are less abstract than emotions, and
yet hold perceptual relevance. We saw that indeed, these intermediate represen-
tations – mid-level features in our case – have good predictive and explanatory
powers for ratings of perceived musical emotion. These intermediate features
were then used to derive insights for an audio-to-emotion model as well as ex-
plain particular instances of emotion prediction, with minimal impact on the
recognition accuracy, compared to a directly optimised end-to-end model. Al-
though this is an encouraging result in itself, it still leaves a big chunk of the
audio-to-emotion model unexplained. Is there a way to explain the mid-level
feature predictions themselves using features that possibly lie closer to the actual
input? In this chapter, we go deeper and generate explanations for the mid-level
feature predictions using components that have some direct analogue on the
input, in a way that also allows us to generate not only visual, but also listenable
explanations.
The chapter begins by providing a brief description of the background and
previous work on this topic. In particular we look at how multi-level explanations
are related to hierarchical feature learning in deep convolutional networks that is
seen to emerge in models trained on large scale image datasets. We also look at
previous works on post-hoc explanations for images and how these have been
applied in the audio domain.
Next, we describe in detail the two approaches we explored on explaining Mid-
level feature predictions using components derived from the input audio. The
first approach extracts interpretable components from an input spectrogram using
image segmentation and computes the relative importance of these components


towards a prediction. These spectrogram components can then be either only


visualised, or converted to a listenable explanation. The second approach uses a
pre-trained source separation model to decompose an input musical audio into
its component instrument tracks (known as stems). These stems are then used
as interpretable features for the post-hoc explanation of a prediction. As with
the previous approach, one can listen to these explanations by playing the most
important stem for a prediction.
Finally, we use our two-level explanations (via the second approach) to debug
a biased model resulting from a highly unbalanced dataset. In this case, we first
identify the Mid-level feature that results in an overestimation of valence for a
particular genre of music, and going deeper, we then identify which component
stem explains this bias in the Mid-level feature prediction. We then make a
qualitative assessment of the model using the explanations, providing us with
an indication of what the model is actually learning, and verifying that upon
training on a balanced dataset, the explanations change in a predictable fashion.
The work presented in this chapter was done in collaboration with my col-
league Verena Praher (née Haunschmid) and has been described in the following
publications:

• S. Chowdhury, V. Praher, G. Widmer


Tracing Back Music Emotion Predictions to Sound Sources and Intuitive
Perceptual Qualities, In Proc. of the Sound and Music Computing Conference,
(SMC 2021), Virtual

• V. Haunschmid, S. Chowdhury, G. Widmer


Two-level Explanations in Music Emotion Recognition, Machine Learning
for Music Discovery Workshop, International Conference on Machine Learning
(ICML 2019), Long Beach, USA

5.1 the unexplained part of a bottleneck model


So far (in Chapter 4), we have used mid-level features learned from data to act as
explanatory features for musical emotion predictions. We have assumed – based
only on the metrics Pearson correlation coefficient and the R2 -score – that the
part of our network that learns these mid-level features actually learns something
musically meaningful from the input audio.
Given that a trained Audio → Mid-level → Emotion model (the A2Mid2E
model from Chapter 4) is represented as fˆ ◦ ĝ, the model part ĝ encodes the input
x into the mid-level feature space. In our models, ĝ is a convolutional network,
either without residual connections between layers (as seen in the “VGG-ish”
variant), or with residual connections (as seen in the “RF-ResNet” variant). In
either case, ĝ is several layers deep, and is essentially a black-box, providing us no
transparency or interpretability into the mapping of input to mid-level features.
In order to bring interpretability to this part of the model, we take inspira-
tion from techniques used in the vision domain, such as Grad-CAM (Gradient-
weighted Class Activation Mapping) [164] and LIME (Local Interpretable Model-
agnostic Explanations) [156]. These methods highlight regions of an input image

most responsible for the output predicted class. Grad-CAM computes gradients
for a target class and traces these back to the input layer, producing a coarse
localisation map. LIME trains an interpretable surrogate model on perturbations
of a particular input-output pair, and computes importance values (e.g. in terms
of weights of a linear model) for the input features (which in the case of images
could be a segmentation map1 ).
The LIME method is particularly attractive to us due to its simplicity and
usefulness for our particular application. It allows us to examine ĝ in a post-hoc
fashion, without any changes in the model or training procedure. An additional
advantage of using LIME, as we will see later on in this chapter, is that it allows
for different kinds of interpretable representations of the input. This enables us
to choose the input modality that works best in terms of properties relevant for
our use-case. In this chapter, we will examine two different input modalities:
spectrogram components and source-separated audio stems.
Previous work on this topic includes the work by Mishra et al. [135] who intro-
duced SLIME (Sound-LIME), which used LIME over an input space composed of
rectangular time-frequency regions of a spectrogram to identify which regions
are important for singing-voice detection. Other recent work on explainability in
music information retrieval tasks include interpretable music transcription using
invertible networks [101], interpretable music tagging using attention layers [178],
and explainability in recommender systems [7].
Circling back to our current theme of explaining music emotion predictions,
in this chapter, we see how to break down the explanations into two levels –
the first using mid-level features (as in Chapter 4), and the second by using
input spectrogram/audio components – with the added objective of making the
explanations listenable.

5.2 going deeper using two-level explanations


The central motivation of our work in this chapter is as follows: we aim to further
extend the explanations provided by mid-level features (as described in Chapter 4)
by explaining the mid-level features themselves in terms of features obtained
directly from the input. By doing so, we wish to ground the complete end-to-end
explanation of our audio-to-emotion model on specific interpretable components
of the actual input.
It is reasonable to ask why we take this approach of deconstructing the input
into interpretable components, as opposed to, say, going lower in the hierarchy
and modelling the mid-level features using low-level features (such as mel-
frequency cepstral coefficients or MFCCs, spectral centroid, chroma, onsets etc.)
extracted analytically from the input audio. The main reason for doing so has to
do with the final performance metrics and the quality of the resulting explana-
tions. Through experimentation, we found that end-to-end learning gives a large
performance advantage for both emotion and mid-level feature modelling over
learning from hand-crafted features. Moreover, explanations based on a large
1 A segmentation map refers to a representation of an image with its pixels mapped to segments,
which are found by a clustering or segmentation algorithm.

Figure 5.1: Schematic of the two-level explanation procedure. A trained bottleneck model
( fˆ ◦ ĝ) is used to obtain emotion predictions ( fˆ( ĝ(x)) and intermediate mid-
level feature values ( ĝ(x)) for that prediction. The mid-level feature values are
then explained via LIME (Local Interpretable Model-agnostic Explanations)
using an interpretable decomposition of the input.

number of hand-crafted features tend to have more complexity and less compre-
hensibility. For instance, it is hard for a typical user to form an intuition about
MFCCs, a very popular feature in audio processing. In contrast, if the explanatory
components have the property of being clearly visualised in an image, or even
better, listened to, then we believe it lends the explanations greater usability and
trust. A broad schematic of our proposed two-level explanation method is shown
in Figure 5.1.
Given a trained audio to emotion model with mid-level intermediates (as in
Figure 4.2), we first obtain the mid-level explanations in terms of effects (see
Section 4.5.2) for an emotion prediction. We then choose a mid-level feature to be
explained further (every mid-level feature can be chosen, one at a time). Finally,
we obtain an interpretable decomposition from the input and use LIME to explain
the chosen mid-level feature in terms of the interpretable components. In our
work, we explored obtaining interpretable components from the input using two
methods:

• Spectrogram decomposition: Using an image segmentation algorithm to


partition the spectrogram into connected components, to roughly capture
visual spectrogram features in the components. Examples of features cap-
tured this way are individual notes, represented as horizontal lines in the
spectrogram marking the harmonics, singing voice formants, and percussive
hits or note onsets.

• Source separation: Decomposing a piece of musical audio into individual


instrument tracks or sources using off-the-shelf trained source separation

models. We separate the input audio into five tracks – piano, bass, drums,
vocals, and other. This is a very intuitive decomposition of the input since
humans (to be precise, humans who have had prior exposure to the style of
music in question) naturally are able to perceive different characteristics of
different instruments in the music they listen to.

These decomposition methods are described in detail in Section 5.4 and Sec-
tion 5.5. Before delving into the decomposition methods, let us first understand
how LIME works.

5.3 local interpretable model-agnostic explanations (lime)
Motivated by the need for an algorithm to explain the predictions of any black-box
model with interpretable qualitative explanations, Ribeiro et al. [156] proposed
LIME (Local Interpretable Model-agnostic Explanations), a method that has
proven quite influential in the explainable AI world [3, 14]. To explain a prediction,
this method identifies an interpretable model over an interpretable representation
of an input that is locally faithful to the classifier or regressor in question. Ribeiro
et al. [156] define “explaining a prediction” as:

[ . . . ] presenting textual or visual artefacts that provide qualitative


understanding of the relationship between the instance’s components
(e.g. words in text, patches in an image) and the model’s prediction.

We now give a brief description of this method, since we use this algorithm for
obtaining the second level explanations of our two-level explanation scheme.
Let the model under analysis be denoted by ĝ : R^d → R, which takes the
input x ∈ R^d (spectrograms with d pixels) and produces the prediction ĝ(x) (mid-level
features). Our aim is to explain the model prediction. We use x′ ∈ {0, 1}^{d′}
to denote a binary vector representing the interpretable version of x (typically,
d′ ≪ d). How this interpretable representation is derived from the original
input depends on the application, input type, and the intended use case (the
two methods that we propose for decomposing a music audio input into its
interpretable representations are given in Section 5.4 and Section 5.5).
Now, the aim of LIME is to find a surrogate model h ∈ H, where H is a class of
potentially interpretable models and {0, 1}^{d′} is the domain of h. For example, H
could be a space of linear models or decision trees – models that are interpretable
by design. However, not all models in this space will be useful candidates for
interpretability. One of the desirable qualities of an explanation is that it should be
easy to understand by human users. A linear model with hundreds or thousands
of contributing features will not offer any more insight to a human than the
original black-box model. Therefore, a measure of complexity µ_C(g, h, x) of an
explanation h ∈ H is introduced at this point. The complexity of a linear model
could be defined as the number of non-zero weights; for a decision tree, it could
be defined as the depth of the tree.

A second desirable quality for the explanation of a particular prediction in this


framework is that the interpretable surrogate model should be locally faithful,
meaning that it must correspond to how the model behaves in the vicinity of the
instance being predicted2 . Thus, a proximity measure πx (z) is introduced which
measures the distance between z and x, so as to define a locality around x, and
L( g, h, πx ) is defined as a measure of how unfaithful h is in approximating g in
the locality defined by πx . To enforce both interpretability and local fidelity, we
must then minimise L( g, h, πx ) while having µC ( g, h, x) as a regularising term.
Thus, the explanation is obtained by the following optimisation:

ξ(x) = arg min_{h ∈ H} [ L(g, h, π_x) + µ_C(g, h, x) ]    (5.1)

The parameters of the surrogate model h are obtained by minimising the locality-aware
loss L(g, h, π_x) without making any assumptions about g, since we want
the explainer to be model-agnostic. This gives us an h which approximates g
in the vicinity of the original input x. The minimisation of the loss function is
achieved by training h over a dataset constructed by drawing samples around
the input x (through perturbation of the input) and passing these through the
original model to obtain the corresponding outputs. That is, given a binary
vector x′ ∈ {0, 1}^{d′} for the interpretable representation of x, we sample binary
vectors by randomly selecting the non-zero elements of x′, giving us a perturbed
binary vector z′ ∈ {0, 1}^{d′}, using which we recover the sample in the original
representation z ∈ R^d. From each such perturbed input z, we obtain g(z), which
is used as a label for the explanation model.
Once we have this constructed dataset Z of perturbed samples z and corre-
sponding labels g(z), Equation 5.1 can be optimised over the space H giving us
ĥ, which is the required explanation ξ (x) of the current input x to the black-box
model g.
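The following is a minimal, self-contained sketch of this procedure for a single scalar output – not the reference LIME implementation. A weighted ridge regression stands in for the interpretable model class H, an exponential kernel on the fraction of switched-off components plays the role of π_x, and the ridge regulariser loosely takes the place of the complexity term µ_C.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, components, n_samples=1000, kernel_width=0.25, seed=0):
    """Minimal LIME-style surrogate for a scalar output.

    predict_fn : maps a binary mask z' in {0,1}^{d'} (components kept) to the
                 black-box prediction g(z) for the corresponding perturbed input.
    components : number of interpretable components d'.
    Returns the weights of a locally weighted linear surrogate model.
    """
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, components))   # random on/off perturbations
    Z[0] = 1                                                # include the unperturbed instance
    y = np.array([predict_fn(z) for z in Z])                # black-box outputs per sample
    # Proximity pi_x(z): exponential kernel on the fraction of components switched off
    dist = 1.0 - Z.mean(axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_                                  # importance of each component

# Toy usage: a "black box" that simply sums fixed weights over the active components
true_w = np.linspace(-1, 1, 20)
coefs = lime_explain(lambda z: float(z @ true_w), components=20)
```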
LIME works with an interpretable representation of the original input. How do
we obtain an interpretable representation for a musical audio? We propose two
methods in the following sections.

5.4 explanations via spectrogram segmentation


The first approach takes inspiration from Mishra et al. [135], who use tempo-
ral, spectral, and time-frequency segments of the input spectrogram to generate
samples to be used as inputs to the LIME pipeline. Their method generates expla-
nations in the form of rectangular sections of the spectrogram; sections important
for a particular prediction are computed using LIME and are highlighted on a
spectrogram visualisation. They show good performance on explaining classifiers,
such as singing voice detectors. However, they also point to the need for bet-
ter, finer-grained representations or audio “objects” for improved interpretation
in the music domain. This is relevant in our case, since we are attempting to
2 Ribeiro et al. [156] emphasise that local fidelity does not imply global fidelity: features that are
globally important may not be important in the local context

explain perceptual features, which may be grounded in attributes not correctly


represented by rectangular blocks. Therefore, we ask if there are better ways to
partition a spectrogram into sections which capture perceptually important audio
features, and we come up with a plausible solution.
Instead of splitting the spectrogram into time-frequency blocks, we use an
image segmentation algorithm to give us the interpretable components of the
spectrogram. We tried several segmentation algorithms, like SLIC (Simple Linear
Iterative Clustering) [2], Chan Vese [41], Watershed, and Felzenszwalb [60], and
found by qualitative inspection that Felzenszwalb gives the most reasonable visual
segmentation of the spectrograms. See Figure 5.2a for a comparison between
three segmentation algorithms on a spectrogram, and Figure 5.2b for a detail of
the Felzenszwalb segmentation annotated with some easily identifiable musical
features.

5.4.1 Computing Importance Weights of Spectrogram Segments using LIME

Let us denote a trained A2Mid2E model as F_E := fˆ ◦ ĝ, which takes the spectrogram
x ∈ R^d and produces two outputs: the mid-level feature predictions ĝ(x)
and the emotion predictions fˆ(ĝ(x)). We need LIME to find a surrogate model
ĥ ∈ H for ĝ. We restrict H to be the class of linear regression models.

We use the Felzenszwalb algorithm to segment the input x and obtain the binary
representation x′ ∈ {0, 1}^{d′} of the segmented input, where d′ is the number of
segments. Note that this results in a sample identical to the original input only if
all the segments are turned on (i.e. x′ = [1, 1, . . . , 1]). All other samples will be
perturbations of the original sample. In our experiments, we find that the most
satisfactory results are obtained by using the Python package skimage3 with the
parameters scale = 25 and min_size = 40. We generate 50,000 perturbations of
the input spectrogram and train a linear model h on the dataset Z generated by
the input (zi ) and output ( ĝ(zi )) pairs resulting from these perturbations.
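A minimal sketch of the segmentation and perturbation step is given below; zeroing out the switched-off segments is one simple way of constructing a perturbed sample and is an assumption of this sketch, and the array values are random stand-ins.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

# spec: magnitude spectrogram as a 2-D array (random stand-in values here)
spec = np.random.rand(469, 149).astype(np.float32)

segments = felzenszwalb(spec, scale=25, min_size=40)   # integer segment label per pixel
n_segments = segments.max() + 1

def apply_mask(spec, segments, z):
    """Perturbed spectrogram: keep segment s only where z[s] == 1 (others zeroed out)."""
    return spec * z[segments]

z = np.random.randint(0, 2, size=n_segments)           # one random perturbation vector z'
perturbed = apply_mask(spec, segments, z)
```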
The number of most important segments to be visualised in the final explana-
tion is a controllable parameter that the user can choose. In our case, we select it
automatically by thresholding on the p-value to weight ratio. For our experiments,
we observed a ratio of 10⁻⁶ to work well, which selects about 30 to 60 features
from a total of about 600.
The final outputs of the Audio-to-Mid explanation process are two spectrograms
that show the image segments with positive and negative weights, respectively –
in other words, those aspects of the spectrogram that most strongly contributed
to the prediction, in a positive or negative way. The other parts of the spectrogram
are hidden. That is, we find the spectrogram masks x′_pos, x′_neg ∈ {0, 1}^{d′}, with
m and n non-zero elements respectively, where m and n are obtained from
the thresholding-based selection of features mentioned above. In the case of the
positive explanation, segments with positive weights are chosen, and vice versa for the negative
explanation.

3 https://scikit-image.org/docs/dev/api/skimage.segmentation.html

(a) A sample spectrogram (top left) with different segmentation algorithms applied. Visually,
Felzenszwalb appears to capture meaningful segments in the spectrogram.

(b) Detail of the Felzenszwalb segmentation on the sample spectrogram. Some musically relevant
segments are indicated.

Figure 5.2: Decomposition of a spectrogram into interpretable components using image


segmentation algorithms. The interpretable decomposition is then used to
generate an explanation using LIME [156], which assigns importance values
to each segment for a particular prediction.

Figure 5.3: Spectrogram segments with positive and negative effect for the prediction of articulation. A modified spectrogram with weighted segments according to the explanations is reconstructed for auralisation.

5.4.2 Evaluation of Explanations

For testing the segmentation-based explanations, we use an A2Mid2E_RF-ResNet
model on samples from the Mid-level Features Dataset that were held out during
model training. We compute the average fidelity and complexity measures (as
introduced in Chapter 3, Section 3.4) as evaluation metrics. Recall that local
fidelity measures how closely the explanation function reflects the model being

explained locally, and complexity measures how “efficient” the explanation


is – in terms of providing maximal information about the prediction using
a minimal number of features. We use the R²-score of the surrogate model as the
fidelity measure. In our tests, the explanations using spectrogram segmentation
yielded a mean R2 -score across all mid-level features of 0.72, showing that the
surrogate model is able to capture a significant amount of variance of the model
predictions locally. For complexity, we compute the normalised entropy of the
importance weights of the spectrogram segments (entropy of the importance
weights distribution normalised by the entropy of a uniform distribution). In our
tests, we obtain a mean normalised entropy of 0.86 across all mid-level features,
reflecting that the explanations are more informative than the random baseline
(a random explanation would select spectrogram components according to a
uniform distribution).
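For reference, the two metrics can be computed along the following lines (a sketch using numpy and scikit-learn; y_model and y_surrogate denote the predictions of the mid-level model and of the local surrogate on the perturbation dataset, and weights the surrogate's importance weights):

import numpy as np
from sklearn.metrics import r2_score

def fidelity(y_model, y_surrogate):
    # Local fidelity: how well the surrogate reproduces the model's
    # predictions on the perturbation dataset (R^2-score).
    return r2_score(y_model, y_surrogate)

def complexity(weights):
    # Normalised entropy of the importance-weight distribution; 1.0 corresponds
    # to a uniform ("random") explanation, lower values mean that fewer
    # segments carry most of the importance.
    p = np.abs(weights) / np.abs(weights).sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return entropy / np.log(len(p))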
As a qualitative evaluation, we construct some manual examples by concatenating two songs with significant differences in their mid-level predictions. We feed
this concatenated audio to the mid-level predictor model again, and generate
explanations for the mid-level predictions of this audio. In Figure 5.4, the first
song contains a strong and steady drum beat (noticeable as vertical lines on the
spectrogram), and gives a high prediction for rhythm stability. The second song is

(a) Concatenated test sample spectrogram.
(b) Positive explanations for “rhythm stability”.
(c) Positive explanations for “melodiousness”.
(d) Importance weight distributions (relative entropy 0.89 for rhythm stability, 0.87 for melodiousness).

Figure 5.4: Explanations for “rhythm stability” and “melodiousness” on a test sample that is constructed by concatenating two different musical pieces. The first piece separately had a high predicted value of rhythm stability compared to the second (0.75 vs 0.09) and the second piece separately had a high predicted value of melodiousness compared to the first (0.56 vs −0.04).

(a) Concatenated test sample spectrogram.
(b) Positive explanations for “articulation”.
(c) Positive explanations for “dissonance”.
(d) Importance weight distributions (relative entropy 0.93 for articulation, 0.89 for dissonance).

Figure 5.5: Explanations for “articulation” and “dissonance” on a test sample that is constructed by concatenating two different musical pieces. The second piece had a higher predicted value of both articulation and dissonance compared to the first (0.36 vs 0.02 and 0.51 vs 0.33, respectively).

a solo classical guitar piece with held notes (noticeable as horizontal lines on the
spectrogram), and gives a high prediction for melodiousness. We observe that the
vertical spectrogram components are indeed highlighted in the explanation for
rhythm stability and the horizontal components are highlighted in the explanation
for melodiousness. In the second example, shown in Figure 5.5, the first song is a choir performance with lower articulation and dissonance than the second song, which is an up-tempo multi-instrument jazz piece with drums and piano.
The explanation for articulation highlights the note onsets in the second song.
However, the explanation for dissonance is not as clear in this case.
Next, the explanations can be auralised: we use the Griffin-Lim algorithm [78] to approximately invert the magnitude spectrograms and generate the corresponding audio waveforms, which can then be listened to. The positive/negative explanation spectrograms can be auralised individually, but in order to improve the quality and hear them in context, we merge them with the original spectrogram by amplifying the spectrogram elements corresponding to the positive explanation and attenuating the elements corresponding to the negative explanation. This gives us
our final listenable explanation for a mid-level feature prediction. Some examples
could be heard here: 0
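A minimal sketch of this auralisation step is given below, assuming spec is a linear-magnitude spectrogram and pos_mask / neg_mask are boolean masks of the positive and negative explanation segments (as in the earlier sketch); the gain factors are illustrative choices, and a (log-)mel spectrogram would first have to be mapped back to a linear STFT magnitude (e.g. with librosa.feature.inverse.mel_to_stft).

import librosa

def auralise_explanation(spec, pos_mask, neg_mask, boost=2.0, cut=0.25, n_iter=60):
    # Amplify positively contributing segments, attenuate negative ones, and
    # invert the modified magnitude spectrogram with Griffin-Lim.
    weighted = spec.copy()
    weighted[pos_mask] *= boost
    weighted[neg_mask] *= cut
    return librosa.griffinlim(weighted, n_iter=n_iter)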
As we can see, the explanation method described in this section has the po-
tential to increase our trust in the trained mid-level feature model (because the
explanations of mid-level feature predictions correspond with human expectation,
and are not arbitrary). However, there are some drawbacks. Firstly, the spectro-
gram decomposition is based on an image segmentation algorithm (as opposed to
a musically relevant decomposition) and does not always capture useful musical
features. Secondly, since the Griffin-Lim algorithm only gives an approximate
inversion of the magnitude spectrogram, the reconstruction quality is not high.
Thus, we next look at another way to decompose the input audio, one that is more musically inspired, although we lose some granularity.

5.5 explanations using sound sources


When listening to music, one of the natural ways in which humans intuitively discern different “objects” is by recognising the different instruments, or sound sources, present in it (of course, here we are talking about music that does in fact have multiple instruments). Even if casual listeners may not always be aware of doing so, this affects their perception of music, to the point that they can refer to these sources as having distinct expressive qualities. Audiences are moved by the vocal performance of Bohemian Rhapsody by Queen, while the bassline of Another One Bites the Dust by the same band is more impactful for them. Music reviews often
focus on individual instrument performances separately in a song. Fans are less
than pleased when one member of their favourite band is replaced by another for
a live tour4 . These examples suggest that the overall perception of a piece of music
can be affected by subtle variations in the individual instrument components –
variations in the form of style, timbre, timing, dynamics, or even the mixing and

4 This has a social component as well.



production choices. The choice of an instrument itself, as an expressive medium, also offers different opportunities to the composer or performer to vary different features [98]. Thus, it is reasonable to deconstruct an audio input under analysis into its component tracks or sources and use this deconstructed set of audio components as the interpretable representation of the original input sample.
This idea has been explored previously by Haunschmid et al. [84] to explain
music genre tag prediction models, and was implemented in a Python package
called audioLIME5 . Following this, Berardinis et al. [25] used a model built on
source-separated audio to predict and interpret music emotion predictions. In
this thesis, we use audioLIME to explain Mid-level feature predictions.

5.5.1 Computing Importance Weights of Sound Sources Using audioLIME

The audioLIME method [84] is based on the LIME framework described pre-
viously in Section 5.3 and extends its definition of locality for musical data by
using separated sound sources as the interpretable representation. This gives
audioLIME the ability to train on interpretable and listenable features. The key
insight of audioLIME is that interpretability with respect to audio data should
really mean listenability.
In order to generate the interpretable representation, the original input audio is
decomposed into its sources using one of the several available source separation
packages6 (we use Spleeter [88]). The source separation problem is formulated as estimating a set of C sources, {S_1, ..., S_C}, when only given access to the mixture M of which the sources are constituents. We note that this definition, as well as audioLIME, is agnostic to the input type (waveform or spectrogram) of the audio. We use these C estimated sources of an input audio as our interpretable components, i.e. x′ ∈ {0, 1}^C is the interpretable input representation. In our case, C = {piano, drums, vocals, bass, other}. As in the case of the spectrogram segments of Section 5.4, a perturbed input z′ to the model will have some of the components turned off. For example, z′ = (0, 1, 0, 1, 0) results in a mixture only containing estimates of the drums and the bass tracks. The relation of this approach to
the notion of locality as used in LIME lies in the fact that samples perturbed in
this way will in general still be perceptually similar to the original input (i.e.,
recognised by a human as referring to the same audio piece). This system is
shown in Figure 5.6.
In this case, since |C| = 5, the maximum possible number of perturbations for an input is 2^5 = 32, which is small enough to use the whole set of perturbations to obtain the dataset Z for training the surrogate linear model h using LIME.
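A compact sketch of this procedure is shown below. It assumes stems is a dictionary of the five separated waveforms (e.g. as produced by Spleeter) and predict_midlevel a placeholder that wraps the preprocessing and the trained model for a single mid-level feature; since all 32 on/off combinations are enumerated, no sampling is needed.

import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def source_explanation(stems, predict_midlevel):
    names = list(stems)                       # e.g. piano, drums, vocals, bass, other
    # All 2^|C| on/off combinations of the separated sources.
    Z = np.array(list(itertools.product([0, 1], repeat=len(names))))
    y = np.array([
        predict_midlevel(sum(stems[name] * on for name, on in zip(names, z)))
        for z in Z
    ])
    surrogate = LinearRegression().fit(Z, y)  # interpretable surrogate model h
    return dict(zip(names, surrogate.coef_))  # importance weight per source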

5.5.2 Evaluation of Explanations

Similar to the case of spectrogram segmentations, we evaluate our audioLIME explanations using fidelity and complexity measures. However, here, we will perform these tests using two datasets that will be relevant in the next section

5 https://github.com/CPJKU/audioLIME
6 https://source-separation.github.io/tutorial/intro/open_src_projects.html

Figure 5.6: AudioLIME schematic (blocks: input → source separation → interpretable decomposition into stems → permute, mask, and combine → perturbed samples → obtain labels → train interpretable surrogate model). The input audio is deconstructed into its component instrument stems using an off-the-shelf source separation algorithm (we use Spleeter [88]). The components are then permuted and mixed to give us perturbed samples in the neighbourhood of the original sample. The perturbed samples are then passed through the LIME pipeline to train a local interpretable surrogate model. Image adapted from Haunschmid et al. [84].

(Section 5.6). There, we are going to encounter the DEAM and PMEmo datasets,
which are emotion datasets containing audio samples and arousal/valence an-
notations. For the analysis in the present section, we train the A2Mid2ERF-ResNet
model using either DEAM (notated as “D”), or PMEmo (notated as “P”) or both
(notated as “P+D”) for the emotion labels, and the usual Mid-level Features
dataset for the mid-level feature labels. We focus the analysis on explanations of
“rhythm stability”, as this is relevant for Section 5.6 as well. The fidelity and com-
plexity measures are computed for explanations generated on held-out samples
from either the DEAM (D), PMEmo (P), or both (P+D) datasets.
We can see in Figure 5.7a that the fidelity score (coefficient of determination,
or R2 -score) is relatively high across all combinations of train and test sets. The
median score is 0.86 across all explanations (median taken across all samples
and all mid-level features). This means that for 50% of the explanations, more
than 86% of the variation in the dependent variable (mid-level prediction) can be
predicted using the independent variables (instrument stems). Next, we look at
complexity. Figure 5.7b shows the computed complexities, compared to a random
baseline. For all train and test set combinations, the majority of explanations are
significantly less complex than the random baseline.
For auralisation, since the input components are listenable by themselves, we do not require a separate spectrogram-inversion step to render the listenable waveforms. One can simply play and listen to the sound source (which was extracted from the original mix via the source separation algorithm) with the maximum weight. The quality of the resulting audio sources is dependent on the source separation algorithm. In the present case, Spleeter is able to separate the five sources (vocals,

(a) Fidelity (higher is better). (b) Complexity (lower is better).

Figure 5.7: Figure 5.7a shows the computed fidelity (coefficient of determination, or R2 -
score, between the predictions by the global model and the local model) scores
for the evaluated explanations. Figure 5.7b shows the complexity (entropy of a
distribution over the feature attribution weights, normalised by the entropy of
a uniform distribution) scores for the evaluated explanations. The green region
shows the standard deviation of complexities for 1000 random “explanations”,
with the black line being the mean. The tuples indicate the train set and the
test set, e.g. “(P+D, P)” means that the model was trained on the combined
PMEmo + DEAM dataset, and tested on held-out samples from the PMEmo
dataset.

piano, drums, bass, and other) resulting in five high quality stems for each
explanation.

5.6 model debugging: tracing back model bias to sound sources
In many practical applications, it is not easy to determine the quality of pre-
dictions made by a machine learning model. Even though models might not
make any obvious errors, they can propagate bias present in the data leading to
potentially discriminatory or unethical decisions downstream. Model debugging
and model fairness are emergent disciplines that aim at finding and fixing such
problems in machine learning systems [80, 131]. Bias correction and fairness in machine learning are becoming increasingly important areas of research [152].
What does this have to do with music emotion predictions? In today’s world
of music streaming, musicians depend on the streaming platform’s algorithms
to reach their audience. Music recommendation algorithms are critical in this
regard to the career and livelihood of musicians from around the globe. These
recommendation algorithms work on the basis of various audio-content-based
data, including predicted emotion. Therefore, all algorithms involved in this
pipeline must be up to the highest standards of bias and fairness quality control.
Recent research into new paradigms of music emotion recognition, such as
Gómez-Cañón et al. [74], highlights this point.
With this overarching motivation, we present here a brief exposition of a model
debugging procedure where we use the two-level explanations to understand
why an improperly trained model overestimates the valence predictions for one

particular genre, and how modifying the training data to balance that genre’s
representation leads to changes in the model predictions as well as model expla-
nations in a predictable fashion. To begin with, let us familiarise ourselves with the
datasets involved.

5.6.1 Datasets

As before, we have the Mid-level Features dataset [9] from which we obtain the
training audio and annotations for the mid-level part of our A2Mid2E model. For
the emotion part, we use the DEAM and PMEmo datasets, both of which contain
more samples than the Soundtracks dataset, which was used in Chapter 4 for
training the emotion part. The DEAM and PMEmo datasets contain audio and
ratings for arousal and valence.

• DEAM: Database for Emotional Analysis in Music: The DEAM dataset [10] is a
dataset of dynamic and static valence and arousal annotations. It contains
1,802 songs (58 full-length songs and 1,744 excerpts of 45 seconds) from
a variety of Western popular music genres (rock, pop, electronic, country,
jazz, etc). In our experiments, we use the static emotion annotations, which
are continuous values between 0 and 10.

• PMEmo: Popular Music with Emotional Annotation: The PMEmo dataset [184] consists of 794 chorus clips from three different well-known music charts. The songs were annotated by 457 annotators with valence and arousal, separately as dynamic and static annotations. In our experiments, we use the static labels, which are continuous values between 0 and 1.

5.6.2 Setup

Our experiment consists of training our models on the above two datasets. While
both datasets have arousal and valence annotations corresponding to audio clips,
the genre distributions of the two datasets are very different7 (see Figure 5.8,
which highlights the difference between the number of hiphop songs in the
datasets including the Mid-level Features dataset). Our aim is to check for bias
in a particular training scenario, and use explanations to verify change in model
behaviour upon changing the training scenario. The hypothesis is that a model
trained on the DEAM dataset, whose genre distribution is very different from the
PMEmo dataset, will exhibit some kind of bias when tested on the PMEmo dataset,
and the hope is that the mid-level based and sound-source based explanations
will help us understand biases in the musical components that ultimately lead
to the bias in the emotion predictions. For this experiment, the annotations from
both emotion datasets are scaled to be between 0 and 1, so that the annotations
from the two sources could be combined when required. The test set is a fixed
but randomly chosen subset of the PMEmo dataset.
7 To estimate the genre distribution of the datasets, we use a pre-trained music tagger model [149] to
predict genre tags for all the tracks in the three datasets (PMEmo, DEAM, and Mid-level Features
Dataset), since we do not have genre metadata for these datasets.

Figure 5.8: Compositions of datasets as fraction of songs tagged “hiphop” by a pre-trained auto-tagging model [149].

(Train-set, Test-set)   Architecture   Arousal R²   Valence R²   Mid-level R²   Mid-level r
(D, D)                  A2E            0.50         0.52         −              −
(D, D)                  A2Mid2E        0.54         0.54         0.32           0.57
(P, P)                  A2E            0.68         0.46         −              −
(P, P)                  A2Mid2E        0.68         0.40         0.33           0.58
(D, P)                  A2Mid2E        0.44         0.25         0.32           0.57
(P+D, P)                A2Mid2E        0.64         0.47         0.31           0.59
CoE(D, D)               −              −0.04        −0.02        −              −
CoE(P, P)               −              0.00         0.06         −              −

Table 5.1: Performance of explainable bottleneck models (A2Mid2E) compared with the end-to-end counterparts (A2E) on different train/test dataset scenarios. All models use the RF-ResNet backbone. For the A2Mid2E models, the Mid-level Features dataset is used to train the mid-level part. R² refers to the average coefficient of determination, and r refers to the average Pearson correlation coefficient. The cost of explanation (CoE) is also calculated: a positive cost means that the bottleneck model (A2Mid2E) performs worse compared to the end-to-end model (A2E), while a negative cost implies that the bottleneck model performs better compared to the end-to-end model.

First, we train the A2Mid2ERF-ResNet model on the two datasets and compare
the performances with the corresponding end-to-end A2ERF-ResNet models (first
four rows in Table 5.1) to establish a baseline and verify that both the mid-level
features and emotions are being learnt. We also compute the cost of explanation
(CoE) on the R2 -scores for these training scenarios. A positive cost of explanation
means that the bottleneck model (A2Mid2E) performs worse compared to the end-
to-end model (A2E). A lower cost of explanation is desired as that indicates that
the mid-level bottleneck does not adversely affect the actual emotion prediction
performance. We observe that for both dataset scenarios – (D, D) and (P, P) – the
CoE is low. In fact, for the (D, D) case, the bottleneck model actually improves
performance, resulting in negative CoEs for both arousal and valence8 .

8 This may be because training with additional data (for training the mid-level layer) improves the
overall model performance in this case, given that the DEAM dataset and the Mid-level Dataset
share a common source for a portion of data (see Appendix a)

Figure 5.9: Fraction of hiphop songs in quantiles vs the mean valence error of each
quantile over PMEmo dataset (with model trained on DEAM)

5.6.3 Overestimated Valence for Hiphop

When we take the A2Mid2E model trained only on DEAM and use it to predict
arousal and valence for the entire PMEmo dataset, we observe that the error in
valence shows a pattern – overestimations of valence primarily occur in hiphop
songs, as shown in Figure 5.9. We can reason about relatively poor performance
for hiphop songs based on the discrepancy between the training and testing sets
in terms of genre composition. In Figure 5.8, we can see that PMEmo has a large
percentage of hiphop songs whereas both DEAM and Mid-level datasets have
a small percentage. Since our model has not seen enough hiphop songs during
training, it is to be expected that it does not perform well when it encounters hiphop at test time. However, the pertinent question then is: what is it about hiphop songs that makes our model overestimate their valence?
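The analysis behind Figure 5.9 can be reproduced roughly as follows (a sketch assuming a pandas DataFrame with one row per PMEmo test song, a boolean is_hiphop column obtained from the pre-trained tagger, and a signed valence_error column, prediction minus label):

import pandas as pd

def hiphop_fraction_per_error_quantile(df, n_quantiles=10):
    df = df.copy()
    df["error_quantile"] = pd.qcut(df["valence_error"], q=n_quantiles,
                                   labels=False, duplicates="drop")
    # Fraction of hiphop songs and mean valence error within each error quantile.
    return df.groupby("error_quantile").agg(
        mean_valence_error=("valence_error", "mean"),
        hiphop_fraction=("is_hiphop", "mean"),
    )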

5.6.4 Explaining Valence Overestimations Using Mid-level Features

To answer this question, we first seek to understand which of the mid-level qualities contribute most to high valence predictions. This is the first level of our explanation system. We find the attribution (importance) of each mid-level feature for valence prediction by computing its effect on valence (the effect of a feature is the value of that feature multiplied by the weight of the linear connection between it and the target emotion; see Chapter 3). In our case, the target is valence and there are seven mid-level features that affect it. We are only interested in the relative contribution of each feature, and so we divide each effect by the sum of the absolute values of the effects of all features and take the average across all test songs tagged “hiphop”. The result of this procedure is plotted in Figure 5.10a. We observe that rhythm stability has the maximum positive relative effect on the prediction of valence. Therefore, we select rhythm stability for the next step of explanation.
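A minimal sketch of this first explanation level, with midlevel_preds an (n_songs × 7) array of predicted mid-level values for the hiphop test songs and valence_weights the weights of the linear valence output (both placeholder names):

import numpy as np

def relative_effects(midlevel_preds, valence_weights):
    effects = midlevel_preds * valence_weights                      # effect = value x weight
    effects = effects / np.abs(effects).sum(axis=1, keepdims=True)  # relative contribution per song
    return effects.mean(axis=0)                                     # average over the test songs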

(a) Trained on DEAM. (b) Trained on PMEmo+DEAM. (Bars show the relative contribution of melody, articulation, rhythm_complexity, rhythm_stability, dissonance, tonal_stability, and minorness.)

Figure 5.10: Relative effects of the mid-level features for valence prediction for two models trained on different datasets (only DEAM or DEAM+PMEmo), and tested on the same fixed subset of hiphop songs from the PMEmo dataset.

5.6.5 Explaining Rhythm Stability Using Sources

Once we have selected the mid-level feature with the most positive relative effect on valence, we would like to understand which musical constituents of the input can explain that mid-level feature. To do this, we use audioLIME and generate source-based explanations for rhythm stability. The sources available in the current implementation of audioLIME are vocals, drums, bass, piano, and other. From the PMEmo dataset, we take the songs tagged as “hiphop” with the top-50 valence errors, and compute the explaining source for rhythm stability. We do the same for songs tagged as “pop”. Looking at Figure 5.11a, we see that vocals are
a major contributing source for the rhythm stability predictions for the hiphop
songs. Compare this to the results for pop songs (Figure 5.11b), where drums are
(not surprisingly) the dominant contributing source of rhythm stability, although
vocals still seem to be important.

Figure 5.11: Explaining sources for rhythm stability in songs with top-50 valence errors, for songs tagged with (a) “hiphop”, and (b) “pop” (y-axis: number of samples; x-axis: explaining source, one of vocals, other, piano, drums, bass).

5.6.6 Re-training the Model with Target Data

Bringing together our two types of explanations, we can reason that the high valence predictions for hiphop songs are due to an overestimation of rhythm stability, which, in this case, can be attributed to the vocals. While there is a lot of diversity
in the style of rapping (the form of vocal delivery predominant in hiphop), it
has been noted that rappers typically use stressed syllables and vocal onsets to
match the vocals with the underlying rhythmic pulse [4, 138]. These rhythmic
characteristics of vocal delivery (that constitute “flow”, and may add metrical
layers on top of the beat) contribute strongly to the rhythmic feel of a song. The
positive or negative emotion of hiphop songs is mostly contained in the lyrics
– the style of vocal performance does not necessarily express or correlate with
this aspect of emotion. Therefore, it makes sense that a model which has seen few examples of hiphop during training should wrongly associate the prominent rhythmic vocals of hiphop with high rhythm stability and, in turn, high valence.
A model that has been trained with hiphop songs included, we expect, would
place less importance on rhythm stability for the prediction of valence, even if
the vocals might still contribute significantly to rhythm stability. Thus, we expect
the relative effect of rhythm stability for valence to decrease in such a model.
This is exactly what we observe on a model trained with the combined
PMEmo+DEAM dataset (remember that the PMEmo dataset contains a higher
proportion of hiphop songs). The average relative effects are shown in Figure 5.10b
and we can see that the relative effect of rhythm stability has decreased while
those of minorness, melody, and tonal stability have increased. Thus, the model
changed in a way that was in line with what we expected from the analysis of
our two-level explanation method.
Looking at mean overestimations (Figure 5.12) in valence for hiphop and other
genres for models trained on DEAM and PMEmo+DEAM shows that valence
overestimations of hiphop songs have decreased substantially, without changing
the valence overestimations on other genres. The overall test set performance
improves (as expected) for the model trained on the PMEmo+DEAM train set.

Figure 5.12: Mean valence overestimations for two models trained on different datasets,
but tested on the same fixed subset of the PMEmo dataset.

The model trained only on DEAM and tested on the PMEmo test set gives R2 -
scores of 0.44 for arousal and 0.25 for valence, while the model trained on the
PMEmo+DEAM combined train set gives R2 -scores of 0.64 for arousal and 0.47
for valence (see Table 5.1).

5.7 conclusion
In this chapter, we proposed a method to explain a mid-level feature model
using an interpretable decomposition of the input. In the context of this thesis,
this method is intended to be used in combination with the bottleneck-based
explanations of music emotion predictions (of Chapter 4), together making a
two-level hierarchical explanation pipeline. However, this method is generic, and
could possibly be used in a standalone fashion as well.
Considering the pipeline as a whole, explanations of music emotion predictions are first derived using mid-level features as explanatory variables. These mid-level predictions are then further explained using components from the actual input, selected
using a post-hoc explanation method. We use LIME and audioLIME to do this.
LIME (and audioLIME) trains an interpretable surrogate model using a dataset of
perturbed samples of the current sample to be explained. This surrogate model
approximates the local behaviour of the mid-level predictor in the neighbourhood
of that particular sample. LIME is used when the input is decomposed into spec-
trogram segments, and audioLIME is used when the input audio is decomposed
into instrument stems using a music source separation algorithm.
We also demonstrated a potential application of this method as a tool for model
debugging and verifying model behaviour. The explanations provided a way to
qualitatively verify expected change in model behaviour upon switching from an
unbalanced/skewed dataset to a more balanced one.
6 TRANSFER: MID-LEVEL FEATURES FOR PIANO MUSIC VIA DOMAIN ADAPTATION

6.1 The Domain Mismatch Problem
6.2 Domain Adaptation: What is it?
6.3 Visualising the Domain Shift
6.4 Bridging the Domain Gap
6.5 Experiments and Results
6.6 Conclusion

Much like language, where a written sentence may be uttered differently by


different people (influenced by intention, mood, context, individual character, etc.),
music too is often interpreted differently by different performers (and on different
occasions). In the Western classical music tradition, a notated piece or composition
is typically performed not merely as a literal acoustic rendering of the score1 ,
but rather, it is transformed by the performer’s own expressive performance
choices, relating to such dimensions as the choice of tempo, expressive tempo
and timing variations, dynamics, articulation, and so on. The emotional effect
of a performance on a listener can be a consequence both of the composition
itself, with its musical properties and structures, and of the performance, the
way the piece was played. Building computational models of expressive musical
performance has remained a topic of interest over the years [36, 176], and one of
the overarching research questions of this thesis is on building systems that can
capture emotion-related effects of these subtle performative variations between
different renditions of a set of piano pieces.
In the previous chapters, we have looked at learning mid-level features from
data and using these as interpretable representations to explain downstream
music emotion predictions. This system of predicting and explaining musical
emotion via musically meaningful features appears to be ideal for studying the
diversity in expressive performance of piano pieces. To be precise, two factors
make utilisation of mid-level features advantageous for our purpose – 1) these
features capture the variation of emotion in music sufficiently well (as we saw

1 Musical score, or sheet music, is a handwritten or printed form of musical notation that uses
musical symbols to indicate the pitches, rhythms, or chords of a song or instrumental musical
piece.


in Section 4.4, and will again see in Chapter 7), and 2) these features are few in
number and have intuitive musical relevance, making them easy to interpret.
However, since the mid-level feature model is trained from data, the training data distribution impacts its generalisation. We saw an example of this in Chapter 5, where a lack of examples from the hip-hop genre led to a music emotion model overestimating the valence of songs from this genre in the test set. What
can we expect when a mid-level model trained on the Mid-level Features dataset
[9] is used to predict the mid-level features for the solo piano music that we want
to study? How can we transfer a model from the training data domain to the solo
piano domain? We will answer these questions in the present chapter.
The rest of this chapter is broadly based on the following publication:

• S. Chowdhury and G. Widmer
  Towards Explaining Expressive Qualities in Piano Recordings: Transfer of Explanatory Features via Acoustic Domain Adaptation, In Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), Toronto, Canada

6.1 the domain mismatch problem


Expressive performance is an extremely important aspect of classical Western
music. Expert performers control and modulate various parameters like dynamics,
articulation, and timing, to transform a notated piece of music into an expressive
rendition, thereby moulding the perceived emotional character of the piece.
Building computational models of expressive performance has been a topic of
interest in the field of computational music, with motivations ranging from
studying musical performance, to generating automated or semi-automated
performances [36, 177].
In our case, we wish to narrow in on the question of capturing the musical emo-
tion expressed in classical piano performances by different pianists. Continuing
the direction we took in Chapter 4 of using mid-level features to model musical
emotion, we would eventually like to explore whether these features provide an
advantage in modelling the emotion of piano performances as well, particularly,
as we will see in the next chapter, focusing on capturing variations in emotion
between different piano performances of the same piece of music.
This is where a severe mismatch problem arises: there is no annotated ground
truth data available for training mid-level feature extractors in classical piano
music, and obtaining such data would be extremely cumbersome. At the same
time, recordings of solo piano music are very different, musically and acoustically,
from the kind of rock and pop music contained in the available Mid-level Features
Dataset [9]. For instance, looking at the genre distribution of this dataset in
Figure 6.1, we can see that less than 10% of the dataset consists of classical pieces,
and only a fraction of those are solo piano recordings. It is thus likely that a
model trained on this dataset will not generalise well to our piano recordings.
This problem is known as domain mismatch – a discrepancy between the kind of
data available for training a model and the data on which it should then operate.

Figure 6.1: Genre distribution of Mid-level Dataset according to genre tags predicted
using the pre-trained tagging model “musicnn” [149].

The work presented in this chapter is motivated by the need to reduce the
aforementioned domain discrepancy with the goal of ultimately enabling transfer
of the mid-level feature model to solo piano recordings. We employ several
techniques to achieve this. First, in Section 6.4.1 we see how the receptive-field
regularised model (previously encountered in Section 4.2.3) results in improved
generalisation compared to a VGG-ish model. Next, we use a domain adaptation
(DA) approach to reduce the discrepancy between the representations learned by
the model for the source and target domains, thus improving the performance on
the target domain. As we will see in Section 6.4.2, we need to use an unsupervised
domain adaptation approach, since we do not have a large set of labelled examples
of solo piano recordings to learn a supervised transfer scheme from. Finally, we
refine our domain adaptation pipeline by introducing an “ensembled self-training”
procedure, i.e., we use an ensemble of domain-adapted teacher models to train
a student model that performs better on the target domain than any of the
individual teacher models separately.
To put our domain adaptation pipeline to the test, we apply it to transfer a
mid-level model to audio from the Con Espressione Game2 , which is a part of a
large project3 aimed at studying the elusive concept of expressivity in music with
computational and, specifically, machine learning methods [177]. The data from
this game relates to personal descriptions of perceived expressive qualities in
performances of the same pieces by different pianists. Can the mid-level features
be used to learn and model such subjective descriptions of piano performances?
In Section 6.5.1, we find that a domain-adapted mid-level feature model indeed
improves in performance at the task of modelling perceived expressivity dimen-

2 http://con-espressione.cp.jku.at/short/
3 https://www.jku.at/en/institute-of-computational-perception/research/projects/con-espressione/

sions derived from abstract free-text descriptions of expressive qualities in piano music.
Let us begin by familiarising ourselves with the theoretical aspects of domain mismatch and domain adaptation, and looking at some relevant literature before we delve into our approaches.

6.2 domain adaptation: what is it?


In order to train a machine learning model using supervised learning, matched
example pairs of inputs and outputs are required. The ability of a computer
system to fit its parameters using data from such examples and generalise the fit
pattern to unseen data is what constitutes the notion of “learning”. However, if
the available example data is not an accurate reflection of the data on which the
model is ultimately intended to be used, then the system will not generalise well.
Standard classifiers and regression models cannot cope with such a change in
the distribution of data between the training and test phases. Domain adaptation
and transfer learning are sub-fields within machine learning that are concerned
with accounting for these types of changes [110]. The goal of transfer learning
and domain adaptation is to minimise the model error on the test data (referred
to as the target domain) even if the training data distribution is different (referred
to as the source domain). Let us now take a formal view at some of these terms
based on Ben-David et al. [22], Kouw and Loog [110], and Farahani et al. [59].
Definition 6.2.1 (Domain). A domain U is defined as the combination of an input space X, an output space Y, and an associated probability distribution p(x, y), such that U(x, y) = {(xi, yi)}_{i=1}^{n} ∼ p(x, y), where x ∈ X ⊂ R^d and y ∈ Y ⊂ R. In other words, inputs x are points in the d-dimensional real space R^d (sometimes referred to as feature vectors, or points in feature space), and outputs y are continuous values in the case of a regression task (or classes in the case of a classification task).
In our present context, we are dealing with a regression problem, where we
are trying to predict the values of the seven mid-level features. Note that since
we want to train and transfer a model that predicts mid-level features, these
“features” are the outputs (a single output mid-level feature is denoted here by
y), and the input features x ∈ R^d are the spectrogram patches, d being the total number of pixels in an input sample. The probability distribution p(x, y) can
be decomposed as p(x, y) = p(x) p(y|x) or p(x, y) = p(y) p(x|y), where p(·) is a
marginal distribution and p(·|·) is a conditional distribution. The domain induced
by the distribution of the training dataset is called the source domain (S ) and that
induced by the distribution of the test data is called the target domain (T ).
Domain shift can be categorised into three types: prior shift, covariate shift,
and concept shift [59, 110].
• Prior shift or class imbalance occurs when the posterior distributions ps(y|x) and pt(y|x) are equivalent and the prior distributions of the classes are different between domains, ps(y) ≠ pt(y). To solve this type of shift, we need labelled data from both source and target domains.

• Covariate shift refers to a situation where the posterior distributions ps(y|x) and pt(y|x) are equivalent, but the marginal distributions of the inputs across the two domains differ, i.e. ps(x) ≠ pt(x). It occurs most often when there is a sample selection bias. This is one of the most studied forms of domain shift, and most of the proposed domain adaptation techniques aim to solve this class of domain gap.

• Concept shift is a scenario where the marginal data distributions remain unchanged, ps(x) = pt(x), while the conditional distributions differ between domains, ps(y|x) ≠ pt(y|x).

What kind of domain shift can we expect in our mid-level feature learning
case? We will take a closer look at our mid-level feature data distribution in
Section 6.3 to answer this question, and we will see that the shift between piano
and non-piano subsets of the data can be considered as a covariate shift.
Most domain adaptation strategies typically aim at learning a model from the
source labelled data that can be generalised to a target domain by minimising
the difference between domain distributions. Given a source domain S(x, y) =
{(xi , yi )}i ∼ ps (x, y) and a target domain T (x, y) = {(x j , y j )} j ∼ pt (x, y), the
difference between the domain distributions could be measured by using one
of several metrics such as Kullback-Leibler divergence [112], Maximum Mean
Discrepancy (MMD) [77], and Wasserstein metric [115]. In unsupervised domain
adaptation where the labels are not available in the target domain, u is unknown.
The domain adaptation literature is quite rich and several approaches have
been proposed for both supervised and unsupervised scenarios for structured as
well as unstructured data. Giving an exhaustive overview of these methods is out
of scope for this thesis, however the interested reader is encouraged to refer to
some of the following works. Ben-David et al. [22] provides a good introduction to
the theoretical aspects of domain adaptation, which can be followed by Ben-David
et al. [24] and Ben-David et al. [23] as further reading. Zhang [185] provides a
comprehensive survey of unsupervised domain adaptation for visual recognition.
Domain generalisation and domain adaptation are both discussed in detail in
Wang et al. [172]. For domain adaptation on audio-related data, there exists some
work in the area of acoustic scene classification [1, 73], speech recognition [102],
and speaker verification [44]. For now, it might be worth delving slightly deeper
into one class of methods – adversarial domain adaptation for learning invariant
representations – since we will use this method for transferring our mid-level
model to the solo piano domain.

6.2.1 Learning Invariant Representations

A deep neural network is essentially a series of layers of feature extractors where


the final layer maps the features to the output. If we consider a transition point at
any layer in this pipeline, the part from the input to that point can be considered
a feature extractor module, and the part from that layer to the output can be
considered a regressor or classifier that maps the internal feature representations
to the output.

In other words, consider a deep neural network represented by h : X → Y, which is a composition of two partitions of the network h1 : X → Z and h2 : Z → Y, such that h = h2 ◦ h1. Here, h1 could be said to be the feature extractor
and h2 the regressor or classifier. In order for h2 to be robust to domain shift, the
learnt feature representation in Z (which is the input to h2 ) should have a similar
distribution regardless of which domain (source or target) the actual inputs are
coming from. This is the central idea behind learning invariant representations
for domain adaptation – we want our model to learn representations that are
insensitive to the input domain while still capturing rich information about both
the source and target domains needed for the actual task. Such a representation
would allow us to generalise to the target domain by only training with data from
the source domain. The formal analysis given below is derived primarily from
Zhao et al. [186] and Ben-David et al. [22].
Ben-David et al. [22] lays down a rigorous foundation in the form of the
following generalisation bound, that serves as a justification for this approach.
Let H be a hypothesis class (a set of possible models) and ps (x) and pt (x) be the
marginal distributions of the source and target domains, respectively. For any
h ∈ H, the following generalisation bound holds:

εT(h) ≤ εS(h) + d(ps, pt) + λ*        (6.1)

where λ* = inf_{h∈H} (εS(h) + εT(h)) is the optimal joint error achievable on both domains, and d(ps, pt) is a discrepancy measure between the source and target
distributions. In other words, the above equation states that the target risk is
bounded by three terms: the source risk (the first term), the distance between the
marginal data distributions of the source and target domains (the second term
in the bound), and the optimal joint error achievable on both domains (the third
term in the bound). The interpretation of the bound is as follows: if there exists a
hypothesis that works well on both domains, then in order to minimise the target
risk, one should choose a hypothesis that minimises the source risk while at the
same time aligning the source and target data distributions.
To measure the alignment between two domains S and T , it is crucial to
empirically compute the distance d(S , T ) between them. To this end, Ben-David
et al. [23] proposed several theoretical distance measures for distributions (H ∆H-
divergence, A-distance). The H ∆H-divergence can be estimated empirically by
computing the model-induced divergence. To do this, we calculate the distance
between the induced source and target data distributions on the representa-
tion space Z h1 formed by a model h1 . This allows one to estimate the domain
divergence from unlabelled data as well.
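One common empirical recipe along these lines is the so-called proxy A-distance: train a classifier to distinguish source from target representations and convert its error into a distance. The sketch below illustrates the idea on extracted embeddings; it is a generic illustration under these assumptions, not the exact procedure used in this thesis.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def proxy_a_distance(source_emb, target_emb):
    X = np.vstack([source_emb, target_emb])
    y = np.concatenate([np.zeros(len(source_emb)), np.ones(len(target_emb))])
    # Cross-validated accuracy of a domain classifier on the representations.
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    err = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * err)    # close to 2: domains are easily separable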
The upshot of Equation 6.1 is that in order to minimise the risk on the target
domain, we would like to learn a parametrised feature transformation h1 : X → Z
such that the induced source and target distributions (on Z ) are close, as measured
by the H-divergence4 , and at the same time, ensuring that the learnt feature
4 Zhao et al. [186] suggest that learning an invariant representation and achieving a small source
error is not enough to guarantee target generalisation in a classification domain adaptation task.
They propose additional bounds that translate to sufficient and necessary conditions for the success
of adaptation. However for our present context, we continue with this method since we achieve
successful domain adaptation in practice.

transformations are useful for the actual prediction task. The transformation h1 is called an invariant representation w.r.t. H if dH(D_S^h1, D_T^h1) = 0, where D_S^h1 and D_T^h1 are the induced source and target distributions respectively. Depending on the application, one may also seek to find a hypothesis (on the representation space Z^h1) that achieves a small empirical error on the source domain.

6.2.2 Domain-Adversarial Training

The above discussion suggests that for effective domain transfer to be achieved,
predictions must be made based on features that cannot discriminate between the
training (source) and test (target) domains. How can we force a model to learn
such domain invariant features? It turns out that inspiration can be taken from the dynamics of training Generative Adversarial Networks (GANs) [75].
GANs are deep learning based models that learn to generate realistic samples
of the data they are trained on by learning the distribution of the training data.
They are composed of two sub-networks – a generator G and a discriminator D –
that compete with each other in a two-player game. The objective of the generator
is to map a random input (“noise”) to a point in the underlying distribution of the
data while the objective of the discriminator is to predict whether the input to it
comes from the actual training data, or is generated by the generator. This training
setup creates an adversarial minimax training dynamic with the discriminator
being the generator’s “adversary”. One consequence of this adversarial training
is that the generator is forced to align the distribution of its output to that of the
underlying training data, thus maximising the discriminator’s error.
The success of adversarial learning as a powerful method of learning and
aligning distributions has motivated researchers to apply it in the context of
domain adaptation by using it to learn invariant representations. The idea is that
adversarial training can be used to minimise the distribution discrepancy between
the source and target domains to obtain transferable and domain invariant
features.
One specific method, proposed in Ganin and Lempitsky [71] and analysed
in detail in Ganin et al. [72], introduces an additional discriminator branch
to a neural network, which is connected to the main network via a gradient
reversal layer. The reversal layer essentially reverses the training signal from
the discriminator, hence forcing the part of the network feeding into it to learn
features that confuse the discriminator, while the discriminator itself gets better
at distinguishing between the two domains. As the training progresses, the
approach promotes the emergence of features that are (i) discriminative for the
main learning task on the source domain and (ii) indiscriminate with respect to
the shift between the domains. This technique does not require any labelled data
from the target domain and hence is an unsupervised domain adaptation method.
We will describe this method in more detail in Section 6.4.2.
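For concreteness, a minimal PyTorch sketch of such a gradient reversal layer is shown below: the forward pass is the identity, while the backward pass negates (and scales) the gradient, so that the feature extractor in front of it is trained to confuse the domain discriminator behind it. The lambda_ scaling factor and the usage snippet are illustrative.

import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)            # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x, lambda_=1.0):
    return GradReverse.apply(x, lambda_)

# Usage (schematically):
#   features = feature_extractor(spectrogram)
#   domain_logits = domain_discriminator(grad_reverse(features, lambda_))
#   midlevel_preds = regressor(features)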

(Histograms of the annotations per mid-level feature for the piano and non-piano subsets; KL(p||n) per feature: melodiousness 0.12, articulation 0.55, rhythm_complexity 0.15, rhythm_stability 0.53, dissonance 0.59, tonal_stability 0.07, minorness 0.10.)

Figure 6.2: Distribution of annotations in the Mid-level Features Dataset for two domains:
piano and non-piano.

6.3 visualising the domain shift


Having looked at the working principle of domain adaptation, let us turn back
to our original goal – how do we adapt a mid-level model trained on a dataset
consisting mostly of multi-instrumental popular music to solo piano music? To
answer this, we take a closer look at the data. Initially, it may not be immediately
obvious that a domain shift should exist between between the distributions of
the Mid-level Features dataset and that of solo piano music. After all, we have so
far only based our assumptions about the domain shift on our perception of the
sonic difference between the styles of music. Is this reflected in the data as well?
Moreover, if there indeed is a domain shift, what kind of shift is seen5 ?
First, although there are only a few solo piano samples in the Mid-level Features dataset, we partition the dataset into a set containing only solo piano clips, and a set
containing all other clips. The author manually listened to all the clips to make
this partition (the partition containing solo piano clips also serves as a test set for
the domain adaptation experiments later in this chapter). This resulted in a solo
piano subset of the Mid-level Features dataset with 194 clips.

6.3.1 Visualising Prior Shift

While we have a greatly unequal number of samples from the two subsets (only
194 in the piano set compared to 4806 in the non-piano set), it might still be
instructive to plot the label histograms for the mid-level features corresponding
to the two subsets. This will let us visualise the approximate prior shift (or label
shift, or target shift) present in the Mid-level Features dataset between piano
and non-piano instances. The label distributions are shown in Figure 6.2. The
Kullback–Leibler divergences of the piano set label distributions from the non-

5 We will only consider prior shift and covariate shift here.



piano set label distributions, KL(p||n), are also calculated. We can see that there is minimal distribution mismatch in the ratings of melodiousness, rhythm complexity, tonal stability, and minorness. Articulation, rhythm stability, and dissonance show more apparent distribution shifts; however, there is still substantial overlap between the distributions.
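The KL(p||n) values shown in Figure 6.2 can be computed along the following lines (a sketch: histogram the ratings of one feature for the piano and non-piano subsets on a common grid and compute the divergence between the smoothed histograms; the bin count and the value range, based on the 1-10 rating scale, are assumptions).

import numpy as np
from scipy.stats import entropy

def label_kl(piano_ratings, nonpiano_ratings, bins=20, value_range=(1, 10), eps=1e-9):
    p, _ = np.histogram(piano_ratings, bins=bins, range=value_range, density=True)
    n, _ = np.histogram(nonpiano_ratings, bins=bins, range=value_range, density=True)
    return entropy(p + eps, n + eps)   # KL(p || n)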

6.3.2 Visualising Covariate Shift

As a quick visual inspection of the input data, it is helpful to look at the spectro-
grams of samples picked at random from the piano and non-piano subsets of
the Mid-level Features dataset. Figure 6.3 shows this. We can see that the spec-
trograms coming from the piano subset are distinct from the Mid-level dataset.
Some of the features that are easily distinguishable visually are the absence of
high-frequency content and “vertical” lines (typically indicative of percussive
sounds) in the solo piano music, while having more stable “horizontal” lines
(typically indicative of pitched sounds).
It might be more useful to look at the input data distributions, as we did
for label distributions. Recall that covariate shift means that the marginal input
distributions of the two domains are different (ps (x) 6= pt (x)). However, note
that in our case, the dimensionality of the input space is huge (equal to the total
number of pixels in each spectrogram image). How do we plot and visualise
the marginal distributions in this case? One idea is to transform the input data
samples to obtain embeddings and then to project these embeddings on a two-
dimensional plane using a distance-based projection such as t-SNE [124]. To
do this, we train a mid-level feature model using the RF-ResNet model from
Chapter 4 on the source domain so that the model learns transformations relevant
to predicting mid-level features from spectrograms. Embeddings of size 512 from
the second-to-last layer of this model are then extracted for all the piano samples
and a random selection of the non-piano subset of the Mid-level Features dataset.
Samples taken from the MAESTRO dataset [85] (see Section a.4 in Appendix a
for a brief description of the dataset) that contains solo piano recordings are also
transformed in this way and combined with the embeddings from the Mid-level
dataset. This matrix of embeddings is then projected on a 2-D space using t-SNE,
shown in Figure 6.4.
Looking at the distribution of the embeddings projected with t-SNE validates
our suspicion that solo piano music indeed forms a cluster that is distinct and
shifted from the cluster formed by non-piano samples. This points to the presence
of covariate shift between piano and non-piano samples.
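A sketch of this projection, assuming embeddings is the (n_samples × 512) matrix extracted from the penultimate layer of the trained mid-level model and labels marks each sample as piano, non-piano, or MAESTRO:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_tsne(embeddings, labels):
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    # One scatter group per sample category (piano / non-piano / MAESTRO).
    for group in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=8, label=group)
    plt.legend()
    plt.show()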

6.4 bridging the domain gap


From the previous section, we can see that the piano and non-piano subsets
exhibit a covariate shift. We also noted label shift in some of the mid-level features
(articulation, rhythm stability, and dissonance). Seeing that the label shift is not
severe, and that we have access to only a small number of labelled instances of
piano music, we will not attempt to close the domain gap in label distributions.

Figure 6.3: Spectrograms for (a) non-piano recordings and (b) solo piano recordings in the Mid-level Features dataset, shown for visual inspection. The difference in spectrogram features for the two domains is apparent, for instance piano samples lack very high frequency content, percussive “vertical” spectral elements, and vocal formants.

Methods such as importance weighting for adapting to label shift require labelled
instances [110]. Instead, we will treat our present problem as a covariate shift
problem without target labels, and explore an unsupervised domain adaptation
approach.
We bridge the domain gap in three steps. In the first step, we verify the
importance of using regularisation to improve model generalisation. Models that
generalise well to minority classes could be expected to perform well for shifted

Figure 6.4: t-SNE plot of samples drawn from the piano and non-piano subsets of the Mid-level Features dataset, together with samples drawn from the MAESTRO dataset. All the samples are passed through a trained mid-level features model (trained on the source domain) and the embeddings of size 512 are extracted. The t-SNE is then applied to this matrix of embeddings.

domains (as long as the shift is not severe). Therefore, using such a model for any
subsequent domain adaptation strategies can potentially improve our results in
terms of target domain performance.
The next step is the main domain adaptation step. We choose an unsupervised
domain adaptation approach to reduce the sensitivity of our model to covariate
shift by learning a feature space invariant to domain shift.
The third step is a refinement of the unsupervised domain adaptation process to
further boost performance using a self-training method. These steps are described
in detail in the following sub-sections.

6.4.1 Step 1: Receptive-field Regularisation

The receptive-field regularised ResNet (RF-ResNet), used previously in Sec-


tion 4.2.3, was motivated by its improved performance in audio and music
tasks, which can be ascribed to its better generalisation to minority classes [109].

In Chapter 4 we have already seen better performance of this model for predicting
mid-level feature values, compared to a VGG-ish model. Here, we describe the
model in detail, and compare its performance with the VGG-ish model towards
improving performance on the target domain.
Residual Neural Networks (or “ResNets”) were introduced by He et al. [87]
to address the vanishing gradient problem of very deep neural networks. The
vanishing gradient problem refers to how the magnitudes of error gradients diminish
across layers when backpropagating during the training process, resulting
in progressively slower fitting of the parameters the further they are from the
output layer. ResNets overcome this problem by introducing skip connections
between layers through which the gradients can flow without getting diminished.
ResNets have shown great promise in large scale image recognition tasks.
However, previous research has shown that the performance of vanilla ResNets
on the audio domain [108] is poorer in comparison and thus, until recently, most
deep networks used in the audio domain were built on the VGG-ish architecture
[51]. One of the reasons for this, as identified by Koutini et al. [108], is that the
deeper a convolutional model is, the larger its receptive field on the input plane.
Receptive field refers to the total effective area of the input image that is “seen”
by the output neurons, and is affected by factors such as the filter size, stride,
and dilation of all the layers that precede the output. This is in contrast to a fully
connected architecture, where each neuron is affected by the whole input. The
maximum receptive field of a model employing convolutional layers is given by
the following equation:

S_n = S_{n-1} s_n
RF_n = RF_{n-1} + (k_n - 1) S_n        (6.2)

where s_n, k_n are the stride and kernel size, respectively, of layer n, and S_n, RF_n
are the cumulative stride and receptive field, respectively, of a unit from layer n
to the network input. While this gives us the maximum receptive field, it bounds
the effective receptive field (or what the output “actually sees”) which could be
lower and can be computed using the gradient-based method in [123]. Since
ResNets can be made much deeper than VGG-ish models without compromising
the training process, this also increases their maximum receptive field, resulting
in a greater possibility of overfitting, particularly when training data is in limited
quantity.
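As a small illustration, the following sketch evaluates Equation 6.2 for a stack of convolutional layers given their kernel sizes and strides; the helper name and the example layer configuration are hypothetical.

def max_receptive_field(kernel_sizes, strides):
    # Follows Equation 6.2 literally, with S_0 = RF_0 = 1 at the network input:
    # S_n = S_{n-1} * s_n,  RF_n = RF_{n-1} + (k_n - 1) * S_n
    S, RF = 1, 1
    for k, s in zip(kernel_sizes, strides):
        S = S * s
        RF = RF + (k - 1) * S
    return RF

# Example: four 3x3 convolution layers, the second and fourth with stride 2.
# print(max_receptive_field([3, 3, 3, 3], [1, 2, 1, 2]))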
Therefore, as a first step towards improving out-of-domain generalisation of
mid-level feature prediction, we evaluate the performance of a Receptive-Field
Regularised ResNet (RF-ResNet). Our baseline will be the performance of the
VGG-ish model from Chapter 4 for the same task. The regularisation of the
receptive field in the RF-ResNet is done through the following methods, as given
in Koutini et al. [108]:

• Changing filter sizes: We begin by taking a standard small ResNet model
(18-layer [87]) and reducing the kernel sizes of some of the layers from 3 × 3
to 1 × 1. It is more desirable to change the kernel sizes of the layers towards
the output than the ones towards the input, since we want to maintain the
inductive bias (weight sharing) of the convolutional architecture provided
by kernels of larger sizes. So we reduce the kernel sizes of the final layers
(a minimal illustration is sketched after this list).

• Making the architecture shallower: Reducing parameters in any way reduces
the variance of a model, and the same applies here. Reducing the depth
of the model achieves this while also reducing the maximum receptive field,
as is evident from Equation 6.2.

• Changing pooling layers: Max-pooling layers have a stride of 2, effectively
doubling the maximum receptive field for all layers that follow. Thus, we
can reduce the maximum receptive field by removing some max-pool layers.

The smaller receptive field of the RF-ResNet prevents overfitting and improves
generalisation, which is particularly useful when the training data is limited in
quantity. In Section 6.5, we will compare the performance of several RF-ResNet
models with different numbers of layers and different kernel sizes relative to a
basic VGG-ish model as well as to standard ResNet models.
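To make the first modification concrete, the sketch below shrinks the 3 × 3 convolutions in the last two stages of a standard torchvision ResNet-18 to 1 × 1. This is only an illustration of the principle on an off-the-shelf network, not the exact 12-layer RF-ResNet used in our experiments.

import torch.nn as nn
from torchvision.models import resnet18

def shrink_late_kernels(model: nn.Module) -> nn.Module:
    # Replace the 3x3 convolutions in the last two stages with 1x1 convolutions
    # (same channels and stride), which shrinks the maximum receptive field.
    for stage in (model.layer3, model.layer4):
        for block in stage:
            for name in ("conv1", "conv2"):
                old = getattr(block, name)
                setattr(block, name, nn.Conv2d(old.in_channels, old.out_channels,
                                               kernel_size=1, stride=old.stride, bias=False))
    return model

model = shrink_late_kernels(resnet18(num_classes=7))   # 7 mid-level feature outputs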

6.4.2 Step 2: Unsupervised Domain Adaptation (DA)

In this step, we move from domain generalisation to domain adaptation for our
specific target domain. Unsupervised adaptation for covariate shift has been
researched extensively in the machine learning and statistics literature. Going
through the whole host of possible methods [185] is beyond the scope of this
thesis, and thus we select the most practically suitable method for our use case,
based on preliminary trial experiments.
We adopt the reverse-gradient method introduced in Ganin and Lempitsky [71],
which achieves domain invariance by adversarially training a domain discriminator
attached to the network being adapted, using a gradient reversal layer. The
procedure requires a large unlabelled dataset of the target domain in addition to
the labelled source data. The discriminator tries to learn discriminative features of
the two domains but due to the gradient reversal layer between it and the feature
extracting part of the network, the model learns to extract domain-invariant
features from the inputs.
We now provide a brief formal description of this procedure, paraphrased
from Section 3 of Ganin and Lempitsky’s paper [71]. We have input samples
x ∈ X and corresponding outputs y ∈ Y coming from either of two domains:
S = {(x_i, y_i)}_i ∼ p_s(x, y), the source domain, and T = {(x_j, y_j)}_j ∼ p_t(x, y), the
target domain, both defined on X × Y. The distributions p_s and p_t are unknown
and are assumed to be similar but separated by a domain shift. Additionally, we
also have the domain label d_i for each input x_i, which is a binary variable that
indicates whether x_i comes from the source distribution (x_i ∼ p_s(x) if d_i = 0) or
from the target distribution (x_i ∼ p_t(x) if d_i = 1). During training, the ground
truth values (y_i) of samples coming from only the source dataset are known, while
the domain indicator values (d_i) of samples coming from both the source and
target datasets are known (by definition). During test time, we want to predict
the task values (mid-level features in our case) for the samples coming from the
target domain.

Figure 6.5: Unsupervised domain adaptation using reverse gradient method. Schematic
adapted from Ganin and Lempitsky [71].

The reverse-gradient architecture consists of three parts: a feature extractor G_f,
a task regressor G_y, and a domain classifier G_d. The feature extractor G_f maps
the inputs x to a D-dimensional feature vector z ∈ R^D on the representation
space and is parametrised by θ_f, i.e. z = G_f(x; θ_f). The task regressor G_y maps
the feature vector z to the task prediction y, and is parametrised by θ_y, i.e.
y = G_y(z; θ_y). Similarly, the domain classifier G_d maps the same feature vector z
to the domain label d and is parametrised by θ_d, i.e. d = G_d(z; θ_d).
During training, we wish to achieve two simultaneous objectives. First, we
aim to minimise the task prediction loss for the input samples coming from the
source domain. In this case, the parameters of both the feature extractor G_f and
the task predictor G_y are optimised, since this ensures that G_f learns to map
the inputs to features that are useful for the task at hand, and G_y is able to fit
the learned feature vectors to the annotations available for the source domain.
Second, we aim to make the features z domain-invariant, i.e. we want to make
the distributions D_S = {G_f(x; θ_f) | x ∼ p_s(x)} and D_T = {G_f(x; θ_f) | x ∼ p_t(x)}
similar. We can see that the greater the similarity between the feature vector
distributions of the source and target domains, the greater will be the loss of a
trained domain classifier. Therefore, one way to achieve domain-invariance is
to seek the parameters θ_f of the feature mapping that maximise the loss of the
domain classifier, while simultaneously seeking the parameters θ_d of the domain
classifier that minimise the loss of the domain classifier. The system thus forms an
adversarial training scheme.
Together with the task prediction loss, this procedure can be expressed as an
optimisation of the functional:

E(\theta_f, \theta_y, \theta_d) = \sum_{i=1,\, d_i=0}^{N} L_y^i(\theta_f, \theta_y) - \lambda \sum_{i=1}^{N} L_d^i(\theta_f, \theta_d)        (6.3)

where L_y^i is the loss function for label prediction and L_d^i is the loss function for
the domain classification, both evaluated at the i-th training example. We seek
the parameters θ̂_f, θ̂_y, θ̂_d given by the following:

(\hat{\theta}_f, \hat{\theta}_y) = \operatorname*{argmin}_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d)        (6.4)

\hat{\theta}_d = \operatorname*{argmax}_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)        (6.5)

It can be shown that in order to achieve this, θ_f needs to be updated with the
gradient ∂L_y^i/∂θ_f − λ ∂L_d^i/∂θ_f, while θ_y and θ_d are updated with their usual
gradients ∂L_y^i/∂θ_y and ∂L_d^i/∂θ_d, respectively.
How does one optimise the feature extractor with the task regressor and domain
classifier derivatives combined in this fashion? This is done by introducing a
parameter-free gradient reversal layer between the domain classifier and the feature
extractor that simply multiplies the gradient flowing from the domain classifier
by a negative factor (−λ) during the backward pass, resulting in the combined
derivative as required. The gradient reversal layer can be implemented easily in
any stochastic gradient descent framework. After training, only the task predictor
branch of the network is used to generate predictions for the test dataset. For
a more in-depth analysis and this method’s relation to the H ∆H-distance [22],
the reader is encouraged to refer to Ganin and Lempitsky [71]. A diagrammatic
representation of the architecture is given in Figure 6.5.
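A minimal sketch of such a gradient reversal layer, written here with PyTorch's autograd interface; the class and helper names are our own, illustrative choices, but the mechanism is the one described above.

import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; multiplies the gradient by -lambda in the backward pass.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient w.r.t. lambda

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage inside a forward pass (module names are assumptions):
# z = feature_extractor(x); y = task_regressor(z); d = domain_classifier(grad_reverse(z, lam))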

6.4.3 Step 3: Ensemble-based Self-Training (ST)

In the third and final step on our path to extract mid-level features from piano
music, we aim to further refine our (already) domain-adapted model using
a self-training scheme, aimed at reducing the variability of the models from the
previous step. To do this, we train multiple domain-adapted models using the
unsupervised DA method described in Step 2 and use these as teacher models to
assign pseudo-labels to an unlabelled piano dataset. Before the pseudo-labelling
step, we select the best performing teacher models with the validation set. Even
though the validation set contains data from the source domain, this step ensures
that models with relatively lower variance are used as teachers. This helps filter
out the particularly poorly adapted models from the previous step, which may
occur due to the inherently less stable nature of adversarial training methods [42].

Figure 6.6: Ensemble-based self-training. An ensemble of teacher models (domain
adapted using unsupervised domain adaptation) is selected, which is used to
pseudo-label an unlabelled dataset of piano audio recordings with predicted
mid-level features. The pseudo-labelled dataset (combined with the original
labelled dataset) is then used to train a student model.

After selecting a number of teacher models (in our experiments, we used four),
we label a randomly selected subset of our unlabelled dataset using predictions
aggregated by taking the average. This pseudo-labelled dataset is combined with
the original labelled source dataset to train the student model. We observed
that the performance on the test set, which comes from the target domain (the
experimental setup is explained in the next section), increased until the pseudo-
labelled dataset was about 10% of the labelled source dataset in size, after which
it saturated.
The teacher-student scheme allows the collective “knowledge” of an ensemble
of adapted networks to be distilled into a single student network. The idea of
knowledge distillation, which was originally introduced for model compression
in Hinton et al. [90], has been used for domain adaptation in a supervised setting
previously in Asami et al. [15]. The distillation process functions as a regulariser
resulting in a student model with better generalisability than any of the individual
teacher models alone. Additionally, it can be thought of as a stabilisation step
helping to filter out the adversarially adapted models that result from non-optimal
convergence. Figure 6.6 shows this scheme.
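A minimal sketch of the pseudo-labelling step, assuming the selected teacher models and the unlabelled piano clips are available as PyTorch modules and a data loader (all names are illustrative):

import torch

@torch.no_grad()
def pseudo_label(teachers, unlabeled_loader, device="cpu"):
    # Average the mid-level predictions of the selected teacher models
    # over batches of unlabelled piano spectrograms.
    for t in teachers:
        t.eval()
    labels = []
    for x in unlabeled_loader:
        x = x.to(device)
        preds = torch.stack([t(x) for t in teachers], dim=0)   # (n_teachers, batch, 7)
        labels.append(preds.mean(dim=0).cpu())
    return torch.cat(labels, dim=0)                            # (n_clips, 7) pseudo-labels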

6.5 experiments and results


The setup for our experiments is as follows. We wish to train a mid-level predictor
model using the Mid-level Features dataset and ultimately use it to extract mid-
level features from commercial recordings of classical solo piano performances (for
example recordings of Bach’s Well-Tempered Clavier, as we will see in Chapter 7).
The source domain here is the Mid-level Features dataset, and the target domain
is classical solo piano recordings. For domain adaptation, we use solo piano
recordings available in the MAESTRO dataset v2.0.0 [85] (see also Appendix a
Section a.4) as unlabelled target domain data. In order to test the performance
of our domain adapted models, we manually select the few recordings from the
Mid-level Features dataset that happen to be solo piano recordings, so that we
have the ground-truth mid-level feature values for this subset of recordings. The
(“piano”/target) subset of the Mid-level Features dataset thus obtained contains
194 samples, out of which we hold out 40% (79 samples) as the test set, and the
rest are added back to the Mid-level Dataset, which is then split into training
(90%), validation (2%) and (“non-piano”/source) test (8%) sets such that the
artists in these sets are mutually exclusive. The validation set is used to tune
hyperparameters and for early stopping.
The inputs to all our models are log-filtered spectrograms (149 bands) of 15-
second audio clips sampled at 22.05 kHz with a window size of 2048 samples
and a hop length of 704 samples, resulting in 149×469-sized tensors. We use
the Adam optimizer with a multi-step learning rate scheduler as our training
algorithm.
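A minimal sketch of this training configuration in PyTorch follows; the learning rate and scheduler milestones are illustrative assumptions, not the exact hyperparameters used in our experiments.

import torch

def make_optimizer(model):
    # Adam with a multi-step learning rate schedule, as described above.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 40], gamma=0.1)
    return optimizer, scheduler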

• In Step 1, we experiment with the modifications laid out in Section 6.4.1
and arrive at a 12-layer ResNet model with kernel size of 1 × 1 in stages 2
and 3, which gives the most promising results. From Figure 6.7 we see that
this architecture improves performance relative to the basic VGG-ish model
as well as relative to the standard ResNet models.

• In Step 2, the recordings from the MAESTRO dataset are split into 15-second
segments and a random subset with the number of samples equal to that in
the (“non-piano”/source) training set is sampled on each run. The model
is trained using the backpropagation method of domain adaptation as
described previously. Better convergence is obtained by gradually ramping
up, over 20 epochs, the amount of reversed gradient that passes to the
feature extractor from the discriminator branch.

• In Step 3, a random subset of 500 (10% of the size of the Mid-level Features
dataset) recordings is sampled from the MAESTRO dataset to be pseudo-
labelled by the teacher models trained in Step 2. We use 4 teacher models
to pseudo-label the unlabelled piano recordings (the predictions from the
teacher models are averaged for each recording and each mid-level feature).
The pseudo-labelled samples are then combined with the original source
dataset, and the final student model (RF-ResNet DA+ST) is trained with
this combined dataset.

Figure 6.7: Performance of different model architectures (solo piano test set, Pearson
correlation coefficient) against the number of parameters (×10^6). The tuples for the
RF-ResNet models are to be read as (number of layers, kernel size of stage 2 blocks,
kernel size of stage 3 blocks). We select the (12, 1, 1) variant for further domain
adaptation steps.

A summary of the results of the above steps is presented in Figure 6.8. We observe
that each of the steps results in an improvement in the performance on the
“piano” test set without compromising the performance on the “non-piano” one.
In Figure 6.9, the results for each mid-level feature are presented before and after
applying domain adaptation and self-training refinement to the RF-ResNet model.
To investigate our results further, we look at the discrepancy between the
source and target domains in the representation space, since the performance
of a model on the target domain is bounded by this discrepancy [22]. We use
the method given in Sun et al. [167] to compute the empirical distributional
discrepancy between domains for a trained model φ, which is given as D(S', T'; φ)
in Equation 6.6:

D(S', T'; \phi) = \Big\| \frac{1}{m} \sum_{x \in S'} \phi(x) - \frac{1}{n} \sum_{x \in T'} \phi(x) \Big\|_2        (6.6)

where S' is a population sample of size m from the source domain and T' is
a population sample of size n from the target domain. Figure 6.10 shows the
average discrepancy between the domains across training runs, where we see that
domain adaptation is able to achieve a lower discrepancy for the entire duration
of training, meaning that the model learns invariant feature transformations from
early on in the training process. Computing the final discrepancies for the VGG-ish
model and after the domain adaptation steps, we observe that the discrepancy
decreases with each step (Figure 6.11), justifying our three-step approach and
explaining the improvement in performance. In Figure 6.12, embeddings of piano
and non-piano samples from the representation space of a domain adapted model,
projected using t-SNE onto a 2-D plane, are shown. We can see that in this case,
samples from both domains are mapped to overlapping regions. Compare this to
the case without domain adaptation, shown earlier in Figure 6.4.

Figure 6.8: Summary of the performance of mid-level feature models (Pearson correlation
coefficient) on non-piano and piano test sets, with progressive steps of domain
generalisation (RF-ResNet), adaptation (DA), and self-training refinement (DA+ST).

Figure 6.9: Performances (Pearson correlation coefficient) for each mid-level feature,
compared for the piano and non-piano test sets, for a domain adapted and
refined RF-ResNet model (DA+ST) and an RF-ResNet model without any
domain adaptation (no DA).

Figure 6.10: Mean discrepancy between piano and non-piano domains over a training run
(averaged across multiple runs; shaded areas indicate standard deviation).
We see that for the entire duration of training, domain adaptation keeps the
discrepancy between the two domains lower than the run without domain
adaptation.
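For reference, Equation 6.6 amounts to the following computation on two matrices of embeddings; this is a minimal sketch with an assumed function name.

import numpy as np

def domain_discrepancy(emb_source: np.ndarray, emb_target: np.ndarray) -> float:
    # Equation 6.6: Euclidean distance between the mean embeddings of samples
    # drawn from the source (non-piano) and target (piano) domains.
    return float(np.linalg.norm(emb_source.mean(axis=0) - emb_target.mean(axis=0)))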

6.5.1 Modeling Performance Descriptors from the Con Espressione Dataset

Returning to our original motivation – to study subtle differences in emotional


character in (classical) solo piano music – we look at modelling emotional or

expressive character of piano recordings available from the Con Espressione Dataset
[37] using mid-level features predicted using our models. In the Con Espressione
Game, participants listened to extracts from recordings of selected solo piano
pieces (by composers such as Bach, Mozart, Beethoven, Schumann, Liszt, Brahms)
by a variety of different famous pianists (for details, see [37]) and were asked to
describe, in free-text format, the expressive character of each performance. Typical
characterisations that came up were adjectives like “cold”, “playful”, “dynamic”,
“passionate”, “gentle”, “romantic”, “mechanical”, “delicate”, etc. From these
textual descriptors, the authors obtained, by statistical analysis of the occurrence
matrix of the descriptors, four underlying continuous expressive dimensions
along which the performances can be placed. These are the (numeric) target
dimensions that we wish to predict via the route of mid-level features predicted
from the audio recordings.
We investigate whether our domain-adapted models can indeed predict better
mid-level features for modelling the expressive descriptor embeddings of the
Con Espressione dataset. We do this by predicting the average mid-level features
(averaged over the temporal axis) for each performance using our models and
training a simple linear regression model on these features to fit the four embedding
dimensions. Even though this is a very abstract task, for a variety of
reasons – the noisy and varied nature of the human descriptions; the weak nature
of the numeric dimensions gained from these; the complex and subjective nature
of expressive music performance – it can be seen (Table 6.1) that the features
predicted using domain-adapted models give comparatively better R2 -scores for
all four dimensions.
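A minimal sketch of this evaluation for one embedding dimension, using scikit-learn; the function and argument names are illustrative.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def dimension_r2(midlevel_means, dim_values):
    # midlevel_means: (n_performances, 7) temporally averaged mid-level predictions;
    # dim_values: (n_performances,) scores on one description embedding dimension.
    reg = LinearRegression().fit(midlevel_means, dim_values)
    return r2_score(dim_values, reg.predict(midlevel_means))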
In Table 6.2, we take a closer look at Dimension 1 – the one that came out
most clearly in the statistical analysis of the user responses and was characterized
by descriptions ranging from “hectic” and “agitated” on one end to “calm”
and “tender” on the other [37] (and also the dimension that is best predicted
by our models). Looking at the individual mid-level features, we find that, first
of all, the predicted features that show a strong correlation with this dimension
do indeed make sense: one would expect articulated ways of playing (e.g., with
strong staccato) and rhythmically complex or uneven playing to be associated
with an impression of musical agitation. What is more, after domain adaptation,
the set of explanatory features grows, now also including perceived dissonance as
a positive, and perceived melodiousness of playing as a negative factor – which
again makes musical sense and testifies to the potential of domain adaptation for
transferring explanatory acoustic and musical features.

Figure 6.11: Mean discrepancy between piano and non-piano domains for the different
domain adaptation steps (VGG-ish, RF-ResNet, RF-ResNet DA, RF-ResNet DA+ST).
Vertical bars indicate standard deviation over eight runs.

                    Dim 1   Dim 2   Dim 3   Dim 4   Mean
VGG-ish             0.36    0.38    0.22    0.07    0.25
RF-ResNet           0.45    0.24    0.24    0.11    0.26
RF-ResNet DA        0.44    0.37    0.22    0.12    0.29
RF-ResNet DA+ST     0.51    0.40    0.24    0.14    0.32

Table 6.1: Coefficient of determination (R²-score) of description embedding dimensions
of the Con Espressione game using a linear regressor trained on predicted
mid-level features.

RF-ResNet                        RF-ResNet DA+ST
Feature                 r        Feature                 r
articulation            0.47     melodiousness          −0.39
rhythmic complexity     0.41     articulation            0.46
                                 rhythmic complexity     0.41
                                 dissonance              0.40

Table 6.2: Pearson correlation (r) of mid-level features with the first description embedding
dimension, with (right) and without (left) domain adaptation. Features
with p < 0.05 and |r| > 0.20 are selected. This dimension has positive loadings
for words like “hectic”, “irregular”, and negative loadings for words like “sad”,
“gentle”, “tender”.

Figure 6.12: Embeddings of piano and non-piano samples from the representation space
of a domain adapted model, projected using t-SNE onto a 2-D plane.

6.6 conclusion

We began this chapter with the problem of domain shift between piano and
non-piano music recordings in the Mid-level Features dataset. We are interested in
closing this domain gap in our mid-level feature models because we ultimately
want to study emotional variation in piano performances, with the hope
of capturing subtle expressive differences between different performances, and in
the process disentangling the musical factors underlying such emotional variation.
In order to apply a mid-level feature model to solo piano music, it is necessary to
ensure that our model works as expected on solo piano music.

To this end, we presented a three-step approach to adapt mid-level models for
recordings of solo piano performances. First, we explored the RF-ResNet model,
previously used in Chapter 4, to improve model generalisation. Then, we
significantly improved the performance of these models on piano audio by performing
unsupervised domain adaptation followed by a self-training refinement scheme.
We also demonstrated improved prediction of meaningful perceptual features
corresponding to expressive dimensions.
We are now ready to delve into our goal of analysing expressive qualities and
emotions in piano recordings in a systematic manner. In the following chapter,
we will use a domain adapted mid-level model to extract mid-level features in
a set of recordings of Bach’s Well-Tempered Clavier Book 1, played by famous
pianists. We will use these features to predict arousal and valence across different
splits, and then compare the performance metrics with those obtained using
other feature sets. Through our experiments, we will test the different feature sets
on their capacity to model emotion across different pieces and across different
performances of the same piece. Domain adaptation will also be used to train an
end-to-end audio-to-emotion model for piano recordings, which will be used to
extract a baseline feature set.
7
DISENTANGLE: EMOTION IN EXPRESSIVE PIANO PERFORMANCE

7.1 The Data: Bach’s Well-Tempered Clavier
7.2 Feature Sets for Emotion Modelling
7.3 Feature Evaluation Experiments
7.4 Probing Further
7.5 Discussion and Conclusion

While music emotion recognition research has seen several advancements in
recent years [81], there has been little work on the problem of identifying
emotional aspects that are due to the actual performance, and even less on models that
can automatically recognise this from audio recordings. It has been convincingly
demonstrated by Akkermans et al. [8] and Gabrielsson and Juslin [68] that per-
formers are capable of communicating, with high accuracy, intended emotional
qualities by their playing. An even more striking finding is that different ver-
sions of the same piece can convey different emotions, with a high correlation
between the intended emotion behind each version and the perceived emotion. As
Gabrielsson [67] reports, in an experiment, professional performers playing violin,
saxophone, and singing were asked to perform a tune in order to make it sound
happy, sad, angry, fearful, solemn, tender, and without expression. Listeners,
asked to rate each version on each of the seven emotions, were consistently able
to identify most of the intended emotions with high significance values.
We explored in the previous chapter a method to adapt mid-level feature
extraction models to solo piano audio recordings. In this chapter, we focus our
attention on identifying emotional expression in solo piano performances, where
the domain adapted models are used to extract mid-level features from these
recordings. The most directly relevant prior work that concerns recognising
emotional expression in piano performances from audio recordings, that we
are aware of, is Grekow [76], where 324 6-second audio snippets of different
genres (classical, jazz, blues, metal, etc.) were annotated in terms of perceived
emotion (valence and arousal), and various regressors were trained to predict
these two dimensions from a set of standard audio features. The regression
models were then used to predict valence-arousal trajectories over 5 different
recordings of 4 Chopin pieces, but no ground truth in terms of human emotion
annotations was collected. The relevance of the model’s predictions was evaluated
only indirectly, by comparing similarity scores between predicted profiles with
overall performance similarity ratings by three human listeners, which showed
some non-negligible correlations.
In a recent focused study [20], Battcock & Schutz (referred to as “B&S” hence-
forth) investigate how three specific score-based cues (Mode, Pitch Height, and
Attack Rate1 ) work together to convey emotion in J.S.Bach’s preludes and fugues
collected in his Well-tempered Clavier (WTC). They used recordings of the com-
plete WTC Book 1 (48 pieces) of one famous pianist (Friedrich Gulda) as stimuli
for human listeners to rate each piece on perceived arousal and valence. Their
findings suggest that within this set of performances, arousal is significantly
correlated with attack rate and valence is affected by both the attack rate and the
mode. However, that study was based on only one set of performances, making
it impossible to decide whether the human emotion ratings used as ground
truth really reflect aspects of the compositions themselves, or whether they were
also (or even predominantly) affected by the specific (and, in some cases, rather
unconventional) way in which Friedrich Gulda plays the pieces – that is, whether
the emotion ratings reflect piece or performance aspects, or a combination of
both.
The purpose of the present chapter is to try to disentangle the possible contri-
butions and roles of different features in capturing composer-(piece-)specific and
performer-(recording-)specific aspects. To this end, we collected human ratings
of perceived valence and arousal in six complete sets of recordings of Bach’s
Well-Tempered Clavier Book 1, and then performed a systematic study with
feature sets derived from various levels of musical abstraction, including some
(e.g., our mid-level features) extracted by pre-trained deep neural networks.
The rest of this chapter is broadly based on the following publication:

• S. Chowdhury and G. Widmer
On Perceived Emotion in Expressive Piano Performance: Further Experimental
Evidence for the Relevance of Mid-level Features, In Proc. of the 22nd
International Society for Music Information Retrieval Conference (ISMIR 2021)

7.1 the data: bach’s Well-Tempered Clavier


The Well-Tempered Clavier (WTC) is a collection of musical pieces for the keyboard
composed by the German composer and musician Johann Sebastian Bach (1685
– 1750). The Clavier, in Bach’s time, referred to a variety of instruments, such
as the harpsichord, clavichord, or organ (however, in modern times, the pieces
are most often performed on the piano). The full collection consists of two sets
(books) of preludes and fugues2 in all 24 major and minor keys. In the Western
classical music tradition, this collection of music is widely regarded as one of the
most important works. We choose to use the first set (book I) in our experiments
to study the variation of emotional expression across different performances.

1 Actually, attack rate as computed by B&S is also informed by the average tempo of the performance;
thus, it is not strictly a score-only feature.
2 A Prelude is a short piece of music, typically serving as an introduction to succeeding and more
complex parts of the musical work. A fugue is a piece of music composed using two or more voices
playing a theme in counterpoint and recurring through the course of the composition.

7.1.1 Pieces and Recordings

The Well-Tempered Clavier is ideally suited for systematic and controlled studies
of this kind, as it comprises a stylistically coherent set of keyboard pieces from a
particular period, evenly distributed over all keys and major/minor modes, with
a pair of two pieces (a prelude, followed by a fugue) in each of the 24 possible
keys, for a total of 48 pieces. Each piece has its own distinctive musical character,
and despite being written in a rather strict style and not meant to be played in
‘romantic’ ways, the lack of composer-specified tempi offers pianists (and pianists
do take) lots of liberties in the choice of tempo. For example, there are pieces in
our set of recordings that one pianist plays more than twice (!) as fast as another.
This set of pieces is also not overt in its intended emotion, leading to performers
taking greater interpretative freedom in choosing ornamentation and distinctive
style [21].
For a broad set of diverse performances, we selected six recordings of the
complete WTC Book 1, by six famous and highly respected pianists, all of whom
can be considered Bach specialists to various degrees. The recordings are listed in
Table 7.1.

pianist               recording             year
Glenn Gould           Sony 88725412692      1962-1965
Friedrich Gulda       MPS 0300650MSW        1972
Angela Hewitt         Hyperion 44291/4      1997-1999
Sviatoslav Richter    RCA 82876623152       1970
András Schiff         ECM 4764827           2011
Rosalyn Tureck        DG 4633052            1952-1953

Table 7.1: Pianists and recordings.

7.1.2 Emotion Annotations and Pre-processing

In accordance with B&S, we only use the first 8 bars of each recording for
the annotation process and our experiments. These were cut out manually. We
collected the arousal and valence annotations for the 288 excerpts by recruiting
participants in a listening and rating exercise. The participants of our annotation
exercise were students of a course at a university, without a specifically musical
background. Each participant heard a subset of the recordings (all 48 pieces as
played by one pianist) and was asked to rate the valence on a scale of −5 to +5
(increments of 1; a total of eleven levels) and the arousal on a scale of 0 to 100
(increments of 10; a total of eleven levels). They could listen to a recording as
many times as they liked. Each recording was rated by 29 participants. In total,
we collected 8,352 valence-arousal annotation pairs.
For the purposes of the experiments to be described here, we take the mean
arousal and mean valence ratings for each recording, and these values serve as
our ground-truth values for all following experiments. The distributions (over the
6 performances) of these mean ratings for each piece are visualised as boxplots in
Figure 7.1. In Figure 7.2, the mean arousal and valence annotation for each of the
288 recordings is plotted on the arousal-valence plane.

Figure 7.1: Distribution of arousal and valence ratings for all 48 pieces. The spread is
across the 6 performances for each piece. For better comparison, the ratings
were standardised to zero mean and unit variance before plotting.

Figure 7.2: All annotations (mean values) for the 288 recording excerpts of the 48 pieces
in Bach’s Well-Tempered Clavier Book I. We observe a Pearson correlation
coefficient between the arousal and valence annotations of 0.55.

7.2 feature sets for emotion modelling


We will examine emotion modelling in our collection of performances of the Well-
Tempered Clavier using four sets of features, roughly reflecting four different
approaches to feature extraction for emotion recognition. The first set consists
of low-level features – features extracted using signal processing algorithms (see
also Section 4.1). These features have been used in traditional music information
retrieval for various audio content-based tasks, including emotion recognition;
however, most of these features were not originally developed specifically with
emotional relevance in mind [144]. The next are score-based features, derived

from the music notation and not from the audio content. These features are
thus independent of the performer. The third set of features are our mid-level
features, learnt using the deep architecture explained earlier in Section 4.2, and
domain adapted for piano music using the unsupervised domain adaptation and
self-training refinement of Chapter 6. To have a fair comparison with the deep
learning based mid-level features, for the fourth set of features, we use features
extracted from an identical deep model trained end-to-end on the DEAM [10]
dataset to predict arousal and valence. Details of the four feature sets follow.

7.2.1 Low-level Features

These consist of hand-crafted musical features (such as onset rate, tempo, pitch
salience) as well as generic audio descriptors (such as spectral centroid, loudness).
Taken together, they reflect several musical characteristics such as tone colour,
dynamics, and rhythm. A brief description of all low-level features that we use
is given in Table 7.2. We use Essentia [32] and Librosa [129] for extracting these.
The audio is sampled at 44.1kHz and the spectra computed (when required)
with a frame size of 1024 samples and a hop size of 512 samples. Each feature is
aggregated over the entire duration of an audio clip by computing the mean and
standard deviation over all the frames of the clip (a ‘clip’ being an 8-bar initial
segment from a recording).

Feature Name           Feature Description
Dissonance             Total harmonic dissonance computed from pairwise dissonance
                       of all spectral peaks.
Dynamic Complexity     The average absolute deviation from the global loudness level
                       estimate in dB.
Loudness               Mean loudness of the signal computed from the signal amplitude.
Onset Rate             Number of onsets (note beginnings or transients) per second.
Pitch Salience         A measure of tone sensation, computed from the harmonic
                       content of the signal.
Spectral Centroid      The weighted mean frequency in the signal, with frequency
                       magnitudes as the weights.
Spectral Flatness      A measure to quantify how noise-like a sound is, as opposed
                       to being tone-like.
Spectral Bandwidth     The second-order bandwidth of the spectrum.
Spectral Rolloff       The frequency under which 85% of the total energy of the
                       spectrum is contained.
Spectral Complexity    The number of peaks in the input spectrum.
Tempo (BPM)            Tempo estimate from audio in beats per minute.

Table 7.2: Low-level Features
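As an illustration of this aggregation, the sketch below computes a subset of the spectral features from Table 7.2 with Librosa and summarises each as mean and standard deviation; it is a simplified stand-in for the actual Essentia/Librosa extraction pipeline, not the exact feature code used here.

import librosa
import numpy as np

def lowlevel_stats(path: str):
    # Frame-wise spectral features with the analysis settings described above,
    # aggregated over the clip as (mean, standard deviation).
    y, sr = librosa.load(path, sr=44100)
    feats = {
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=1024, hop_length=512),
        "spectral_flatness": librosa.feature.spectral_flatness(y=y, n_fft=1024, hop_length=512),
        "spectral_bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=1024, hop_length=512),
        "spectral_rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85,
                                                             n_fft=1024, hop_length=512),
    }
    return {name: (float(v.mean()), float(v.std())) for name, v in feats.items()}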

7.2.2 Score Features

The following set of features was computed directly from the musical score (i.e.,
sheet music) of the pieces instead of the audio files. The unit of score time, “beat”,
is defined by the time signature of the piece (e.g., 4/4 means that there are 4 beats
of duration 1 quarter in a bar). The score information and the audio files were
linked using automatic score-to-performance alignment. Table 7.3 describes the
score features in detail.

Feature Name           Feature Description
Inter Onset Interval   The time interval between consecutive notes per beat.
Duration               Two features describing the empirical mean and standard deviation
                       of the notated duration per beat in the snippet.
Onset Density          The number of note onsets per beat. A chord constitutes a single
                       onset.
Pitch Density          The number of unique notes per beat.
Mode                   Binary feature denoting major/minor modality, computed using
                       the Krumhansl-Schmuckler key finding algorithm [111] (to
                       reflect the fact that the dominant key over the segment may be
                       different from the given key signature).
Key Strength           This feature represents how well the tonality given by the
                       “Mode” feature fits the snippet.

Table 7.3: Score Features

7.2.3 Mid-level Features

As described in Chapter 4, we learn the seven mid-level features from the Mid-
level Dataset [9] using a receptive-field regularised residual neural network
(RF-ResNet) model [108]. Since we intend to use this model to extract features
from solo piano recordings (a genre that is not covered by the original training
data), we use the domain-adaptive training approach described in Chapter 6 to
transfer the features to the domain of solo piano recordings. We use an input
audio length of 30 seconds, padded or cropped as required.

7.2.4 DEAMResNet Emotion Features

To compare the mid-level features with another deep neural network based
feature extractor, we train a model with the same architecture (RF-ResNet) and
training strategy on the DEAM dataset [10] to predict arousal and valence from
spectrogram inputs. Since this model is trained to predict arousal and valence, it
is expected to learn representations suitable for this task. As with the mid-level
model, we perform unsupervised domain adaptation for solo piano audio while
training this model as well.
Features are extracted from the penultimate layer of the model, which gives us
512 features. Since these are too many features to use for our dataset containing
only 288 data points, we perform dimensionality reduction using PCA (Principal
Component Analysis), to obtain 9 components explaining at least 98% of the
variance. These 9 features are named pca_x, with x being the principal component
number.
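A minimal sketch of this reduction step with scikit-learn, assuming the 512-dimensional embeddings are available as a matrix:

from sklearn.decomposition import PCA

def reduce_embeddings(embeddings):
    # Keep as many principal components as needed to explain at least 98% of the variance.
    pca = PCA(n_components=0.98)
    return pca.fit_transform(embeddings)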

7.3 feature evaluation experiments


In this section, we evaluate the four feature sets. We wish to answer the following
questions:

1. How well can each feature set fit the arousal and valence ratings? How
do these feature sets compare to the ones used by B&S? (Section 7.3.1 and
Section 7.3.2)

2. In each feature set, which features are the most important? (Section 7.3.3)

3. Which feature set best explains variation of arousal and valence between
pieces? (Section 7.3.4)

4. Which feature set best explains variation of arousal and valence between
different performances of the same piece? (Section 7.3.5)

In order to evaluate the feature sets on their emotion modelling capacity, we
use ordinary least squares fitting of the emotion annotations with the four feature
sets and calculate the regression metrics. The metrics we report are the adjusted
coefficient of determination (R̃²), the root mean squared error between true and
predicted values (RMSE), and the Pearson correlation coefficient between true and
predicted values (r).
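For reference, these three metrics can be computed as follows (a minimal sketch; the adjusted R² applies the usual correction for the number of predictors):

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score, mean_squared_error

def regression_metrics(y_true, y_pred, n_features):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)      # adjusted R^2
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r, _ = pearsonr(y_true, y_pred)
    return r2_adj, rmse, r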

7.3.1 Evaluation on B&S Data

As a starting point, we take the data used by B&S in Experiment 3 of their paper
– Gulda’s performances rated on valence and arousal. We perform regression
with our feature sets and compare with the values obtained by B&S using their
features Attack Rate, Pitch Height, and Mode. The results are summarised in
Table 7.4.
We can see that all three audio-based features (Low-level features, Mid-level
features, and DEAMResNet features) perform sufficiently well for both arousal
and valence to motivate further analysis.

                 Arousal                  Valence
              R̃²    RMSE    r         R̃²    RMSE    r
Mid-level     0.84   0.36   0.93      0.79   0.42   0.91
DEAMResNet    0.91   0.27   0.96      0.69   0.50   0.86
Low-level     0.86   0.29   0.96      0.67   0.45   0.89
Score         0.31   0.74   0.67      0.61   0.55   0.83
B&S (exp 3)   0.48   −      −         0.75   −      −

Table 7.4: Regression on Gulda data from B&S [20]

7.3.2 Evaluation on Our Dataset

Next, we perform regression on our complete dataset (comprising 288 unique
recordings: 48 pieces × 6 pianists). The results summary can be seen in Table 7.5a.
Here we observe that while DEAMResNet Emotion features perform best on
arousal and Score features perform best on valence, Mid-level features show a
balanced performance across both the emotion dimensions.
To evaluate generalisability, we perform cross-validation with three different
kinds of splits – piece-wise (all 6 performances of a piece are test samples in a
fold, for a total of 48 folds), pianist-wise (all 48 pieces of a pianist are test samples
7.3 feature evaluation experiments 109

Arousal Valence
Feature Set R̃2 RMSE r R̃2 RMSE r

Mid-level 0.68 0.56 0.83 0.63 0.60 0.80


DEAMResNet 0.70 0.54 0.84 0.42 0.72 0.69
Low-level 0.62 0.59 0.81 0.41 0.74 0.67
Score 0.41 0.75 0.65 0.75 0.49 0.87

(a) Regression metrics. Modelling the emotion ratings of all 288 excerpts using each feature set.

Piece-wise Pianist-wise Leave-one-out


Feature Set Arousal Valence Arousal Valence Arousal Valence

Mid-level 0.68 0.63 0.68 0.64 0.69 0.65


DEAMResNet 0.67 0.37 0.61 0.41 0.68 0.43
Low-level 0.54 0.20 −0.11 −0.05 0.57 0.30
Score 0.08 0.67 0.39 0.75 0.37 0.74

(b) Cross-validation metrics. R̃2 for different cross-validation splits, with a linear regression model
using each feature set.

Table 7.5: Evaluation using goodness of fit measures of the four feature sets on our data,
on the full dataset (a), or via cross-validation (b). Refer to the description in
Section 7.3.2.

in a fold, for a total of 6 folds), and leave-one-out (one recording is the test sample
per fold, for a total of 288 folds). This is summarised in Table 7.5b.
We see that Mid-level features show good generalisation for arousal and are
robust to different kinds of splits. They also show balanced performance between
arousal and valence for all splits. The good performance of the Score features on
the valence dimension (V), here and in the previous experiment, is mostly due
to the Mode feature; there is a substantial correlation in the annotations between
major/minor mode and positive/negative valence.
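A minimal sketch of the piece-wise split using scikit-learn's grouped cross-validation; it reports plain R² per fold, whereas the tables report adjusted R², and the function name is illustrative.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def piecewise_cv(X, y, piece_ids):
    # Piece-wise split: all six performances of a piece form one held-out fold.
    cv = LeaveOneGroupOut()
    return cross_val_score(LinearRegression(), X, y, groups=piece_ids, cv=cv, scoring="r2")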

7.3.3 Feature Importance within Feature Sets

Recall from Section 3.2 that one way to measure feature importance is to use the
t-value (or the t-statistic) of the weights corresponding to the features. The
t-statistic is defined as the estimated weight divided by its standard error:

t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}        (7.1)

We focus on the audio-based feature sets here, as in most realistic application
scenarios, the score information will not be available (and, being constant across
different performances, will not be able to distinguish performance aspects). We
perform a regression using all audio-based features (numbering 39 in total) and
compare the t-values in Figure 7.3.
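A minimal sketch of how these t-values can be obtained with statsmodels, assuming X holds the 39 audio-based features and y one emotion dimension:

import statsmodels.api as sm

def feature_tvalues(X, y):
    # Ordinary least squares fit; the t-value of each weight is the estimate
    # divided by its standard error (Equation 7.1).
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    return fit.tvalues, fit.pvalues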
Figure 7.3: Feature importance for audio features using the t-statistic (absolute value),
shown separately for (a) arousal and (b) valence. Only features with p < 0.05
are shown.

We see that the top-4 and top-2 features in arousal and valence, respectively, are
mid-level features. These features also make obvious musical sense – modality is
often correlated with valence (positive or negative emotional quality), and rhythm
and articulation with arousal (intensity or energy of the emotion).

7.3.4 Modelling Piece-wise Variation

Taking a closer look at Figure 7.1, we notice (as we expect) that each piece has a
distinct emotional character of its own – in terms of its arousal and valence – which gets
modified by performers, leading to the spread we see in the arousal and valence
across performers. We can see that the spreads (variances) are not large enough
to rule out the apparent effect that “piece id”, considered as a variable, has on
the emotion of a recording. In other words, the emotion of a recording is not
independent of the piece id. The linear mixed effects model [147] is normally used
for such non-independent or multi-category data.
Mixed effect models incorporate two kinds of effects for modelling the depen-
dent variable. These effects are called fixed and random effects. Fixed effects are
those variables that represent the overall population-level trends of the data. The
fixed effect parameters of the model do not change across experiments or groups.
In contrast, random effects parameters change according to some grouping factor
(e.g. participants or items). Random effects are clusters of dependent data points
in which the component observations come from the same higher-level group
(e.g., an individual participant or item) and are included in mixed-effects models
to account for the fact that the behaviour of particular participants or items may
differ from the average trend [35].
In our case of modelling piece-wise emotion variation, the linear mixed effect
model consists of the piece id as a random effect intercept, which models part
of the residual remaining unexplained by the features we are evaluating (fixed
effects). A feature set that models piece-wise variation better than another set
would naturally have a lesser residual variation to be explained by the random
effect. We therefore look at which feature set has the least fraction of residual
variance explained by the random effect of piece id, defined as:

E_{random} = \frac{Var_{random}}{Var_{random} + Var_{residual}}        (7.2)

where Var_{random} is the variance of the random effect intercept and Var_{residual} is
the variance of the residual that remains after mixed effects modelling.

                  E_random
Feature Set    Arousal   Valence
Mid-level       0.50      0.86
DEAMResNet      0.47      0.89
Low-level       0.66      0.90
Score           0.63      0.68

Table 7.6: Explaining piece-wise variation using the four feature sets. The fraction of
residual variance explained by the random effect of “piece id” (defined in
Section 7.3.4) is reported here. Lower means better explained.
We see from Table 7.6 that the DEAMResNet emotion features best explain
piece-wise variation in arousal, followed closely by Mid-level features. For valence,
the performance of all three audio-based feature sets is close, with Mid-level features
performing the best; however, score features outperform them by a large margin.
This is again due to the relationship between mode and valence, and mode co-
varying tightly with the piece ids.
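A minimal sketch of this mixed-effects analysis with statsmodels, assuming a data frame with one row per recording, the feature columns, an emotion column, and a piece_id column (all column and function names are assumptions):

import statsmodels.formula.api as smf

def random_effect_fraction(df, feature_cols, target="arousal", group="piece_id"):
    # Linear mixed model with the features as fixed effects and a random intercept
    # per piece; returns the fraction of residual variance captured by it (Eq. 7.2).
    formula = f"{target} ~ " + " + ".join(feature_cols)
    fit = smf.mixedlm(formula, df, groups=df[group]).fit()
    var_random = float(fit.cov_re.iloc[0, 0])   # variance of the random intercept
    var_resid = float(fit.scale)                # residual variance
    return var_random / (var_random + var_resid)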

7.3.5 Modelling Performance-wise Variation

Evaluation of performance-wise variation modelling cannot be done with the
mixed effects approach as in the previous section because the means (of arousal
or valence across all pieces) are nearly identical for each pianist.
Therefore, we look at one piece at a time and compute the fraction of variance
unexplained (FVU) and Pearson’s correlation coefficient (r) between predicted
and true values across performances for each such piece, held-out during training
(see Table 7.7). This is done as leave-one-piece-out cross-validation, and aggre-
gated by taking the means across all cross-validation folds. The p-values of the
correlation coefficients are counted and we report the percentage of pieces for
which p < 0.1. With only 6 performances per piece, a significance level of p < 0.05
is obtained for only a handful of pieces. Since score-features based predictions are
exactly equal for all performances of a piece, these metrics are not meaningful,
and hence the Score feature set is not included here.

                     Arousal                     Valence
Feature Set     FVU    r (n_{p<0.1})        FVU    r (n_{p<0.1})
Mid-level       0.31   0.58 (47.9%)         0.36   0.42 (27.0%)
DEAMResNet      0.32   0.54 (43.8%)         0.61   0.47 (37.5%)
Low-level       0.43   0.56 (54.2%)         0.75   0.38 (22.9%)

Table 7.7: Explaining performance-wise variation using the three audio-based feature
sets. FVU: Fraction of Variance Unexplained. r: Pearson correlation coefficient.
Again, Mid-level features come out at the top in most measures. To illustrate
the modelling of performance-wise variation, we select a few example pieces that
have a high variation of emotion between performances and plot them together
with the predicted values using mid-level features in Figure 7.4. The predicted
emotion dimensions follow the ratings closely, even for performances that deviate
from the average (e.g. the arousals of Gulda’s performance of Prelude in A major
and Tureck’s performance of Fugue in E minor).

Figure 7.4: Some example pieces with high emotion variability between performances,
which are modelled particularly well using mid-level features: arousal for
Praeludium 19 in A major and Fuga 10 in E minor, and valence for Praeludium 14
in F# minor and Fuga 18 in G# minor, shown across the six pianists (annotations
versus predictions, with RMSE).

7.4 probing further


We now describe two additional experiments designed to further explore the
predictive power of the feature sets.
Figure 7.5: Two examples of outlier performances: Prelude #7 in Eb major, where the
outlier is Gould (left); and Prelude #2 in C minor, where the outlier is Gulda (right).

7.4.1 Predicting Emotion for Outlier Performances

Figure 7.5 shows two examples of pieces where one performance has a vastly
different emotional character than the others – in the first example, Gould even
produces a negative valence effect (mostly through tempo and articulation) in
the E-flat major prelude, which the others play in a much more flowing fash-
ion. A challenge for any model would thus be to predict the emotion of such
idiosyncratic performances, not having seen them during training.
We therefore create a test set by picking out the outlier performance for each
piece in arousal-valence space using the elliptic envelope method [158]. This gives
us a split of 240 training and 48 test samples (the outliers). We train a linear
regression model using each of our feature sets and report the performance on
the outlier test set in Figure 7.6. We see again that Mid-level features outperform
the others, for both emotion dimensions. We take this as another piece of evidence
for the ability of the mid-level features to capture performance-specific aspects.
The surprisingly good performance of score features for valence can be attributed
to the fact that for most pieces, the outlier points are separated mostly in the
arousal dimension – the spread of valence is rather small (though not always,
see the Gould case in Figure 7.5) – and the score feature “mode” is an important
predictor of valence (see earlier sections).
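A minimal per-piece sketch of this outlier selection with scikit-learn's EllipticEnvelope; with only six points per piece the fit is fragile, so this is an illustration of the idea rather than the exact procedure used.

import numpy as np
from sklearn.covariance import EllipticEnvelope

def outlier_index(av_points: np.ndarray) -> int:
    # av_points: (6, 2) mean arousal-valence of the six performances of one piece.
    # Fit a robust Gaussian envelope and return the index of the most atypical point.
    env = EllipticEnvelope(contamination=1 / 6, support_fraction=1.0).fit(av_points)
    return int(np.argmin(env.decision_function(av_points)))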

7.4.2 Expressive Performance Modelling in the Con Espressione Dataset

We encountered the Con Espressione Dataset briefly in Chapter 6, Section 6.5.1,
where we used the performance description dimensions computed from the
dataset to test domain adaptation of mid-level features. This dataset, although
small, is useful for demonstrating some of the results from this chapter as well
since it consists of piano pieces performed by different pianists. Let us recall
details about the dataset and see how we can demonstrate mid-level based
emotion prediction on this dataset.

Figure 7.6: Evaluation of the feature sets (adjusted R² score for arousal and valence) on a
test set made up entirely of “outlier” performances – those that are maximally
distant from others in each set of 6 performances for the 48 pieces.
emotion prediction on this dataset.

Dataset Description

The Con Espressione dataset3 is a collection of excerpts from 9 classical piano
pieces performed by different well-regarded pianists, making a total of 45
performances. Through a Web-based questionnaire, the “Con Espressione Game”
(“CEG”), free text descriptions of the performances were collected. The participants
responding to the CEG listened to the performances and were asked to
enter free text descriptions, preferably using adjectives, as many as they liked,
concentrating on the performative aspects and not on the piece itself. The target
phenomenon thus is what we would call the expressive character of performances.

3 See Appendix a for more details on this dataset.
In order to extract useful information from the rather unstructured space of
free text descriptions, we use the approach described in Cancino-Chacón et al.
[37]. Noting that characterization of expressive performance is not a common
case in natural language processing (NLP), and that the meanings of many terms
in the context of expressive performance are slightly different from their common
usage, Cancino-Chacón et al. [37] opt for a more basic approach – computing
the principal component analysis (PCA) on the occurrence matrix of the text
descriptions. The occurrence matrix in this case is simply a sparse matrix with
the columns representing all words in the vocabulary of the dataset, the rows
representing the responses, and the entries representing the count of each word
present in a response. Using this, we compute 4 principal dimensions.
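As an illustration of this step (the toy `responses` list below is a placeholder for the pre-processed CEG answers, not actual data):

```python
# Sketch of the occurrence matrix + PCA step; the toy `responses` list stands in
# for the pre-processed free-text answers of the Con Espressione Game.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

responses = [
    "soft delicate joyful pensive",
    "calm modest pure",
    "fast hectic unrelaxed",
    "dry playful rushed clean",
    "mechanical flawed stiff undifferentiated",
]

# sparse word-occurrence matrix: rows = responses, columns = word counts
occurrence = CountVectorizer().fit_transform(responses).toarray()

# project responses onto the principal dimensions (4 in the actual analysis)
coords = PCA(n_components=4).fit_transform(occurrence)
print(coords.shape)  # (n_responses, 4)
```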

Emotional Diversity in the Dataset Visualised in Arousal-Valence Space

As the first demonstration of the mid-level based emotion prediction for piano
performances, we train a multiple linear regression model using the Bach WTC
3 See Appendix a for more details on this dataset.


Figure 7.7: Predicted emotions using mid-level feature predictions for the performances
from the Con Espressione Dataset (Section 7.4.2). Performances of the same
piece are marked with the same colour.

dataset described previously in this chapter (with mid-level features as inputs and
arousal-valence as outputs), and predict the emotion values for the performances
of the Con Espressione Dataset. The predictions are then visualised on the arousal-
valence plane (Figure 7.7), showing the expressive diversity in the performances.
An interesting observation in this plot is Glenn Gould’s performance of the
Mozart piece. Gould is known to have very unconventional interpretations and
a distinctive style in several performances, and his performance of the Mozart
piece indeed sticks out as an outlier in the arousal-valence space.

Effects for Arousal-Valence Prediction in the Mozart Piece

In Chapter 4, we saw that the contributions of explanatory variables to a pre-


diction can be visualised using effects plots. Let us plot the effects plot of the
mid-level to emotion regression model used here for the Mozart piece. In the
plots shown in Figure 7.8, we see that the high arousal of Gould’s performance can
be attributed to low melodiousness (resulting in high positive effect on arousal,
since melodiousness has a negative mean effect on arousal) and high rhythm
complexity (again resulting in high positive effect for arousal, since rhythm com-
plexity has a positive mean effect on arousal). Whether Gould’s performance is

“rhythmically complex” in any objective musical sense (and more so than the other
performances of this piece in the dataset) is debatable. The human annotations
in the mid-level feature dataset are not to be expected to reflect musicological
concepts, but rather general impressions that may well be influenced by other
factors than the one implied by the feature name. This also holds, in particular,
for the “minorness” feature – see below.
According to Figure 7.8, the high valence of Gould’s performance can be
attributed to low minorness (resulting in high positive effect on valence, since
minorness has a negative mean effect on valence). Again, it is important to
remember that the minorness feature does not purely relate to the mode of the
audio recording, but also other factors that are perceived to be associated with
“minor sounding” songs. In this sense, Gould’s performance might be perceived as
less “minor” than the other performances because of the abnormally high tempo
at which Gould plays this piece, compared to the others.
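For reference, a small sketch of the effects computation (with placeholder weights and feature values, not the actual model parameters): the effect of a mid-level feature on a linear prediction is its regression weight multiplied by the feature value, and the centred variant of Figure 7.8b subtracts the mean effect across the compared performances.

```python
# Sketch of the effects computation: effect_ij = w_j * x_ij for performance i and
# mid-level feature j; weights and feature values below are placeholders.
import numpy as np

def effects(weights, X):
    # X: (n_performances, n_features), weights: (n_features,) for one target
    return X * weights                      # broadcasting gives per-feature effects

def centred_effects(weights, X):
    E = effects(weights, X)
    return E - E.mean(axis=0)               # centre on the mean across performances

w = np.array([0.5, -1.2, 0.8])                        # toy arousal weights
X = np.array([[3.0, 4.0, 2.0], [5.0, 2.0, 6.0]])      # toy mid-level values
print(effects(w, X))
print(centred_effects(w, X))
```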
(a) Effects for Mozart’s Piano Sonata No. 16 (K545), 2nd movement

(b) Effects, centred on the mean across the five performances for each mid-level feature

Figure 7.8: Effects (a) of mid-level features on prediction of valence and arousal for the
five performances of Mozart’s Piano Sonata No. 16 (K545), 2nd movement.
The diamonds represent the effects of the five performers. The box plots
represent the distribution of the effects for the training dataset (Bach WTC).
On the centred plot (b), the spread of the effects for the different performances
is more clearly visible.

pianist    predicted descriptors                    sample answers

Pires      tentative, transparent,                  “soft, interesting”
           harmonic, withheld, shiny                “conventional and academic, almost Platonic,
                                                     straightforward and traditional yet melodious”

Uchida     shy, soaring, peaceful                   “soft, delicate, joyful, pensive”
                                                    “calm, modest, pure”
                                                    “ponderous, melancholic, slow”

Gulda      peaceful, soaring,                       “soft, quiet, tender, singing, emotional, easy, airy”
           melodic, shy, tired                      “studiously controlled but surprising with
                                                     nuances - almost sloppy but in total control”

Gould      precise, harsh, intense,                 “accentuated, staccato, comic-like, fast”
           dirty, cerebral                          “fast, hectic, unrelaxed”
                                                    “dry, playful, rushed, clean”

MIDI       terrible, bad, uneven,                   “non-emotional, plain”
           metallic, anharmonic                     “mechanical, flawed, stiff, undifferentiated”
                                                    “fake, machine like, unmoving”

Table 7.8: For five versions of Mozart’s Piano Sonata No. 16 (K545), 2nd movement, the
performance descriptor words are predicted by fitting mid-level features to
the PCA dimensions of the occurrence matrix of the Con Espressione dataset.
These are compared with actual answers that participants entered in response
to the respective performances.

Predicting Descriptive Terms for Performances

Using the PCA dimensions described above as the dependent variables, and
mid-level feature predictions as independent variables, we can train a simple
regression model mapping mid-level feature values to the description space. We
then use this model to predict the positions of the Mozart performances in the
PCA dimension space, and find the nearest words in the space from the dataset
as the descriptive words predicted for the performances. A visualisation of all
the words in the training dataset (199 words derived from a total of ∼ 1500 after
filtering for minimum number of occurrences and entropy across performances,
as described in Cancino-Chacón et al. [37]), mapped onto the first two PCA
dimensions is shown in Figure 7.9. Also plotted are the predicted positions (solid
coloured diamonds) and the ground truth positions (lighter, smaller diamonds –
obtained by computing the centroid of all ground truth word positions for each
performance) of the Mozart performances. The nearest words to the predicted
points are also highlighted4 . In Table 7.8 these predicted descriptive words are
compared to some of the answers entered by participants of the CEG in response
to the respective performances (for this table, the human answers were selected
randomly, but single-word answers were excluded).
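A rough sketch of this descriptor-prediction step (all arrays below are random placeholders standing in for the real mid-level predictions and word coordinates):

```python
# Sketch: fit a linear map from mid-level features to the PCA description space,
# then return the vocabulary words nearest to a predicted point. All data here
# are random placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

def nearest_words(point, word_coords, vocab, k=5):
    dists = np.linalg.norm(word_coords - point, axis=1)
    return [vocab[i] for i in np.argsort(dists)[:k]]

midlevels = np.random.rand(45, 7)        # mid-level predictions per performance
pca_targets = np.random.rand(45, 4)      # performance positions in PCA space
reg = LinearRegression().fit(midlevels, pca_targets)

word_coords = np.random.rand(199, 4)     # positions of the 199 vocabulary words
vocab = [f"word_{i}" for i in range(199)]
print(nearest_words(reg.predict(midlevels[:1])[0], word_coords, vocab))
```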

4 In Figure 7.9, the visualised positions of the words may be shifted slightly from their actual
positions on the plane, in order to avoid too much overlapped text. We use the Python package
adjustText (https://github.com/Phlya/adjustText) to do this.

Figure 7.9: Visualisation of the performance descriptor words present in the Con Espressione Dataset, projected on the first two PCA dimensions of the
word occurrence matrix. The solid coloured diamonds are predicted positions in this space of the five performances for Mozart’s Piano
Sonata No. 16 (K545), 2nd movement. The words nearest to these are highlighted in same colours. The smaller, fainter diamonds are the
ground-truth positions, meaning the centroids across the descriptor words for a performance.

7.5 discussion and conclusion


In this chapter, we evaluated four feature sets – mid-level perceptual features,
pre-trained emotion features, low-level audio features, and score-based features –
on their ability to model and predict emotion (in terms of arousal and valence) in
diverse piano performances. Specific focus was placed on the three audio-based
features and their power to model performance-wise variation of emotion.
We observe some noteworthy trends. While score-based features are strong
predictors of valence, they do not model arousal as well as the audio-based
features. This is expected, since the score-based features carry no information
about expressive, performance-related qualities, which the audio-based features
can, in theory, extract. Such qualities correspond largely to variations in arousal,
while the overall “positive” or “negative” expression, the valence, seems to be
highly correlated with the mode of a piece of music, which is codified in the
score. Among the audio-based features, mid-level features are seen to be the
most important predictors of emotion, followed by features extracted from a
deep end-to-end model, followed by low-level features, which perform poorly
across various tests, including testing for performance-wise emotion variation
modelling.
In terms of robustness as well, mid-level features stand at the top, as seen by
prediction of emotions for unseen performances with radically different emotions.
Note that the model predicting the mid-level features was originally trained on
the Mid-level Features Dataset [9], a dataset composed of very few piano pieces.
The model is domain-adapted to piano music during training. Its effectiveness
in modelling emotion even after transfer to a different domain is evidence of
the overall “transferability” of the mid-level features. By this we mean that these
features may hold relevance in a wide variety of genres and styles of music in
addition to solo piano music, and are amenable to transfer learning via domain
adaptation. Transfer of mid-level features to other under-represented genres is
thus a potential direction of future work.
From the experiments presented in this chapter, it is clear that data-driven
automatic feature extractors are strong competitors to the audio features typically
used for emotion recognition [144]. Here the importance of mid-level features
is even more pronounced – in addition to being able to model both arousal
and valence well under different conditions, they also provide intuitive musical
meaning to each feature.
The search for good features to model music emotion is a worthwhile objective
since emotional effect is a very fundamental human response to music. Features
that provide a better handle on content-based emotion recognition can have a
significant impact on applications such as search and recommendation. Modelling
emotion is also becoming increasingly relevant in generative music, allowing
possibilities such as expressivity- or emotion-based snippet continuation and
emotion-aware human-computer collaborative music.
In the next chapter, we will see how we can augment the mid-level feature
space using two additional perceptually relevant features.
8
COMMUNICATE: DECODING INTENDED EMOTION VIA AN AUGMENTED MID-LEVEL FEATURE SET

8.1 Augmenting the Mid-level Feature Space . . . . . . . . . . . . . . . 121


8.2 Decoding and Visualising Intended Emotion . . . . . . . . . . . . . 125

An idea central to this thesis, and alluded to on several occasions in earlier


chapters, is that music acts as a medium for communicating emotions typically
from the musician/performer to the listener. Expert musicians can mould a
musical piece to convey specific emotions that they intend to communicate. In
this chapter, we attempt to place our music emotion models in this performer-
to-listener communication scenario, and demonstrate, via a small visualisation,
real-time music emotion decoding vis-à-vis a performer’s intended emotions.
But first, in Section 8.1, in the hope of subsequently improving emotion mod-
elling via mid-level features, we look at two obvious features absent from the
set of mid-level features used till now – perceptual speed and dynamics. Given
the unavailability of good quality human ratings for these two features, we had
ignored these so far. Here, finally, we investigate approximating these two features
with hard-coded algorithms. We then investigate modelling music emotion with
this augmented set of mid-level features (Section 8.1.3), and find that these two
additional features indeed improve emotion modelling substantially. This
observation provides evidence that perceptual speed and dynamics are important
additions that help complete the set of mid-level features. Then, in Section 8.2,
we use a mid-level to emotion model (trained with the augmented set of mid-
level features) to predict the emotions that a musician is trying to communicate
through music as he plays and modifies a melody according to a range of varying
intended emotions.
This chapter is a greatly expanded version of the following publication:

• S. Chowdhury and G. Widmer


Decoding and Visualising Intended Emotion in an Expressive Piano Perfor-
mance, Late-breaking Demo Session, 23rd International Society for Music Infor-
mation Retrieval Conference (ISMIR 2022), Bengaluru, India. (Under review)

120

8.1 augmenting the mid-level feature space


The set of mid-level features used so far in this thesis were taken from Aljanaki
and Soleymani [9], who in turn selected the set of features from previous work
like Friberg et al. [64]. In [9], a dataset of 5000 audio clips human-annotated with
7 mid-level features was made available. We used this dataset to train our models
and build a mid-level feature extractor. The motivation for doing so stems from
our desire to train mid-level feature extractors that as much as possible conform
to how humans seem to perceive these features (as represented by actual rating
data). The seven mid-level features in the dataset were the following:

1. Melodiousness

2. Articulation

3. Rhythmic complexity

4. Rhythmic stability

5. Dissonance

6. Tonal stability

7. Modality (Minorness)

While we have shown in the earlier chapters of this thesis that this set of
seven mid-level features captures variation in music emotion surprisingly well,
two obvious features important for emotional expression – perceptual speed and
dynamics – are conspicuously missing from it [34]. Musical cues such as attack
rate and dynamics have been shown in previous experiments to contribute
significantly to emotional expression [53]. Our hypothesis is that augmenting
the mid-level feature space with these two additional features should improve
explainable emotion modelling significantly. In this section, we demonstrate the
efficacy of adding (analogues of) perceptual speed and dynamics to improve
modelling of musical emotion. These two features will be modelled in a more
direct way, based on our musical intuition rather than on empirical user perception
data.

• Perceptual Speed
Recall from Chapter 4, Section 4.1.2 that Friberg et al. [64] used the following
definition of perceptual speed in their work on perceptual features:
“Indicates the general speed of the music disregarding any deeper
analysis such as the tempo, and is easy for both musicians and
non-musicians to relate to.”
Note the distinction between perceptual speed and tempo. While tempo is
typically computed as the occurrence rate of the most prominent metrical
level (beat), perceptual speed is influenced by lower level or higher level
metrical levels as well – factors such as note density (onsets per second)

seem to be important [56]. Madison and Paulin [126] find that there is a
non-linear relationship between rated perceptual speed and tempo. In actual
music (not a metronome track), a high tempo seems to be counteracted
by a lower relative event density and vice versa, resulting in a sigmoid
shape on the perceptual speed vs tempo plot, with shallower slopes for the
extreme tempo ranges compared to the middle range. They find that event
density (number of sound events per unit time) contributes substantially to
perceptual speed.

• Perceived Dynamics
Perceived dynamics refers to the perceived force or energy expended by
musicians on their instruments while performing, as inferred by a listener.
Going back again to Section 4.1.2, let us recall how dynamics was defined:
“Indicates the played dynamic level disregarding listening volume.
It is presumably related to the estimated effort of the player.”
Note the distinction between dynamics and volume – dynamics does not
only refer to the sound intensity level. As Elowsson and Friberg [57] point
out, loudness and timbre are closely related to perceived dynamics, and
spectral properties of most musical instruments change in a complex way
with performed dynamics.

8.1.1 Approximating Perceptual Speed

While learning perceptual speed in a data-driven fashion like the other mid-level
features would be ideal, most works (such as [56] and [64]) on perceptual speed
have used small, privately collected datasets for their experiments. Training large-
scale models on such small datasets is not feasible; moreover, privately collected
datasets are often not available. Therefore, taking advantage of the observation
that perceptual speed is significantly correlated with event density [56, 126], we
approximate perceptual speed by computing onset density.
‘Onset’ refers to the beginning of a musical note or other sonic event. It is
related to (but different from) the concept of transient: all musical notes have an
onset, but do not necessarily include an initial transient. Onset detection is the
task of identifying and extracting onsets from audio. Onset density (analogous to
event density) is simply the number of onsets per unit time.
We experiment with two different onset density extraction methods. The first
is the SuperFlux method of onset detection [30], which extracts an onset strength
curve by computing the frame-wise difference of the magnitude spectrogram
(spectral flux) followed by a vibrato suppression stage. The onsets are detected
by applying a peak picking function on the onset strength curve. This is a purely
signal processing based method.
The second method is applicable for our specific context of solo piano music.
The idea is to use a piano transcription algorithm to predict the played notes
and the note onset times from an audio recording, and to obtain the onset curve
by summing over the pitch dimension. Figure 8.1 explains this process. We


Figure 8.1: Computation of onsets using RNNPianoNoteProcessor from madmom [29].


The output of RNNPianoNoteProcessor is a matrix of size 88 × T, where T
is the number of frames in the time axis, and the pitch axis corresponds to
the 88 notes of the piano. The start positions of the transcribed notes from
this matrix are taken and are summed over the pitch axis, giving the onset
strength curve. Onsets are then obtained from this curve using peak picking.

use the RNNPianoNoteProcessor from madmom [29] as our pre-trained piano


transcription algorithm. The output of the RNNPianoNoteProcessor is summed
in the frequency dimension to get the onset strength curve, which is then fed to a
peak picking function to obtain the onsets.
We observe that a sharper onset curve is obtained when we use the transcrip-
tion method, as compared to the SuperFlux method, and therefore we use that
approach since we are dealing with piano audio here. In a more general case
(with the audio containing different instruments), the SuperFlux method should
be used.
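A minimal sketch of such an onset-density computation is shown below; it uses librosa's spectral-flux-based onset detector (with the `lag` and `max_size` arguments emulating SuperFlux-style suppression) as a stand-in for the exact implementations discussed above, and the file path is hypothetical:

```python
# Sketch of an onset-density estimate (onsets per second) as a proxy for
# perceptual speed. The lag/max_size arguments emulate SuperFlux-style vibrato
# suppression; the file path is a placeholder.
import librosa

def onset_density(path):
    y, sr = librosa.load(path, sr=None, mono=True)
    odf = librosa.onset.onset_strength(y=y, sr=sr, lag=2, max_size=3)
    onsets = librosa.onset.onset_detect(onset_envelope=odf, sr=sr, units="time")
    return len(onsets) / (len(y) / sr)

# print(onset_density("recording.wav"))  # hypothetical audio file
```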

8.1.2 Approximating Perceived Dynamics

For perceived dynamics, again, we do not have any annotated public dataset, to
the best of our knowledge. Elowsson and Friberg [57] use a pipeline of a large
number of handcrafted low-level features to approximate performed dynamics.
In our case of solo piano music, we find that the RMS (Root-Mean-Squared)
amplitude of the audio signal is a good candidate feature that is able to capture a
significant variation in emotion, and is easy to understand (from an interpretabil-
ity perspective). We use this feature as an approximation to performed dynamics
(estimated effort of the player) for solo piano music because the relationship
between note velocity (the force with which a keyboard key is pressed) and
loudness can be assumed to be monotonic for the piano [5].
We use Librosa’s RMS function [129] to compute this feature. For an input audio
signal x, the RMS amplitude is given as RMS_k = sqrt( mean( w_τ(x)_k² ) ), k = 1 . . . N,
where w_τ(·)_k is a rectangular windowing function which partitions the input
sequence into frames and returns the k-th frame of length τ, and N is the total
number of frames.
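A corresponding sketch for this dynamics proxy (frame and hop lengths are illustrative defaults, not necessarily those used in our experiments):

```python
# Sketch of the RMS-based dynamics proxy: frame-wise RMS amplitude of the signal,
# summarised by its mean over the excerpt.
import numpy as np
import librosa

def mean_rms(path, frame_length=2048, hop_length=512):
    y, _ = librosa.load(path, sr=None, mono=True)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    return float(np.mean(rms))
```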

                                     Adjusted R2
Feature Set                          Arousal    Valence

The (7)-mid-level feature set        0.68       0.63
Onset density and RMS amplitude      0.74       0.39
The (9)-mid-level feature set        0.79       0.65

Table 8.1: Performance of the different feature sets on modelling arousal and valence of
the Bach WTC Book 1 dataset.

8.1.3 Modelling Emotion Using the Augmented Mid-level Feature Space

For our present case, we consider the two newly added features (onset density
and RMS amplitude) as a part of the “mid-level feature set”. While technically
these two features are computed using low-level algorithms instead of being
learned from data, we still consider them under the ambit of “mid-level” for the
purposes of this chapter, since we treat them as approximations of perceived
speed and perceived dynamics. To distinguish between the original set of seven
mid-level features, and the new augmented feature set of nine features, we will
call them (7)-mid-level features and (9)-mid-level features, respectively, in this
chapter.
To evaluate the effect of adding the additional features to our original set of
mid-level features, we use the Bach Well-Tempered Clavier Book 1 (WTC) dataset
from Chapter 7. Remember that the dataset contains recordings of the first eight
bars of all 48 pieces of the WTC Book 1 performed by 6 different pianists, for
a total of 288 recordings. We perform the regression based evaluation as done
previously in Section 7.3.2. First, we predict the original (7)-mid-level features
for the 288 Bach recordings using a domain adapted RF-ResNet model, as was
done in Section 7.3.2. We then compute the mean onset densities and mean RMS
amplitudes for each of the recordings using the approximations mentioned above,
giving us the (9)-mid-level feature set for the Bach data. The effectiveness of
this feature set in modelling emotion is evaluated by fitting a multiple linear
regression model with nine inputs and two outputs (for arousal and valence). We
look at the adjusted R2 -score as the metric. This is compared to the case where
only the original seven (7)-mid-level features are used, and where only the two
newly added features are used. The results are tabulated in Table 8.1.
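For reference, the evaluation can be sketched as follows (random arrays stand in for the actual Bach WTC features and annotations):

```python
# Sketch of the evaluation: multiple linear regression from the (9)-mid-level
# features to arousal/valence, scored with the adjusted R2. Data are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

X = np.random.rand(288, 9)     # (9)-mid-level features for the 288 recordings
Y = np.random.rand(288, 2)     # arousal and valence annotations
pred = LinearRegression().fit(X, Y).predict(X)

for i, name in enumerate(["arousal", "valence"]):
    print(name, round(adjusted_r2(Y[:, i], pred[:, i], X.shape[1]), 3))
```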
Firstly, we note that while onset density and RMS amplitude alone cannot
predict valence well, these two features alone already give a better fit for arousal
than the original (7)-mid-level feature set. For both arousal and valence, the
combined (9)-mid-level feature set gives the best result.
We also look at the absolute value of the t-statistic, shown in Figure 8.2, to
evaluate the relative feature importance values. We see that for arousal, onset
density is the most important feature, followed by RMS amplitude. Among the
original (7)-mid-level features, the top-3 are rhythm stability, rhythm complexity,
and melodiousness, which were also the top-3 features in Section 7.3.2.

(a) T-values for Arousal (b) T-values for Valence

Figure 8.2: Feature importance for the original set of seven mid-level features, and the
newly added features in this chapter.

For valence, the most important feature according to the t-statistic is minorness,
followed by onset density and tonal stability.
The high t-values of onset density and RMS amplitude for arousal are not
surprising, and they highlight the importance of perceived speed and perceived
dynamics features for modelling music emotion. Combining these two features
with the original mid-level features gives a good improvement on overall arousal
modelling, with the adjusted R2 -score being better than the two feature sets alone.
For valence, the (7)-mid-level features are still important as adding the two new
features results in a small improvement in the adjusted R2 -score.
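The per-feature t-statistics plotted in Figure 8.2 can be obtained, for instance, with an ordinary least squares fit in statsmodels (again with placeholder data):

```python
# Sketch: absolute t-statistics of a linear model as a feature-importance measure,
# using statsmodels OLS; feature and target arrays are placeholders.
import numpy as np
import statsmodels.api as sm

X = np.random.rand(288, 9)      # (9)-mid-level feature values
arousal = np.random.rand(288)   # arousal annotations

ols = sm.OLS(arousal, sm.add_constant(X)).fit()
print(np.abs(ols.tvalues[1:]))  # drop the intercept; larger |t| = more important
```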
Our findings point to a very pertinent future direction of work on learning
perceived speed and perceived dynamics from actual human ratings as a means
to improve music emotion recognition algorithms and to equip such models with
better explanatory capacity grounded on human perception.

8.2 decoding and visualising intended emotion


In all of our experiments so far, we have dealt with perceived emotion – we analysed
a given piece of music and attempted to predict what emotions a listener might
perceive from it. The intentions of the performer or composer were not considered
in those cases. For example, in Chapter 7, we considered different performances
of the same piece of music by different pianists, yet it is not known if the
pianists intended to convey any particular emotion through their interpretations.
However, there are cases where the intended emotion may be available – for
instance in the experiments by Gabrielsson and Juslin [68], where the authors
presented evidence that performers are able to communicate intended emotions to
listeners with high accuracy. In Akkermans et al. [8], four experienced performers
– a flutist, a pianist, a violinist, and a vocalist – recorded three melodies with
the intention of conveying seven different emotions: “angry”, “expressionless”,
“fearful”, “happy”, “sad”, “solemn”, and “tender”. Crowd-sourced ratings of
decoded emotion showed that although the decoding accuracy varies by emotion

and instrument, it is generally significantly better than random (by decoded


emotion, we mean the emotion perceived by listeners, as opposed to the encoded
emotion, which is the intended emotion of a performer).
Such a scenario, where we have information about what emotion a musician
intended to communicate to listeners with their playing, allows us to view our
mid-level based emotion model in a different light. Instead of comparing model
predictions with listener ratings of perceived emotion, we could compare the
predictions directly to the intended emotion. In this section, we demonstrate this
direct comparison on a YouTube video of a musician (Jacob Collier) playing a
piece and modifying his playing style in real time to convey specific emotions. We
use our mid-level features based emotion model to predict the emotion from the
audio continuously and visualise it on Russell’s circumplex [159]. Note that this is
a demonstration of our model, and not a detailed analysis of model performance.

8.2.1 Setup of the Demonstration

In a YouTube video [171], multiple Grammy Award winning musician Jacob


Collier plays the piece “Danny Boy” (or “Londonderry Air”) on a piano and
modifies it according to different emotions shown to him while he is playing.
The full video contains three tiers of emotions of increasing complexity. For our
demonstration, we consider tier one. The emotions in this tier are “happy”, “sad”,
“angry”, “mysterious”, “triumphant”, and “serene”. He also plays the melody in
a “neutral” manner before the first emotion is shown.
We extract the sound from this video and predict the arousal and valence values
at 1-second intervals using our emotion model (having a 5-second window length).
The predicted arousal and valence values are plotted on Russell’s circumplex
[159] and an animation of the plot is overlaid with the video. To smooth the
animation, the values between the actual 1-second-apart predicted values are
obtained using exponential interpolation. This video, with the overlaid emotion
plot, can be accessed here: 0 .

8.2.2 The Emotion Model

Since Jacob plays and modifies the song continuously in real time, we wish to
predict the emotions in a dynamic fashion. This will let us visualise how the
model reacts to changes in playing style as the intended emotion changes. In
other words, we wish to perform dynamic emotion recognition, instead of static
emotion recognition. For dynamic emotion recognition, the audio is split into
windows and emotions are predicted for each window. The smaller the window
size, the quicker the model outputs react to changes in the performance.
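Schematically, the windowed prediction looks like the sketch below, where `predict_emotion` is a placeholder for the full mid-level-to-emotion pipeline and not a real API:

```python
# Sketch of dynamic emotion recognition: slide a 5-second window over the audio
# with a 1-second hop and predict (arousal, valence) for each window.
# `predict_emotion` is a placeholder for the trained model pipeline.
import numpy as np
import librosa

def sliding_predictions(path, predict_emotion, win_s=5.0, hop_s=1.0):
    y, sr = librosa.load(path, sr=None, mono=True)
    win, hop = int(win_s * sr), int(hop_s * sr)
    preds = [predict_emotion(y[s:s + win], sr)
             for s in range(0, max(1, len(y) - win + 1), hop)]
    return np.array(preds)   # roughly one (arousal, valence) pair per second
```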
We will build our emotion model in a manner similar to the model that we
used for analysis in Section 8.1. To recall the full pipeline, the steps are:

1. Train a (7)-mid-level model using the Mid-level Features dataset, with


domain adaptation for piano music (as described in Chapter 6).

2. Use this model to predict (the original seven) mid-level feature values for
the 288 recordings in the Bach WTC Book 1 dataset.

3. Compute onset density and RMS amplitude for the 288 recordings in the
Bach WTC Book 1 dataset.

4. Train a mid-level to emotion model on the Bach WTC Book 1 dataset, using
(9)-mid-level features as inputs and the arousal/valence annotations as
outputs (we use Multiple Linear Regression (MLR) with nine inputs and
two outputs).

However, recall that the mid-level models in previous chapters were trained
on input spectrograms of length 15 seconds, which is too long a window
for the model to react to quick changes in emotion in our present audio. We
therefore experiment with training mid-level feature models with smaller input
audio lengths. The training performance with different input lengths is shown
in Figure 8.3. We choose a 5-second window for our final model as a reasonable
trade-off between prediction accuracy and window size. This model is then used
for step 1 above. The rest of the steps remain the same.

Figure 8.3: Mid-level feature model performance (average correlation coefficient) with respect to input audio length.

8.2.3 Results and Discussion

Six frames from the video together with the predictions are shown in Figure 8.5
(and continued in Figure 8.6). We can see that the predicted emotions match
closely with the intended emotions (“Jacob’s Emotions”). Note that the frames
shown here are captured at times when the predicted emotion and intended
emotion come closest visually (for those emotions that are present on Russell’s
circumplex, such as “happy”, “sad”, “anger” and “serene”). The full trace of the
prediction point is shown in Figure 8.4b.
We also obtain static emotions – the audio sections corresponding to each
of the seven emotions are cut out and used as individual input audio files for

the model. In this case, we use our standard input length (15-second) mid-level
feature model, with the input audio being looped if it is less than 15 seconds,
and the predictions for successive windows with a 5-second hop averaged if it is
more than 15 seconds. These static emotion predictions are shown on Figure 8.4a,
where the predicted points are annotated with the intended emotion for each.
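A sketch of this input-length handling (again with `predict_emotion` as a stand-in for the model):

```python
# Sketch of the static-prediction handling: loop clips shorter than 15 s, and for
# longer clips average the predictions of 15 s windows taken with a 5 s hop.
import numpy as np

def static_prediction(y, sr, predict_emotion, win_s=15.0, hop_s=5.0):
    win, hop = int(win_s * sr), int(hop_s * sr)
    if len(y) < win:                               # loop short clips up to 15 s
        y = np.tile(y, int(np.ceil(win / len(y))))[:win]
    starts = range(0, len(y) - win + 1, hop)
    preds = [predict_emotion(y[s:s + win], sr) for s in starts]
    return np.mean(np.array(preds), axis=0)        # averaged (arousal, valence)
```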
The visualisation experiment presented in this section serves as an interesting
proof-of-concept for further, more rigorous, experiments on decoding intended
emotions using computer systems. We can see that a simple linear regression
model, with a handful of mid-level features as inputs (7 original plus 2 new),
trained on a small dataset of 288 Bach WTC examples, is able to predict the
intended emotions for a markedly different set of performances in a fairly satis-
factory manner. This points to the robustness of the (9)-mid-level features and of
our (7)-mid-level feature model, and to the impressive capacity of these features
to reflect encoded music emotion. The full demonstration video can be found
here: 0 .


(a) Static emotion prediction. The predicted emotions are marked


with dark text, and the emotion words of Russell’s circumplex are
marked with light text.

(b) Full trace of dynamic emotion prediction. Jacob’s intended emo-


tions (according to the notated emotion in the original video)
are depicted with different colours, and the passage of time is
depicted with the shade – from lightest to darkest.

Figure 8.4: Static and dynamic emotion prediction for Jacob Collier’s performance of
Danny Boy according to seven emotions: “neutral”, “happy”, “sad”, “angry”,
“mysterious”, “triumphant”, and “serene”.

Figure 8.5: Screenshots during different times of Jacob Collier’s performance video,
overlaid with the corresponding predicted emotions.

Figure 8.6: Screenshots during different times of Jacob Collier’s performance video,
overlaid with the corresponding predicted emotions.
9
CONCLUSION AND FUTURE WORK

9.1 conclusion
In this thesis, we set out with the goal of investigating the problem of music
emotion recognition (from audio recordings) through the lens of interpretability
(or explainability) by using perceptually relevant musical features. To this end,
we first proposed a bottleneck model that is trained using perceptual mid-level
features and music emotion labels (Section 4.2). We trained a deep model to
predict mid-level features – “melodiousness”, “articulation”, “rhythm stability”,
“rhythm complexity”, “dissonance”, “tonal stability”, and “minorness” – as an
intermediate layer (the bottleneck), from which the final emotion values were
then predicted. The mid-level features as well as the emotions were learned
from human annotated datasets. The mid-level to emotion model was made
explainable by virtue of the interpretability of the features themselves, and
by using a linear model that predicted emotion from mid-level features. We
explained the predictions in terms of the learned weights, as well as the effect
of each mid-level feature on the output value of a particular emotion prediction
(Section 4.5).
Next, we introduced two approaches to explain the part of the model between
the audio (spectrogram) inputs and the mid-level bottleneck layer. The first was
to explain mid-level feature predictions by training a
surrogate linear model using LIME (Local Interpretable Model-agnostic Expla-
nations) and using this to indicate important patches in the input spectrogram
(Section 5.4), which could also be transformed back to (low-quality) audio. The
second approach used audioLIME to explain mid-level predictions using an
interpretable decomposition of the input audio into its musical sources (the audio
track is split into five instrument components: vocals, piano, bass, drums, and
other) (Section 5.5).
Equipped with mid-level features for predicting and explaining music emotion,
we then turned to modelling emotional variation in piano performances. In order
to maintain model validity for solo piano music, we proposed an unsupervised
domain adaptation and refinement pipeline to transfer mid-level feature models
to the piano domain. We used the well-known “unsupervised domain adaptation
using backpropagation” approach to learn domain invariant feature spaces, and
introduced a self-training based refinement stage to further improve performance
on piano music (Chapter 6).


To study emotional modelling in piano music, we used a dataset of perfor-


mances of Bach’s Well-Tempered Clavier Book 1, annotated with arousal and
valence values. We investigated different feature sets – low-level features, score-
based features, mid-level features, and features from a pre-trained deep emotion
regressor – on their capacity for emotional modelling across different splits, in-
cluding emotion variation between performances of the same piece. We found
that mid-level features explain the variation of emotion between performances
the best, including when “outlier” performances are excluded from the training
set and used as a test set (Chapter 7).
The efficacy of mid-level features in predicting emotions in a robust and
transferable manner motivated us to look for additional features to augment
the mid-level feature space. We looked at onset density and RMS amplitude as
approximations for perceptual speed and dynamics, both of which are considered
important aspects of expressive music performance. We found that adding these
two features to the mid-level space improved emotion modelling significantly
(Section 8.1).
As a final demonstration of mid-level features based emotion prediction, we
used the augmented set of mid-level features to predict and visualise the intended
emotions of a performer modifying a piece in real time according to different
emotions. We found impressive correspondence between the emotions that the
performer attempted to communicate through music and the emotions that our
system decoded from the music (Section 8.2).

9.2 future work


While we obtained some interesting results in this thesis, there are several ways
in which our methods and analyses can be improved. For instance, testing dif-
ferent domain adaptation methods [119] for musical features systematically, and
applying them for different kinds of domain shifts will be helpful for establishing
a baseline for the domain shift problem in the context of music and possible ways
to mitigate it.
Another way to improve upon the findings of this thesis is by collecting
more emotion data for a diverse set of music from a larger pool of listeners. In
Chapter 7, we used a relatively small dataset of emotion annotated recordings
(the 288 performances of Bach’s Well-Tempered Clavier) for our analysis. With
a larger dataset consisting of more number of composers and performers, the
question of piece-related vs performance-related expressive qualities of music
could potentially be answered more definitively.
A research direction that is highlighted in this thesis is that of human-friendly
explanations in machine learning models applied to music. Over the past few
years, there has been some progress on this front, for example attention-based
explanations for audio tagging [178], concept-based methods for musically rele-
vant explanations [61], and explanations applied to adversarial examples [153].
An extension of mid-level feature based explanations could possibly incorporate
“musician-friendly” or “listener-friendly” concepts, depending on the application,
for generating context-specific and meaningful explanations. As we have already

seen in Chapter 8, additional features such as perceptual speed and perceived


dynamics are important predictors and explanatory variables for music emo-
tion, and thus, collecting large scale human annotations for these features across
different genres would enable a data-driven approach to modelling such features.
Such conceptually intuitive features can also be useful for steerable human-AI
collaborative music production and songwriting. Huang et al. [92] highlight the
importance of steerability in AI-assisted music making, but note that presently
available methods lack such steerability. They gathered comments from teams
participating in the AI Song Contest [50], and reported comments from some of
the teams. One team, as reported by Huang et al. [92], mentioned their desire to
make a generated section of music sound “darker” and a cadence to sound more
“elaborate”, which they achieved by rewriting the section multiple times. Instead,
a knob for tweaking perceptual parameters such as brightness, harmonic complexity,
and rhythm complexity might have made this process more intuitive.
Overall, we see many potential avenues of future work in perceptually informed
analysis of music. We hope that this thesis and its contributions will inspire new
researchers and practitioners in this area. We envision an exciting future for
“AI+music” and look forward to musical questions that machine learning may
help us explore and answer.
Part III

APPENDIX
a
DATASETS USED IN THIS THESIS

a.1 The Mid-level Features Dataset . . . . . . . . . . . . . . . . . . . . . 136


a.2 The Soundtracks Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 137
a.3 The PMEmo Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
a.4 The MAESTRO Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 139
a.5 The DEAM Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
a.6 The Con Espressione Dataset . . . . . . . . . . . . . . . . . . . . . . 140

Several datasets have been used across the different chapters of this thesis. The
present appendix serves as a quick reference for all those datasets at one place.
A summary of the datasets is given below (Table a.1), and details about each
dataset are provided in the following sections, in the order of their appearance in
the thesis.

dataset audio annotations used in

Mid-level Features [9] Yes Ratings for 7 mid-level features Chapters 4, 5, 6, 7 and 8
Soundtracks [54] Yes Ratings for 8 emotions Chapter 4
PMEmo [184] Yes Ratings for arousal and valence Chapter 5
MAESTRO [85] Yes None Chapter 6
DEAM [10] Yes Ratings for arousal and valence Chapter 7
Con Espressione [37] Yes* Free-text descriptions Chapters 6, 7 and 8

Table a.1: Summary of the datasets used in this thesis. The annotations column mentions
which type of annotations were used in the thesis for each dataset. *The audio
files for the Con Espressione dataset are not distributed in the released version
of the dataset due to copyright reasons.

a.1 the mid-level features dataset


The Mid-level Perceptual Features Dataset [9] consists of 5000 song snippets of
around 15 seconds each annotated according to seven mid-level descriptors:
melodiousness, articulation, rhythm stability, rhythm complexity, dissonance, tonal
stability, and modality (or ‘minorness’). The song snippets come from five sources:
Jamendo (www.jamendo.com), Magnatune (magnatune.com), the Soundtracks
dataset [54], the Bi-modal Music Emotion dataset [128], and the Multi-modal


Music Emotion dataset [143]. No more than five songs from the same artist
were allowed to be present in the dataset. The ratings for the seven mid-level
perceptual features, as defined in [9], were collected through crowd-sourcing.
To help the human participants interpret the mid-level concepts, the mid-level
features were described in the form of questions, as reproduced below (the ratings
were collected in a pairwise comparison scenario).

1. Melodiousness: To which excerpt do you feel like singing along?

2. Articulation: Which has more sounds with staccato articulation?

3. Rhythm Stability: Imagine marching along with the music. Which is easier
to march along with?

4. Rhythm Complexity: Is it difficult to repeat by tapping? Is it difficult to


find the meter? Does the rhythm have many layers?

5. Dissonance: Which excerpt has noisier timbre? Has more dissonant inter-
vals (tritones, seconds, etc.)?

6. Tonal Stability: Where is it easier to determine the tonic and key? In which
excerpt are there more modulations?

7. Modality (‘Minorness’): Imagine accompanying this song with chords.


Which song would have more minor chords?

For obtaining ratings, Aljanaki and Soleymani [9] first used pairwise compar-
isons to get rankings for a small subset of the dataset, which was then used to
create an absolute scale on which the whole dataset was then annotated. The
annotators were required to have some musical education and were selected
based on passing a musical test. The ratings range from 1 to 10.

Mid-level Features dataset at a glance

Number of excerpts 5000


Length of each excerpt 15 seconds
Annotations Ratings for 7 mid-level features
Total audio duration 20.9 hours
URL https://osf.io/5aupt/

a.2 the soundtracks dataset


The Soundtracks (Stimulus Set 1) dataset [54], consists of 360 audio excerpts from
110 movie soundtracks. The soundtracks were chosen based on their representa-
tiveness for each emotion – half of the soundtracks were moderately to highly
representative of five discrete emotions (anger, fear, sadness, happiness, and ten-
derness), and the other half were moderate to high examples of the six extremes
of the three dimensions of the valence-energy-tension model of emotion (see

Section 2.3.2 “Schimmack and Grob model of emotion”). All the excerpts were
rated by 116 non-musicians on all eight perceived emotions (anger, fear, sadness,
happiness, tenderness, valence, energy, and tension) on a scale of 1-9.

Soundtracks dataset at a glance

Number of excerpts 360


Annotations Static emotion ratings (5 discrete + 3 dimensional
emotions)
Total audio duration 1.7 hours
URL https://www.jyu.fi/hytk/fi/laitokset/mutku/en/research/projects2/past-projects/coe/materials/emotion/soundtracks

a.3 the pmemo dataset


The PMEmo dataset (“Popular Music with Emotional annotations”) [184] consists
of 794 audio clips of chorus sections of popular songs and emotion annotations
(static and dynamic ratings of arousal and valence) for each clip. The dataset
also provides electrodermal activity of the raters captured during the annotation
process. For this thesis, we only use the static arousal and valence annotations.
The songs were selected from well-known music charts: the Billboard Hot 1001
(19th week 2016 – 23rd week 2017), the iTunes Top 100 Songs (USA)2 (15th week
2016 – 21st week 2017) and the UK Top 40 Singles Chart3 (37th week 2016 – 21st
week 2017). A total of 457 subjects (236 females and 221 males) were recruited
as participants for obtaining the emotion ratings, which included non-musicians
as well as students majoring in music. Each song received a total of at least
10 emotion annotations including one by a music-majoring student and one by
an English speaker. The annotations were done using an interface with a slider
collecting dynamic annotations, on a scale from 1 to 9, at a sampling rate of 2
Hz. The static annotations were done on a nine-point scale after the dynamic
annotations for each song finished. The participants heard the songs twice, first
for rating valence and then for rating arousal.

PMEmo dataset at a glance

Number of excerpts 794


Annotations Static arousal/valence ratings; Dynamic arousal/va-
lence ratings; Electrodermal activity
Total audio duration 8.4 hours
URL https://github.com/HuiZhangDB/PMEmo

1 https://y.qq.com/n/yqq/toplist/108.html
2 https://y.qq.com/n/yqq/toplist/123.html
3 https://y.qq.com/n/yqq/toplist/107.html

a.4 the maestro dataset


The MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organi-
zation) Dataset [85] is comprised of audio recordings and corresponding MIDI
tracks of close to 200 hours of piano performance. These recordings were recorded
over ten years of the International Piano-e-Competition4 , with virtuoso pianists
playing mostly Western classical pieces from the 17th to early 20th century. The
pianists perform on a Yamaha Disklavier, which is a real acoustic piano fitted
with electronic sensors for capturing high resolution and accurate note and pedal
information. In the present thesis, only the audio part of this dataset is relevant.
The audio is available as .WAV files (44.1 – 48 kHz 16-bit PCM stereo). We use
version 2.0.0 of the dataset.

MAESTRO Dataset v2.0.0 at a glance

Performances 1184
Compositions (approx.) 430
Total audio duration 172.3 hours
URL https://magenta.tensorflow.org/datasets/maestro

a.5 the deam dataset


The Database for Emotional Analysis in Music (DEAM) consists of music from
freemusicarchive.org (FMA), jamendo.com, and the medleyDB dataset [28]. There
are a total of 1802 music samples, out of which 1744 are 45-second snippets, and
58 are full-length songs. The 45-second excerpts are extracted from randomly
selected starting points in a given song.
The music was annotated with dynamic values of arousal and valence by
human annotators using a crowd-sourcing platform, captured at a sampling rate
of 2 Hz. For each emotion (arousal or valence) of each song/snippet, annotations
from multiple raters were combined and the combined data was used to fit
a generalised additive mixed model (GAM) [130], which provided the final
dynamic arousal/valence values for that song/snippet. The average rating for
each song/snippet and the standard deviations of ratings are also available as
static annotations. The dynamic annotations are between -1 and +1 and exclude
the first 15 seconds due to instability of the annotations at the start of the clips.

DEAM dataset at a glance

Number of excerpts 1802


Annotations Dynamic arousal/valence ratings; Static arousal/valence ratings
Total audio duration 25.6 hours
URL https://cvml.unige.ch/databases/DEAM/

4 https://piano-e-competition.com/

composer piece # pianists

Bach Prelude No.1 in C, BWV 846 (WTC I) 7 Gieseking, Gould, Grimaud, Kempff, Richter,
Stadtfeld, MIDI
Mozart Piano Sonata K.545 C major, 2nd mvt. 5 Gould, Gulda, Pires, Uchida, MIDI
Beethoven Piano Sonata Op.27 No.2 C# minor, 1st mvt. 6 Casadesus, Lazić, Lim, Gulda, Schiff, Schirmer
Schumann Arabeske Op.18 C major (excerpt 1) 4 Rubinstein, Schiff, Vorraber, Horowitz
Schumann Arabeske Op.18 C major (excerpt 2) 4 Rubinstein, Schiff, Vorraber, Horowitz
Schumann Kreisleriana Op.16; 3. Sehr aufgeregt (ex. 1) 5 Argerich, Brendel, Horowitz, Vogt, Vorraber
Schumann Kreisleriana Op.16; 3. Sehr aufgeregt (ex. 2) 5 Argerich, Brendel, Horowitz, Vogt, Vorraber
Liszt Bagatelle sans tonalité, S.216a 4 Bavouzet, Brendel, Katsaris, Gardon
Brahms 4 Klavierstücke Op.119, 2. Intermezzo E minor 5 Angelich, Ax, Serkin, Kempff, Vogt

Table a.2: Performances used in the Con Espressione Game, as described in Cancino-
Chacón et al. [37]

a.6 the con espressione dataset


The Con Espressione Dataset was constructed using the responses gathered
from the Con Espressione Game5 , wherein participants listened to extracts from
recordings of selected solo piano pieces (by composers such as Bach, Mozart,
Beethoven, Schumann, Liszt, Brahms) by a variety of different famous pianists
and were asked to describe, in free-text format, the expressive character of each
performance. There were 45 performances of 9 excerpts (see Table a.2), with the
length of the excerpts being between 27 seconds and 188 seconds. The online
questionnaire, where participants could enter their answers, was filled out by 194
participants, out of which 88% had some kind of music education – on average
11.7 years; 179 participants answered in English, 12 in German, and one each in
Russian, Spanish and Italian. On average, participants listened to the
performances of 4.5 out of 9 pieces, 27 participants listened to all the 45 excerpts.
Typical characterisations that came up were adjectives like “cold”, “playful”,
“dynamic”, “passionate”, “gentle”, “romantic”, “mechanical”, “delicate”, etc.
The dataset, compiled using the responses from the game, consists of a total of
1,515 individual responses, and scores in MusicXML format (including score-to-
performance alignments for all performances in the dataset).

Con Espressione Dataset at a glance

Number of excerpts 45
Total audio duration 1.0 hour
Number of responses 1515
Total terms 3166
Unique terms 1415
URL https://cpjku.github.io/con espressione game ismir2020/

5 con-espressione.cp.jku.at
BIBLIOGRAPHY

[1] Jakob Abeßer and Meinard Müller. “Towards Audio Domain Adaptation
for Acoustic Scene Classification using Disentanglement Learning.” In:
arXiv preprint arXiv:2110.13586 (2021).
[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal
Fua, and Sabine Süsstrunk. “SLIC superpixels compared to state-of-the-art
superpixel methods.” In: IEEE transactions on pattern analysis and machine
intelligence 34.11 (2012), pp. 2274–2282.
[3] Amina Adadi and Mohammed Berrada. “Peeking Inside the Black-Box:
A Survey on Explainable Artificial Intelligence (XAI).” In: IEEE Access 6
(2018), pp. 52138–52160.
[4] Kyle Adams. “On the metrical techniques of flow in rap music.” In: Music
Theory Online 15.5 (2009).
[5] Alexander Adli, Zensho Nakao, and Yanunori Nagata. “Calculating the
expected sound intensity level of solo piano sound in MIDI file.” In: SCIS
& ISIS SCIS & ISIS 2006. Japan Society for Fuzzy Theory and Intelligent
Informatics. 2006, pp. 731–736.
[6] Darius Afchar, Alessandro B Melchiorre, Markus Schedl, Romain Hen-
nequin, Elena V Epure, and Manuel Moussallam. “Explainability in Music
Recommender Systems.” In: arXiv preprint arXiv:2201.10528 (2022).
[7] Darius Afchar, Alessandro B. Melchiorre, Markus Schedl, Romain Hen-
nequin, Elena V. Epure, and Manuel Moussallam. “Explainability in Music
Recommender Systems.” In: ArXiv abs/2201.10528 (2022).
[8] Jessica Akkermans, Renee Schapiro, Daniel Müllensiefen, Kelly Jakubowski,
Daniel Shanahan, David Baker, Veronika Busch, Kai Lothwesen, Paul
Elvers, Timo Fischinger, et al. “Decoding emotions in expressive music
performances: A multi-lab replication and extension study.” In: Cognition
and Emotion 33.6 (2019), pp. 1099–1118.
[9] Anna Aljanaki and Mohammad Soleymani. “A Data-driven Approach to
Mid-level Perceptual Musical Feature Modeling.” In: Proceedings of the 19th
International Society for Music Information Retrieval Conference, ISMIR 2018,
Paris, France. 2018, pp. 615–621.
[10] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. “Developing
a Benchmark for Emotional Analysis of Music.” In: PloS one 12.3 (2017).
[11] Pedro Álvarez, A Guiu, José Ramón Beltrán, J García de Quirós, and
Sandra Baldassarri. “DJ-Running: An Emotion-based System for Recom-
mending Spotify Songs to Runners.” In: icSPORTS. 2019, pp. 55–63.


[12] André Araujo, Wade Norris, and Jack Sim. Computing Receptive Fields
of Convolutional Neural Networks. 2019. doi: 10.23915/distill.00021. url:
https://distill.pub/2019/computing-receptive-fields.
[13] Hussain-Abdulah Arjmand, Jesper Hohagen, Bryan Paton, and Nikki S
Rickard. “Emotional responses to music: Shifts in frontal brain asymmetry
mark periods of musical change.” In: Frontiers in psychology 8 (2017),
p. 2044.
[14] Alejandro Barredo Arrieta et al. “Explainable Artificial Intelligence (XAI):
Concepts, Taxonomies, Opportunities and Challenges toward Responsible
AI.” In: ArXiv abs/1910.10045 (2020).
[15] Taichi Asami, Ryo Masumura, Yoshikazu Yamaguchi, Hirokazu Masataki,
and Yushi Aono. “Domain Adaptation of DNN Acoustic Models using
Knowledge Distillation.” In: 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5185–5189.
[16] Jean-Julien Aucouturier, Francois Pachet, et al. “Music similarity measures:
What’s the use?” In: Ismir. 2002, pp. 13–17.
[17] Laura-Lee Balkwill, William Forde Thompson, and Rie Matsunaga. “Recognition of emotion in Japanese, Western, and Hindustani music by Japanese listeners.” In: Japanese Psychological Research 46.4 (2004), pp. 337–349.
[18] Eugene Y Bann and Joanna J Bryson. “The conceptualisation of emotion
qualia: Semantic clustering of emotional tweets.” In: Computational models
of cognitive processes: Proceedings of the 13th neural computation and psychology
workshop. World Scientific. 2014, pp. 249–263.
[19] Mathieu Barthet, György Fazekas, and Mark Sandler. “Music emotion
recognition: From content-to context-based models.” In: International sym-
posium on computer music modeling and retrieval. Springer. 2012, pp. 228–
252.
[20] Aimee Battcock and Michael Schutz. “Acoustically expressing affect.” In:
Music Perception: An Interdisciplinary Journal 37.1 (2019), pp. 66–91.
[21] Aimee Battcock and Michael Schutz. “Individualized interpretation: Ex-
ploring structural and interpretive effects on evaluations of emotional
content in Bach’s Well Tempered Clavier.” In: Journal of New Music Research
50.5 (2021), pp. 447–468.
[22] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando
Pereira, and Jennifer Wortman Vaughan. “A theory of learning from
different domains.” In: Machine learning 79.1 (2010), pp. 151–175.
[23] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. “Anal-
ysis of representations for domain adaptation.” In: Advances in neural
information processing systems 19 (2006).
[24] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. “Impossibility theo-
rems for domain adaptation.” In: Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics. JMLR Workshop and Con-
ference Proceedings. 2010, pp. 129–136.

[25] J. de Berardinis, A. Cangelosi, and E. Coutinho. “The Multiple Voices of Musical Emotions: Source Separation for Improving Music Emotion Recognition Models and Their Interpretability.” In: Proceedings of the 21st International Society for Music Information Retrieval Conference (2020), pp. 310–317.
[26] Patricia EG Bestelmeyer, Sonja A Kotz, and Pascal Belin. “Effects of emo-
tional valence and arousal on the voice perception network.” In: Social
cognitive and affective neuroscience 12.8 (2017), pp. 1351–1358.
[27] Umang Bhatt, Adrian Weller, and José M. F. Moura. “Evaluating and
Aggregating Feature-based Model Explanations.” In: Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI
2020. Ed. by Christian Bessiere. ijcai.org, 2020, pp. 3016–3022. doi: 10.24963/ijcai.2020/417.
[28] Rachel M Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris
Cannam, and Juan Pablo Bello. “Medleydb: A multitrack dataset for
annotation-intensive mir research.” In: ISMIR. Vol. 14. 2014, pp. 155–160.
[29] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Ger-
hard Widmer. “madmom: a new Python Audio and Music Signal Process-
ing Library.” In: Proceedings of the 24th ACM International Conference on
Multimedia. Amsterdam, The Netherlands, Oct. 2016, pp. 1174–1178. doi:
10.1145/2964284.2973795.
[30] Sebastian Böck and Gerhard Widmer. “Maximum filter vibrato suppres-
sion for onset detection.” In: Proc. of the 16th Int. Conf. on Digital Audio
Effects (DAFx). Maynooth, Ireland (Sept 2013). Vol. 7. 2013, p. 4.
[31] Diana Boer and Amina Abubakar. “Music listening in families and peer
groups: benefits for young people’s social cohesion and emotional well-
being across four cultures.” In: Frontiers in psychology 5 (2014), p. 392.
[32] Dmitry Bogdanov, Nicolas Wack, Emilia Gómez Gutiérrez, Sankalp Gulati,
Herrera Boyer, Oscar Mayor, et al. “Essentia: An audio analysis library for
music information retrieval.” In: International Society for Music Informa-
tion Retrieval (ISMIR). 2013, pp. 493–498.
[33] Margaret M Bradley, Mark K Greenwald, Margaret C Petry, and Peter
J Lang. “Remembering pictures: pleasure and arousal in memory.” In:
Journal of experimental psychology: Learning, Memory, and Cognition 18.2
(1992), p. 379.
[34] Roberto Bresin and Anders Friberg. “Emotion rendering in music: range
and characteristic values of seven musical variables.” In: Cortex 47.9 (2011),
pp. 1068–1081.
[35] Violet A Brown. “An introduction to linear mixed-effects modeling in
R.” In: Advances in Methods and Practices in Psychological Science 4.1 (2021),
p. 2515245920960351.

[36] Carlos E Cancino-Chacón, Maarten Grachten, Werner Goebl, and Gerhard Widmer. “Computational Models of Expressive Music Performance: A Comprehensive and Critical Review.” In: Frontiers in Digital Humanities 5
(2018), p. 25.
[37] Carlos Cancino-Chacón, Silvan Peter, Shreyan Chowdhury, Anna Aljanaki,
and Gerhard Widmer. “On the Characterization of Expressive Performance
in Classical Music: First Results of the Con Espressione Game.” In: Proceed-
ings of the 21st International Society for Music Information Retrieval Conference
(ISMIR). 2020.
[38] Baptiste Caramiaux and Marco Donnarumma. “Artificial intelligence in
music and performance: a subjective art-research inquiry.” In: Handbook of
Artificial Intelligence for Music. Springer, 2021, pp. 75–95.
[39] Oscar Celma, Perfecto Herrera, and Xavier Serra. “Bridging the music
semantic gap.” In: (2006).
[40] Sanga Chaki, Pranjal Doshi, Priyadarshi Patnaik, and Sourangshu Bhat-
tacharya. “Attentive RNNs for Continuous-time Emotion Prediction in
Music Clips.” In: AffCon@ AAAI. 2020, pp. 36–46.
[41] Tony Chan and Luminita Vese. “An active contour model without edges.”
In: International conference on scale-space theories in computer vision. Springer.
1999, pp. 141–151.
[42] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li.
“Mode Regularized Generative Adversarial Networks.” In: arXiv preprint
arXiv:1612.02136 (2016).
[43] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and
Jonathan K Su. “This looks like that: deep learning for interpretable image
recognition.” In: Advances in neural information processing systems 32 (2019).
[44] Zhengyang Chen, Shuai Wang, and Yanmin Qian. “Self-supervised learn-
ing based domain adaptation for robust speaker verification.” In: ICASSP
2021-2021 IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP). IEEE. 2021, pp. 5834–5838.
[45] Shreyan Chowdhury, Andreu Vall, Verena Haunschmid, and Gerhard
Widmer. “Towards Explainable Music Emotion Recognition: The Route via
Mid-level Features.” In: Proceedings of the 20th International Society for Music
Information Retrieval Conference, ISMIR 2019. 2019. isbn: 9781732729919.
eprint: 1907.03572.
[46] Deryck Cooke. “The language of music.” In: (1959).
[47] Rémi Delbouys, Romain Hennequin, Francesco Piccoli, Jimena Royo-
Letelier, and Manuel Moussallam. “Music mood detection based on audio
and lyrics with deep neural net.” In: arXiv preprint arXiv:1809.07276 (2018).
[48] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Ima-
genet: A large-scale hierarchical image database.” In: 2009 IEEE conference
on computer vision and pattern recognition. Ieee. 2009, pp. 248–255.

[49] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun
Ting, Karthikeyan Shanmugam, and Payel Das. “Explanations based on
the missing: Towards contrastive explanations with pertinent negatives.”
In: Advances in neural information processing systems 31 (2018).
[50] Karen van Dijk. AI Song Contest. 2020. url: https://www.vprobroadcast.com/titles/ai-songcontest.html (visited on 09/29/2022).
[51] Matthias Dorfer and Gerhard Widmer. “Training general-purpose audio
tagging networks with noisy labels and iterative self-verification.” In:
Proceedings of the Detection and Classification of Acoustic Scenes and Events
2018 Workshop (DCASE2018). 2018, pp. 178–182.
[52] Finale Doshi-Velez and Been Kim. “Towards a rigorous science of inter-
pretable machine learning.” In: arXiv preprint arXiv:1702.08608 (2017).
[53] Tuomas Eerola, Anders Friberg, and Roberto Bresin. “Emotional expres-
sion in music: contribution, linearity, and additivity of primary musical
cues.” In: Frontiers in psychology 4 (2013), p. 487.
[54] Tuomas Eerola and Jonna K. Vuoskoski. “A comparison of the discrete
and dimensional models of emotion in music.” In: Psychology of Music
39.1 (2011), pp. 18–49. doi: 10.1177/0305735610362821.
[55] Paul Ekman and Wallace V Friesen. “Constants across cultures in the face
and emotion.” In: Journal of personality and social psychology 17.2 (1971),
p. 124.
[56] Anders Elowsson and Anders Friberg. “Modelling perception of speed in
music audio.” In: Proceedings of the Sound and Music Computing Conference.
Citeseer. 2013, pp. 735–741.
[57] Anders Elowsson and Anders Friberg. “Predicting the perception of per-
formed dynamics in music audio with ensemble learning.” In: The Journal
of the Acoustical Society of America 141.3 (2017), pp. 2224–2242.
[58] Mehmet Bilal Er and Ibrahim Berkan Aydilek. “Music emotion recognition
by using chroma spectrogram and deep visual features.” In: International
Journal of Computational Intelligence Systems 12.2 (2019), pp. 1622–1634.
[59] Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R Arabnia.
“A brief review of domain adaptation.” In: Advances in Data Science and
Information Engineering (2021), pp. 877–894.
[60] Pedro F Felzenszwalb and Daniel P Huttenlocher. “Efficient graph-based
image segmentation.” In: International journal of computer vision 59.2 (2004),
pp. 167–181.
[61] Francesco Foscarin, Katharina Hoedt, Verena Praher, Arthur Flexer, and
Gerhard Widmer. “Concept-Based Techniques for ”Musicologist-friendly”
Explanations in a Deep Music Classifier.” In: arXiv preprint arXiv:2208.12485
(2022).

[62] Anders Friberg and Anton Hedblad. “A comparison of perceptual ratings and computed audio features.” In: Proceedings of the 8th sound and music
computing conference. 2011, pp. 122–127.
[63] Anders Friberg, Erwin Schoonderwaldt, and Anton Hedblad. “Perceptual
ratings of musical parameters.” In: Gemessene Interpretation-Computergestützte
Aufführungsanalyse im Kreuzverhör der Disziplinen, Mainz: Schott (2011),
pp. 237–253.
[64] Anders Friberg, Erwin Schoonderwaldt, Anton Hedblad, Marco Fabi-
ani, and Anders Elowsson. “Using listener-based perceptual features as
intermediate representations in music information retrieval.” In: The Jour-
nal of the Acoustical Society of America 136.4 (2014), pp. 1951–1963. doi: 10.1121/1.4892767.
[65] Anders Friberg and Johan Sundberg. “Does music performance allude to
locomotion? A model of final ritardandi derived from measurements of
stopping runners.” In: The Journal of the Acoustical Society of America 105.3
(1999), pp. 1469–1484.
[66] Alf Gabrielsson. “Emotion perceived and emotion felt: Same or different?”
In: Musicae scientiae 5.1 suppl (2001), pp. 123–147.
[67] Alf Gabrielsson. “Studying emotional expression in music performance.”
In: Bulletin of the Council for Research in Music Education (1999), pp. 47–53.
[68] Alf Gabrielsson and Patrik N. Juslin. “Emotional Expression in Music
Performance: Between the Performer’s Intention and the Listener’s Ex-
perience.” In: Psychology of Music 24.1 (1996), pp. 68–91. doi: 10.1177/
0305735696241007.
[69] Alf Gabrielsson and Erik Lindström. “The influence of musical structure
on emotional expression.” In: (2001).
[70] Alf Gabrielsson and Erik Lindström. “The role of structure in the musical
expression of emotions.” In: Handbook of music and emotion: Theory, research,
applications (2010), pp. 367–400.
[71] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation
by backpropagation.” In: International conference on machine learning. PMLR.
2015, pp. 1180–1189.
[72] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo
Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky.
“Domain-adversarial training of neural networks.” In: The journal of machine
learning research 17.1 (2016), pp. 2096–2030.
[73] Shayan Gharib, Konstantinos Drossos, Emre Cakir, Dmitriy Serdyuk, and
Tuomas Virtanen. “Unsupervised adversarial domain adaptation for acous-
tic scene classification.” In: arXiv preprint arXiv:1808.05777 (2018).

[74] Juan Sebastián Gómez-Cañón, Estefanía Cano, Tuomas Eerola, Perfecto Herrera, Xiao Hu, Yi-Hsuan Yang, and Emilia Gómez. “Music emotion
recognition: Toward new, robust standards in personalized and context-
sensitive applications.” In: IEEE Signal Processing Magazine 38.6 (2021),
pp. 106–114.
[75] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative
adversarial nets.” In: Advances in neural information processing systems 27
(2014).
[76] Jacek Grekow. “Musical performance analysis in terms of emotions it
evokes.” In: Journal of Intelligent Information Systems 51.2 (2018), pp. 415–
437.
[77] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and
Alex Smola. “A kernel method for the two-sample-problem.” In: Advances
in neural information processing systems 19 (2006).
[78] Daniel W. Griffin and Jae S. Lim. “Signal estimation from modified short-
time Fourier transform.” In: ICASSP. 1983.
[79] Na He and Sam Ferguson. “Multi-view Neural Networks for Raw Audio-
based Music Emotion Recognition.” In: 2020 IEEE International Symposium
on Multimedia (ISM). IEEE. 2020, pp. 168–172.
[80] Patrick Hall and Andrew Burt. Why you should care about debugging machine
learning models. 2019. url: https://www.oreilly.com/radar/why-you-should-care-about-debugging-machine-learning-models/.
[81] Donghong Han, Yanru Kong, Jiayi Han, and Guoren Wang. “A survey of
music emotion recognition.” In: Frontiers of Computer Science 16.6 (2022),
pp. 1–11.
[82] Stephen Handel. Listening: An introduction to the perception of auditory events.
The MIT Press, 1993.
[83] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H
Friedman. The elements of statistical learning: data mining, inference, and
prediction. Vol. 2. Springer, 2009.
[84] Verena Haunschmid, Ethan Manilow, and Gerhard Widmer. audioLIME:
Listenable Explanations Using Source Separation. 13th International Workshop
on Machine Learning and Music. 2020. arXiv: 2008.00582 [cs.SD].
[85] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-
Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas
Eck. “Enabling Factorized Piano Music Modeling and Generation with the
MAESTRO Dataset.” In: International Conference on Learning Representations.
2019. url: https://openreview.net/forum?id=r1lYRjC9F7.
[86] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual
Learning for Image Recognition.” In: CoRR abs/1512.03385 (2015). arXiv:
1512.03385. url: http://arxiv.org/abs/1512.03385.

[87] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual
Learning for Image Recognition.” In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2016, pp. 770–778.
[88] Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam.
“Spleeter: a Fast and Efficient Music Source Separation Tool with Pre-
trained Models.” In: Journal of Open Source Software 5.50 (2020). Deezer
Research, p. 2154. doi: 10.21105/joss.02154.
[89] Kate Hevner. “Experimental studies of the elements of expression in
music.” In: The American journal of psychology 48.2 (1936), pp. 246–268.
[90] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge
in a Neural Network.” In: arXiv preprint arXiv:1503.02531 (2015).
[91] Andre Holzapfel, Bob Sturm, and Mark Coeckelbergh. “Ethical dimen-
sions of music information retrieval technology.” In: Transactions of the
International Society for Music Information Retrieval 1.1 (2018), pp. 44–55.
[92] Cheng-Zhi Anna Huang, Hendrik Vincent Koops, Ed Newton-Rex, Monica
Dinculescu, and Carrie J Cai. “AI song contest: Human-AI co-creation in
songwriting.” In: arXiv preprint arXiv:2010.05388 (2020).
[93] Moyuan Huang, Wenge Rong, Tom Arjannikov, Nan Jiang, and Zhang
Xiong. “Bi-modal deep boltzmann machine based musical emotion classi-
fication.” In: International Conference on Artificial Neural Networks. Springer.
2016, pp. 199–207.
[94] Colin Humphries, Merav Sabri, Kimberly Lewis, and Einat Liebenthal. “Hi-
erarchical organization of speech perception in human auditory cortex.”
In: Frontiers in neuroscience 8 (2014), p. 406.
[95] David Huron. “Perceptual and cognitive applications in music information
retrieval.” In: Perception 10.1 (2000), pp. 83–92.
[96] Patrik N Juslin. “Emotional communication in music performance: A
functionalist perspective and some data.” In: Music perception 14.4 (1997),
pp. 383–418.
[97] Patrik N Juslin. “Emotional reactions to music.” In: The Oxford handbook of
music psychology (2016), pp. 197–213.
[98] Patrik N Juslin. Musical emotions explained: Unlocking the secrets of musical
affect. Oxford University Press, USA, 2019.
[99] Patrik N Juslin, Simon Liljeström, Daniel Västfjäll, and Lars-Olov Lundqvist.
“How does music evoke emotions? Exploring the underlying mechanisms.”
In: (2010).
[100] Patrik N Juslin and Daniel Västfjäll. “Emotional responses to music: The
need to consider underlying mechanisms.” In: Behavioral and brain sciences
31.5 (2008), pp. 559–575.
[101] Rainer Kelz and Gerhard Widmer. “Towards Interpretable Polyphonic
Transcription with Invertible Neural Networks.” In: Proceedings of the 20th
International Society for Music Information Retrieval Conference, ISMIR 2019,
Delft, The Netherlands, November 4-8, 2019. 2019, pp. 376–383.

[102] Sameer Khurana, Niko Moritz, Takaaki Hori, and Jonathan Le Roux.
“Unsupervised domain adaptation for speech recognition via uncertainty
driven self-training.” In: ICASSP 2021-2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6553–6557.
[103] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. “Examples are not
enough, learn to criticize! criticism for interpretability.” In: Advances in
neural information processing systems 29 (2016).
[104] Youngmoo E Kim, Erik M Schmidt, Raymond Migneco, Brandon G Morton,
Patrick Richardson, Jeffrey Scott, Jacquelin A Speck, and Douglas Turnbull.
“Music emotion recognition: A state of the art review.” In: Proceedings of the
11th International Society for Music Information Retrieval Conference, ISMIR
2010. Vol. 86. 2010, pp. 937–952.
[105] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic
Optimization.” In: 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
2015.
[106] Pang Wei Koh and Percy Liang. “Understanding black-box predictions via
influence functions.” In: International conference on machine learning. PMLR.
2017, pp. 1885–1894.
[107] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma
Pierson, Been Kim, and Percy Liang. “Concept bottleneck models.” In:
International Conference on Machine Learning. PMLR. 2020, pp. 5338–5348.
[108] Khaled Koutini, Hamid Eghbal-Zadeh, Matthias Dorfer, and Gerhard
Widmer. “The Receptive Field as a Regularizer in Deep Convolutional
Neural Networks for Acoustic Scene Classification.” In: 2019 27th European
signal processing conference (EUSIPCO). IEEE. 2019, pp. 1–5.
[109] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. “Receptive
field regularization techniques for audio classification and tagging with
deep convolutional neural networks.” In: IEEE/ACM Transactions on Audio,
Speech, and Language Processing 29 (2021), pp. 1987–2000.
[110] Wouter M Kouw and Marco Loog. “An introduction to domain adaptation
and transfer learning.” In: arXiv preprint arXiv:1812.11806 (2018).
[111] Carol L Krumhansl. Cognitive foundations of musical pitch. Oxford University
Press, 2001.
[112] Solomon Kullback and Richard A Leibler. “On information and suffi-
ciency.” In: The annals of mathematical statistics 22.1 (1951), pp. 79–86.
[113] Olivier Lartillot, Tuomas Eerola, Petri Toiviainen, and Jose Fornari. “Multi-
Feature Modeling of Pulse Clarity: Design, Validation and Optimization.”
In: ISMIR. Citeseer. 2008, pp. 521–526.
[114] Olivier Lartillot, Petri Toiviainen, and Tuomas Eerola. “A matlab toolbox
for music information retrieval.” In: Data analysis, machine learning and
applications. Springer, 2008, pp. 261–268.

[115] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht.
“Sliced wasserstein discrepancy for unsupervised domain adaptation.”
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2019, pp. 10285–10295.
[116] Jing Li, Hongfei Lin, and Lijuan Zhou. “Emotion tag based music re-
trieval algorithm.” In: Asia Information Retrieval Symposium. Springer. 2010,
pp. 599–609.
[117] Tao Li and Mitsunori Ogihara. “Detecting emotion in music.” In: (2003).
[118] Dan Liu, Lie Lu, and Hong-Jiang Zhang. “Automatic mood detection from
acoustic music data.” In: (2003).
[119] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri,
Je-Won Kang, Jonghye Woo, et al. “Deep unsupervised domain adaptation:
a review of recent advances and perspectives.” In: APSIPA Transactions on
Signal and Information Processing 11.1 (2022).
[120] Xin Liu, Qingcai Chen, Xiangping Wu, Yan Liu, and Yang Liu. “CNN
based music emotion classification.” In: arXiv preprint arXiv:1704.05665
(2017).
[121] Beth Logan and Ariel Salomon. “A Music Similarity Function Based on
Signal Analysis.” In: ICME. 2001, pp. 22–25.
[122] Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent
with Warm Restarts.” In: International Conference on Learning Representations.
2017. url: https://openreview.net/forum?id=Skq89Scxx.
[123] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. “Understanding
the effective receptive field in deep convolutional neural networks.” In:
Advances in neural information processing systems 29 (2016).
[124] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using
t-SNE.” In: Journal of machine learning research 9.11 (2008).
[125] Karl F MacDorman, Stuart Ough, and Chin-Chang Ho. “Automatic emotion
prediction of song excerpts: Index construction, algorithm design, and em-
pirical comparison.” In: Journal of New Music Research 36.4 (2007), pp. 281–
299.
[126] Guy Madison and Johan Paulin. “Ratings of speed in real music as a
function of both original and manipulated beat tempo.” In: The Journal of
the Acoustical Society of America 128.5 (2010), pp. 3032–3040.
[127] Jens Madsen, Bjørn Sand Jensen, and Jan Larsen. “Predictive modeling
of expressed emotions in music using pairwise comparisons.” In: Interna-
tional Symposium on Computer Music Modeling and Retrieval. Springer. 2012,
pp. 253–277.
[128] Ricardo Malheiro, Renato Panda, Paulo JS Gomes, and Rui Pedro Paiva.
“Bi-Modal Music Emotion Recognition: Novel Lyrical Features and Dataset.”
In: 9th International Workshop on Music and Machine Learning–MML 2016–in
conjunction with the European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases–ECML/PKDD 2016. 2016.

[129] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar,
Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis
in python.” In: Proceedings of the 14th python in science conference. Vol. 8.
2015, pp. 18–25.
[130] Gary J McKeown and Ian Sneddon. “Modeling continuous self-report
measures of perceived emotion using generalized additive mixed models.”
In: Psychological methods 19.1 (2014), p. 155.
[131] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and
Aram Galstyan. “A survey on bias and fairness in machine learning.” In:
ACM Computing Surveys (CSUR) 54.6 (2021), pp. 1–35.
[132] Albert Mehrabian. “Basic dimensions for a general psychological theory:
Implications for personality, social, environmental, and developmental
studies.” In: (1980).
[133] Tim Miller. “Explanation in artificial intelligence: Insights from the social
sciences.” In: Artificial intelligence 267 (2019), pp. 1–38.
[134] Luca Mion and Giovanni De Poli. “Score-independent audio features for
description of music expression.” In: IEEE Transactions on Audio, Speech,
and Language Processing 16.2 (2008), pp. 458–466.
[135] Saumitra Mishra, Bob L. Sturm, and Simon Dixon. “Local Interpretable
Model-Agnostic Explanations for Music Content Analysis.” In: Proceedings
of the 18th International Society for Music Information Retrieval Conference,
ISMIR 2017, Suzhou, China, October 23-27, 2017. 2017, pp. 537–543.
[136] Christine Mohn, Heike Argstatter, and Friedrich-Wilhelm Wilker. “Percep-
tion of six basic emotions in music.” In: Psychology of Music 39.4 (2011),
pp. 503–517.
[137] Christoph Molnar. Interpretable Machine Learning. A Guide for Making Black
Box Models Explainable. https://christophm.github.io/interpretable-ml-book/. 2019.
[138] Mitchell Ohriner. “Lyric, rhythm, and non-alignment in the second verse
of Kendrick Lamar’s “Momma”.” In: Music Theory Online 25.1 (2019).
[139] Richard Orjesek, Roman Jarina, Michal Chmulik, and Michal Kuba. “DNN
Based Music Emotion Recognition from Raw Audio Signal.” In: 29th
International Conference Radioelektronika 2019 (RADIOELEKTRONIKA). IEEE.
2019, pp. 1–4.
[140] Andrew Ortony and Terence J Turner. “What’s basic about basic emo-
tions?” In: Psychological review 97.3 (1990), p. 315.
[141] Elias Pampalk, Simon Dixon, and Gerhard Widmer. “Exploring music
collections by browsing different views.” In: Computer Music Journal 28.2
(2004), pp. 49–62.
[142] Renato Eduardo Silva Panda. “Emotion-based analysis and classification
of audio music.” PhD thesis. Universidade de Coimbra, 2019.

[143] Renato Eduardo Silva Panda, Ricardo Malheiro, Bruno Rocha, António
Pedro Oliveira, and Rui Pedro Paiva. “Multi-modal music emotion recog-
nition: A new dataset, methodology and comparative analysis.” In: 10th In-
ternational Symposium on Computer Music Multidisciplinary Research (CMMR
2013). 2013, pp. 570–582.
[144] Renato Panda, Ricardo Manuel Malheiro, and Rui Pedro Paiva. “Audio
Features for Music Emotion Recognition: a Survey.” In: IEEE Transactions
on Affective Computing (2020), pp. 1–1. doi: 10.1109/TAFFC.2020.3032373.
[145] Renato Panda, Ricardo Malheiro, and Rui Pedro Paiva. “Novel audio
features for music emotion recognition.” In: IEEE Transactions on Affective
Computing 11.4 (2018), pp. 614–626.
[146] Alessia Pannese, Marc-André Rappaz, and Didier Grandjean. “Metaphor
and music emotion: Ancient views and future directions.” In: Consciousness
and Cognition 44 (2016), pp. 61–71.
[147] Jose Pinheiro, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, and R Core
Team. nlme: Linear and Nonlinear Mixed Effects Models. R package version
3.1-152. 2021. url: https://CRAN.R-project.org/package=nlme.
[148] Robert Plutchik. “The nature of emotions: Human emotions have deep
evolutionary roots, a fact that may explain their complexity and provide
tools for clinical practice.” In: American scientist 89.4 (2001), pp. 344–350.
[149] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas
F. Ehmann, and Xavier Serra. “End-to-end Learning for Music Audio
Tagging at Scale.” In: 19th International Society for Music Information Retrieval
Conference (ISMIR2018). Paris, 2018.
[150] Jonathan Posner, James A Russell, Andrew Gerber, Daniel Gorman, Tiziano
Colibazzi, Shan Yu, Zhishun Wang, Alayar Kangarlu, Hongtu Zhu, and
Bradley S Peterson. “The neurophysiological bases of emotion: An fMRI
study of the affective circumplex using emotion-denoting words.” In:
Human brain mapping 30.3 (2009), pp. 883–895.
[151] Jonathan Posner, James A Russell, and Bradley S Peterson. “The circum-
plex model of affect: An integrative approach to affective neuroscience,
cognitive development, and psychopathology.” In: Development and psy-
chopathology 17.3 (2005), pp. 715–734.
[152] Romila Pradhan, Jiongli Zhu, Boris Glavic, and Babak Salimi. “Inter-
pretable Data-Based Explanations for Fairness Debugging.” In: arXiv
preprint arXiv:2112.09745 (2021).
[153] Verena Praher, Katharina Prinz, Arthur Flexer, and Gerhard Widmer. “On
the Veracity of Local, Model-agnostic Explanations in Audio Classification:
Targeted Investigations with Adversarial Examples.” In: arXiv preprint
arXiv:2107.09045 (2021).
[154] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark
Chen. “Hierarchical text-conditional image generation with clip latents.”
In: arXiv preprint arXiv:2204.06125 (2022).

[155] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. “”Why Should I
Trust You?”: Explaining the Predictions of Any Classifier.” In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM, 2016,
pp. 1135–1144. doi: 10.1145/2939672.2939778.
[156] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. “”Why Should I
Trust You?”: Explaining the Predictions of Any Classifier.” In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM, 2016,
pp. 1135–1144. doi: 10.1145/2939672.2939778.
[157] Peter J Richerson, Robert Boyd, and Joseph Henrich. “Gene-culture coevo-
lution in the age of genomics.” In: Proceedings of the National Academy of
Sciences 107.Supplement 2 (2010), pp. 8985–8992.
[158] Peter J Rousseeuw and Katrien Van Driessen. “A fast algorithm for the
minimum covariance determinant estimator.” In: Technometrics 41.3 (1999),
pp. 212–223.
[159] James A Russell. “A circumplex model of affect.” In: Journal of personality
and social psychology 39.6 (1980), p. 1161.
[160] James A Russell. “Core affect and the psychological construction of emo-
tion.” In: Psychological review 110.1 (2003), p. 145.
[161] James A Russell and Beverly Fehr. “Fuzzy concepts in a fuzzy hierarchy:
varieties of anger.” In: Journal of personality and social psychology 67.2 (1994),
p. 186.
[162] Ulrich Schimmack and Alexander Grob. “Dimensional models of core
affect: A quantitative comparison by means of structural equation model-
ing.” In: European Journal of Personality 14.4 (2000), pp. 325–345.
[163] Erik M Schmidt and Youngmoo E Kim. “Learning emotion-based acoustic
features with deep belief networks.” In: 2011 IEEE workshop on applications
of signal processing to audio and acoustics (Waspaa). IEEE. 2011, pp. 65–68.
[164] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna
Vedantam, Devi Parikh, and Dhruv Batra. “Grad-cam: Visual explanations
from deep networks via gradient-based localization.” In: Proceedings of the
IEEE international conference on computer vision. 2017, pp. 618–626.
[165] Yading Song, Simon Dixon, Marcus T Pearce, and Andrea R Halpern.
“Perceived and induced emotion responses to popular music: Categorical
and dimensional models.” In: Music Perception: An Interdisciplinary Journal
33.4 (2016), pp. 472–492.
[166] Erik Strumbelj and Igor Kononenko. “An efficient explanation of individ-
ual classifications using game theory.” In: The Journal of Machine Learning
Research 11 (2010), pp. 1–18.
[167] Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. “Unsupervised Do-
main Adaptation through Self-Supervision.” In: arXiv preprint arXiv:1909.11825
(2019).

[168] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbig-
niew Wojna. “Rethinking the inception architecture for computer vision.”
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 2818–2826.
[169] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and
Zbigniew Wojna. “Rethinking the Inception Architecture for Computer
Vision.” In: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2016), pp. 2818–2826.
[170] Robert E Thayer. The biopsychology of mood and arousal. Oxford University
Press, 1990.
[171] WIRED.com. Jacob Collier Plays the Same Song In 18 Increasingly Complex
Emotions — WIRED. 2020. url: https://www.youtube.com/watch?v=EWHpdmDHrn8 (visited on 09/07/2022).
[172] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng,
and Tao Qin. “Generalizing to unseen domains: A survey on domain
generalization.” In: arXiv preprint arXiv:2103.03097 (2021).
[173] David Watson, Lee Anna Clark, and Auke Tellegen. “Development and
validation of brief measures of positive and negative affect: the PANAS
scales.” In: Journal of personality and social psychology 54.6 (1988), p. 1063.
[174] Lage Wedin. “A multidimensional study of perceptual-emotional qualities
in music.” In: Scandinavian journal of psychology 13.1 (1972), pp. 241–257.
[175] Felix Weninger, Florian Eyben, and Björn Schuller. “On-line continuous-
time music mood regression with deep recurrent neural networks.” In:
2014 IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE. 2014, pp. 5412–5416.
[176] Gerhard Widmer. “Applications of machine learning to music research:
Empirical investigations into the phenomenon of musical expression.”
In: Machine Learning, Data Mining and Knowledge Discovery: Methods and
Applications. Wiley & Sons, Chichester (UK) (1998).
[177] Gerhard Widmer. “Getting closer to the essence of music: The Con Espres-
sione Manifesto.” In: ACM Transactions on Intelligent Systems and Technology
(TIST) 8.2 (2017), p. 19.
[178] Minz Won, Sanghyuk Chun, and Xavier Serra. “Toward Interpretable
Music Tagging with Self-Attention.” In: CoRR abs/1906.04972 (2019). arXiv:
1906.04972.
[179] Cheng Yang. Content-based music retrieval on acoustic data. Stanford Univer-
sity, 2003.
[180] Dan Yang and Won-Sook Lee. “Disambiguating Music Emotion Using
Software Agents.” In: ISMIR. Vol. 4. 2004, pp. 218–223.
[181] Li-Chia Yang and Alexander Lerch. “On the evaluation of generative mod-
els in music.” In: Neural Computing and Applications 32.9 (2020), pp. 4773–
4784.

[182] Yi-Hsuan Yang, Yu-Ching Lin, Ya-Fan Su, and Homer H Chen. “Music
emotion classification: A regression approach.” In: 2007 IEEE International
Conference on Multimedia and Expo. IEEE. 2007, pp. 208–211.
[183] Marcel Zentner, Didier Grandjean, and Klaus R Scherer. “Emotions evoked
by the sound of music: characterization, classification, and measurement.”
In: Emotion 8.4 (2008), p. 494.
[184] Kejun Zhang, Hui Zhang, Simeng Li, Changyuan Yang, and Lingyun Sun.
“The PMEmo Dataset for Music Emotion Recognition.” In: Proceedings of
the 2018 ACM on International Conference on Multimedia Retrieval. ICMR ’18.
Yokohama, Japan: ACM, 2018, pp. 135–142. isbn: 978-1-4503-5046-4. doi:
10.1145/3206025.3206037.
[185] Youshan Zhang. “A Survey of Unsupervised Domain Adaptation for
Visual Recognition.” In: arXiv preprint arXiv:2112.06745 (2021).
[186] Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gor-
don. “On learning invariant representations for domain adaptation.” In:
International Conference on Machine Learning. PMLR. 2019, pp. 7523–7532.
