Modelling Emotional Expression in Music Using Interpretable and Transferable Perceptual Features

Shreyan Chowdhury

Doctoral Thesis
to obtain the academic degree of
Doktor der technischen Wissenschaften

Submitted at the Institute of Computational Perception

Supervisor and First Evaluator: Gerhard Widmer

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696
Shreyan Chowdhury: Modelling Emotional Expression in Music Using Interpretable
and Transferable Perceptual Features, © September 2022
ABSTRACT
ACKNOWLEDGMENTS
This thesis would not have been possible without the support of several people.
First of all, I am deeply grateful to my supervisor Gerhard Widmer for his
mentorship and guidance throughout my time as a PhD candidate at the Institute
of Computational Perception. Thank you for teaching me how to ask the right
research questions, how to translate vague ideas into concrete steps, and how to
properly communicate scientific research.
I would also like to thank my second evaluator, Prof. Peter Flach, for taking the
time and effort to review this thesis.
My PhD journey at the institute was exciting and enjoyable, and I have my
friends and colleagues to thank for this. Thanks to Verena Praher for the many
discussions and collaborations, and for all the fun we had during conferences;
Andreu Vall for helping me with my very first machine learning experiments,
and for all the conversations about life and work; Florian Henkel for helping me
plan my defence, and for all the conversations about music and guitar; Khaled
Koutini for helping me with my numerous machine learning questions; Carlos
Cancino-Chacón, Silvan Peter, and Hamid Eghbal-zadeh for helping me with
my research and allowing me to brainstorm ideas; Lukáš Martak for keeping
the music alive and for the various jam sessions; Rainer Kelz for the many
philosophical mini-discussions over lunch; Luı́s Carvalho for being an amazing
office mate; Alessandro Melchiorre for the Easter lunches, for the puzzles, and
for always having something fun to do; and Charles Brazier for being the life of
the party, in all parties. Thanks also to Andreas Arzt, Matthias Dorfer, Harald
Frostel, and Filip Korzeniowski, who helped me immensely when I began my
PhD and made me feel at home.
I am also grateful to Claudia Kindermann for all the administrative help and
support she provided me that made my life in Linz easier.
I feel lucky to be able to call a bunch of amazing people my best friends –
Ankit, Reha, Ritika, Vaishnavi, and Zain. I could write an entire book about
you folks, but for now, I will let a simple “thank you” convey my gratitude.
I also feel lucky to have met Vishnupriya in Linz – thank you for all the jam
sessions, conversations, lunches, dinners, and hikes, and for being one of my
closest friends. Thanks also to Venkat, who I feel fortunate to have known since
my undergraduate days, and who has always offered his friendship and support
in the sincerest ways. I would also like to thank Ashis Pati, who I have always
looked up to as a musician, as a researcher, and as a human being.
I am eternally grateful for the unconditional love and support of my family –
my brother Ryan, my parents, my grandparents, my aunt, and my cousin Ritwika.
They have been my anchor always. I also thank the Sakharwade family for their
support and encouragement.
Lastly, I would like to thank my partner Nitica. Without your support, this PhD
would surely not have been possible. Thank you for your unwavering belief in
me, and for your never-ending encouragement and motivation that has kept me
going. A special thanks for your thorough proofreading of this thesis – nothing
evades your sharp eye.
The research reported in this thesis has been carried out at the Institute of
Computational Perception (Johannes Kepler University Linz, Austria) and has
been funded by the European Research Council (ERC) under the European
Union’s Horizon 2020 research and innovation programme, grant agreements No.
670035 (project "Con Espressione") and 101019375 ("Whither Music?").
Artwork created by the author of this thesis, with a little help from AI.
The text-to-image model DALL-E [154] was used to generate reference images, which
were then used as inspiration by the author for creating this digital painting.
CONTENTS
i background
2 a primer on music emotion recognition
2.1 Perceived, Induced, and Intended Emotions
2.2 Approaches to Music Emotion Recognition
2.3 Emotion Taxonomies
2.4 Challenges in Music Emotion Recognition
3 a primer on explainability in machine learning
3.1 Defining Explainability/Interpretability
3.2 Interpreting a Linear Regression Model
3.3 Interpreting Black-box Models Using LIME
3.4 Evaluation of Feature-based Explanations
3.5 Explainability in Music Information Retrieval
iii appendix
a datasets used in this thesis
a.1 The Mid-level Features Dataset
a.2 The Soundtracks Dataset
a.3 The PMEmo Dataset
a.4 The MAESTRO Dataset
a.5 The DEAM Dataset
a.6 The Con Espressione Dataset
bibliography
1 THIS THESIS IN A NUTSHELL
1 There is an important distinction between perceived and induced emotions [66]. Perceived emotion
refers to the emotion expressed or communicated by music, while induced emotion is felt by the
listener, in their body, in response to music.
2 Valence refers to the general positive or negative quality of an emotion, and arousal refers to the
intensity or degree to which the emotion is perceived. These emotion scales will be discussed in
detail in Chapter 2.
It becomes difficult to trust the predictions of a model, especially if they are being
used in critical domains such as healthcare or finance. Non-explainable models also
do not provide "actionable insights" about their predictions, since there are no
explicit "if/then" connections between the inputs and the outputs. This can make the
overall system less user-friendly.
Explainable Machine Learning, or Explainable AI (XAI), is a field of artificial
intelligence (AI) that aims at making models and model predictions understandable
by humans. Some models are interpretable by construction, such as linear
models and decision trees. In a linear model, the learned feature weights can be
interpreted as importance values. A decision tree produces an output based on
learned if/then/else rules that can be used to trace an input to its prediction, thus
providing full transparency. For other, more complex models, we need to either
introduce additional structural changes into the model, or analyse a particular
input-output pair using extrinsic algorithms.
There are several properties of explanations and explanation methods to con-
sider when developing useful interpretable machine learning systems. Relevant
to us in this thesis are: the expressive power of a method, and comprehensibility
of an explanation. Expressive power of a method refers to the ‘language’ or
structure of the explanations the method is able to generate. It could generate
if/then rules, decision trees, a weighted sum, natural language or something else
[137]. Comprehensibility of an explanation refers to how easily the explanations
themselves are understood by the target audience. When dealing with music
emotion models, it makes sense to explain predictions on the basis of features
that are “musically meaningful” and are informative for a human analysing the
model. This motivates the use of mid-level perceptual features (musically relevant
features that can be understood by most humans) in our work.
The basic principles and methods of explainability in machine learning are
described in more detail in Chapter 3, which serves as a primer to this topic for
the interested reader.
Explanations based on intermediate features still leave the black-box between the
actual inputs and the intermediate layer unexplained. In this chapter, we address
this by proposing a two-level explanation approach aimed at explaining the mid-
level predictions using components from the input sample (“Trace”). We explore
two approaches to decompose the input into components: 1) using spectrogram
segments, and 2) using sound sources (individual instrument tracks) of the input
music. To explain positive and negative effect of the components on mid-level
predictions, we use LIME (Local Interpretable Model-agnostic Explanations)
[155] and a variant of LIME for audio components, audioLIME [84]. We also
demonstrate the utility of this method in debugging a biased emotion model
that overestimates the valence for hip-hop songs. (This is joint work with my
colleague Verena Praher (née Haunschmid).)
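To make the trace idea concrete, here is a minimal sketch of a LIME-style perturbation analysis over time segments of a spectrogram, written with plain NumPy and scikit-learn rather than the actual lime or audioLIME packages used in the thesis; the function predict_midlevel, the segmentation into n_components equal time slices, and the exponential proximity weighting are illustrative assumptions, not the exact procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain_with_lime(spectrogram, predict_midlevel, n_components=8,
                      n_samples=500, random_state=0):
    """Attribute one mid-level prediction to time segments of a spectrogram.

    predict_midlevel: callable mapping a spectrogram to a scalar prediction
    (e.g. the 'articulation' output of a mid-level model).
    """
    rng = np.random.default_rng(random_state)
    n_frames = spectrogram.shape[1]
    # Split the time axis into equally sized segments ("components").
    bounds = np.linspace(0, n_frames, n_components + 1, dtype=int)

    # Draw random on/off masks over components and evaluate the model
    # on the corresponding perturbed inputs.
    masks = rng.integers(0, 2, size=(n_samples, n_components))
    preds = np.empty(n_samples)
    for i, mask in enumerate(masks):
        perturbed = spectrogram.copy()
        for c, keep in enumerate(mask):
            if not keep:  # silence the switched-off segment
                perturbed[:, bounds[c]:bounds[c + 1]] = 0.0
        preds[i] = predict_midlevel(perturbed)

    # Weight samples by similarity to the original (all-ones) mask and fit
    # a linear surrogate; its coefficients are the explanation.
    weights = np.exp(-(n_components - masks.sum(axis=1)) / n_components)
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return surrogate.coef_  # positive/negative effect of each segment

# Toy usage with a fake spectrogram and a dummy "model".
spec = np.random.rand(149, 469)
importances = explain_with_lime(spec, lambda s: s[:, :100].mean())
print(np.round(importances, 3))
```

The same procedure applies when the components are separated instrument tracks instead of time segments; only the perturbation step changes.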
Next, we zero in on one of our goals described earlier: to be able to capture the
subtle variations in expressive character between recordings of different pianists
playing the same piece of music. To this end, it is necessary to ensure that any
model we train on the available datasets also works well on solo piano music.
Following evidence from Cancino-Chacón et al. [37], which showed that mid-level
features are effective in modelling expressive character of piano performances,
we choose to transfer our mid-level model to solo piano performances. However,
due to relatively few solo piano clips being present in the Mid-level Dataset (on
which the mid-level model was trained), it is hard to justify using the model on
data consisting entirely of that genre of music. Thus, we use an adaptive training
strategy for the mid-level feature extractor using unsupervised domain adaptation
[71], improving its performance for the target domain, and in turn, giving a better
modelling of expressive character. We also propose a novel ensemble-based self-
training method to further improve the performance of the final adapted
model. This chapter details our methods and approach for this domain-adaptive
transfer (“Transfer”).
Delving deeper into the effectiveness of mid-level features for capturing and
explaining expressivity and emotion in piano performance, in this chapter, we
take a focused look at modelling the perceived arousal and valence for Bach’s
Well-Tempered Clavier Book 1 performed by six different famous pianists. We
compare mid-level features with three other feature sets – low-level audio features,
score-based features (derived from the musical score), and features derived from
a deep neural network trained end-to-end on music emotion data. We specifically
quantify how well these feature sets explain emotion variation 1) between
pieces, and 2) between different performances of the same piece. We find that
in addition to an overall effective modelling of emotion, mid-level features also
capture performance-wise variation better than the other features. This indicates
the usefulness of these features in our overarching goal of modelling subtle
variations in emotional expression between different performances of the same
piece (“Disentangle”). We also test the features on their generalisation capacity
for outlier performances – those performances (one for each piece) that are most
distant from the rest on the arousal-valence plane, and are held-out during
training. We find that mid-level features outperform the other feature sets in this
test, thereby indicating their robustness and generalisation capacity.
1.2 contributions
This thesis makes several novel advances in the area of explainability in music
information retrieval. The main contributions are summarised as follows:
1.3 publications
The main chapters of this thesis build on the following publications (in order of
appearance in the thesis chapters):
BACKGROUND

2 A PRIMER ON MUSIC EMOTION RECOGNITION
Music Emotion Recognition (MER) is a task – under the broader field of Music
Information Retrieval (MIR) – that aims at developing computer systems capable
of recognising the emotional content in music, or the emotional impact of music
on a listener. MER is an interdisciplinary area that combines research from music
psychology, audio signal processing, machine learning, and natural language
processing. Research on emotional analysis of music has a long history, dating
back to the 1930’s [89], but has gained newfound interest in recent decades due
to development of technologies that have enabled direct application of emotion
recognition systems. The availability of large volumes of good quality music
recordings in digital format, the development of search-and-retrieval systems
for music, streaming platforms, the progress in digital signal processing and
machine learning have all enabled interest in, and development of, automatic
music emotion recognition systems [104].
The aim of this chapter is to provide a brief overview of current and past
approaches in music emotion recognition, while also covering some aspects of
the psychology of music emotion that are relevant for this thesis. We begin by
noting the different types of emotion – perceived, induced, and intended – that are
important for setting the scope of emotion datasets and recognition approaches.
Next, we describe past works in music emotion recognition, followed by an
in-depth look at the typical pipeline for a MER system that involves dataset
collection, model training, and model evaluation. We then explore some of the
models for naming and representing emotions from studies in psychology that
are relevant to music emotion recognition. Here we will also discuss Russell’s
two-dimensional model of representing emotions [159], which is used extensively
in this thesis.
2.1 perceived, induced, and intended emotions
• Perceived emotion: concerns the emotion the listener identifies when listen-
ing to a song, which may be different from what the composer attempted
to express and what the listener feels in response to it.
• Induced emotion: relates to the emotion that is felt by (evoked in) the
listener in response to the song. Also referred to as elicited emotion.

• Intended emotion: refers to the emotion that the composer or performer
intends to express or communicate through the music.
The relation between perceived and induced emotions has been a subject
of discussion among researchers, highlighting the complex nature of music
emotion and its manifestations. An illustrative example is the so-called ‘paradox
of negative emotion’, where music generally characterised as conveying negative
emotions (e.g., sadness, depression, anger) is often judged as enjoyable [146].
Most MER systems aim at recognising perceived emotion. This is because per-
ceived emotion is a “sonic-based phenomenon, tightly linked to auditory percep-
tion, and consisting in the listener’s attribution of emotional quality to music”
[146]. It tends to have a high inter-rater agreement when compiling emotion
data from listeners (different listeners are more likely to agree on the perceived
emotion independently of musical training or culture) [142]. On the other hand,
induced emotion is an individual phenomenon, influenced largely by personal
experiences, memory, context, and pre-existing mood.
We note here that the datasets used and experiments conducted in this thesis
all concern perceived emotion, as we are interested in analysing the emotion
decoded by listeners from the music content. Most of the datasets available in
the literature on music emotion also describe perceived emotion, which makes
it possible for our models (presented later in this thesis) to learn from various
different emotion datasets. However, in a demonstration of real-time emotion
recognition in Chapter 8, we have access to a musician’s intended emotions, and
we visualise the recognised emotions alongside the intended emotions.
2.2 approaches to music emotion recognition 12
Some of the early works on automatic music emotion recognition were done in
the 2000s. Huron [95] explored methods to characterise musical mood using
emotion representation models, and Liu et al. [118] used Gaussian Mixture Models
(GMMs) to predict musical mood from low-level features extracted from audio
content (such as autocorrelation-based tempo, root-mean-square (RMS) energy
from the time domain signal, and spectral features such as spectral centroid
and bandwidth). Yang and Lee [180] framed music emotion intensity prediction
as a regression problem and used Support Vector Regression (SVR) to predict
emotional intensity from low-level acoustic features. Li and Ogihara [117] used
a Support Vector Machine (SVM) based multi-label classification approach to
classify music emotion into thirteen adjective groups, and six supergroups. They
used a small dataset (499 audio clips) annotated by one person, from which
30 low-level acoustic features were extracted to train and evaluate the model.
Yang et al. [182] cast the goal as a regression problem, and used Multiple Linear
Regression (MLR), SVR, and AdaBoost to train regressors for predicting arousal
and valence values.1 They used a dataset annotated by several human listeners
who were briefed on the purpose of the experiment, the emotion annotation
procedure, and the emotion scales.
Over the years, researchers expanded the feature space using features such
as MFCCs (Mel-Frequency Cepstral Coefficients), periodicity histograms, and
fluctuation patterns. Dataset sizes and experiments also grew, and techniques
such as dimensionality reduction (using Principal Component Analysis) started
being used for better emotion modelling [125, 134]. More recently (post
2010), there have been several works using deep learning models such as Long
Short Term Memory (LSTM), deep belief networks, and Boltzmann machines [93,
163]. Weninger et al. [175] used segmented feature extraction and deep recur-
rent networks to predict continuous emotion across time. Chaki et al. [40] used
attention-based LSTMs for the same goal. Several modern methods use Convo-
lutional Neural Networks (CNNs, or ConvNets) for automatic feature learning
from mel-spectrogram inputs. Delbouys et al. [47] used ConvNets for feature
learning from a large music collection of around 18,000 songs, and feed-forward
dense layers attached to the ConvNet for predicting arousal and valence.
The interested reader may refer to Kim et al. [104] for an extensive survey of
early MER methods, and to Han et al. [81] for an extensive survey of modern
MER methods.
An important factor to consider during the four stages of the pipeline is whether
the music emotion recognition system should output static or dynamic emotion
predictions. Static emotion prediction for a song means that the model takes
into account the entire song and outputs a single instance of predicted values
of emotion descriptors for the song. In dynamic emotion prediction, the model
predicts emotions for several different points along the song – often the prediction
windows are as small as a few seconds. In this thesis, we will principally deal
with static emotion recognition (however, we will also demonstrate dynamic
emotion recognition in Chapter 8).
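To illustrate the distinction, the sketch below contrasts a static prediction (one arousal-valence estimate for a whole clip) with a dynamic one (an estimate per sliding window); the feature matrix, window sizes, and the toy linear model are hypothetical placeholders, not the systems used in this thesis.

```python
import numpy as np

def static_prediction(model, features):
    """One emotion estimate for the whole song."""
    return model(features.mean(axis=0))  # pool over time, predict once

def dynamic_prediction(model, features, win=30, hop=15):
    """One emotion estimate per sliding window (e.g. a few seconds each)."""
    preds = []
    for start in range(0, len(features) - win + 1, hop):
        preds.append(model(features[start:start + win].mean(axis=0)))
    return np.array(preds)

# Toy usage: 300 frames of 20-dimensional features, dummy linear "model".
feats = np.random.rand(300, 20)
w = np.random.rand(20, 2)                      # maps features to (arousal, valence)
model = lambda x: x @ w
print(static_prediction(model, feats).shape)   # (2,)
print(dynamic_prediction(model, feats).shape)  # (n_windows, 2)
```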
Among the four components described above, we will describe taxonomy
definition and feature extraction in detail in this chapter. Dataset creation is
beyond the scope of this thesis, but the existing datasets used in this thesis are
described in Appendix A. Model training and evaluation depend on the particular
context in which the MER system is to be used, and we describe this for
our context in the main chapters of this thesis. First we look at feature engineering
and extraction; emotion taxonomies are described in Section 2.3.
2.3 emotion taxonomies
Figure 2.1: Traditional MER Pipeline, adapted from Gómez-Cañón et al. [74]
The most natural ways for people to express emotions are through facial expressions and
words. Discrete emotion models are based around the mode of expression that
uses words, or more generally groups of words. For emotions transmitted using
language, Ekman’s set of emotion words constitute a semantically distinct set
that spans a wide range of emotions. In the context of music emotion, certain
words can convey slightly different or nuanced emotions, which is what makes
Hevner’s model and the Geneva scale relevant.
Figure 2.2: Hevner’s adjective circle, redrawn from Hevner [89] (colours added to the
clusters by the present author). Each cluster contains adjectives with similar
meaning in terms of music emotion, and neighbouring clusters represent close
emotions. One adjective in each cluster, that describes the cluster, is marked
in bold.
that all of the six basic emotions are detectable in music. One drawback
of this model is that it is not easy to define other nuanced emotions in
terms of these basic emotions, which may lead one to question the notion
of “basic-ness” of these emotions [140].
Figure 2.3: Russell’s circumplex, adapted from Figure 2 of Russell [159]
3 A PRIMER ON EXPLAINABILITY IN MACHINE LEARNING
3.1 defining explainability/interpretability
In Section 3.5, we lay out the scope of explainability in music information retrieval
and some of the questions that research in this area is addressing.
Figure 3.2: Two ways to describe explainable AI (XAI) approaches are shown here. XAI
Methods can be described based on the scope of application (global, where
explanations are derived pertaining to the overall behaviour of the model, vs.
local, where explanations for a specific input are derived) or mode of appli-
cation (model agnostic, methods that could be applied on any model without
investigating model parameters, vs. model specific, methods that depend on
the model type).
• Novelty: Does the explanation take into account and/or report whether a
data instance to be explained comes from a region far removed from the
distribution of the training data? In such cases, the model may be inaccurate
and the explanation may be useless. The concept of novelty is related to the
concept of certainty. The higher the novelty, the more likely it is that the
model will have low certainty due to lack of data.
While it may not always be possible in practice to quantify all of these proper-
ties, they form a strong foundation to aid the design of interpretability methods.
In addition to these properties for individual explanations, it is also useful to
consider properties of explanation methods. The reader is referred to Section
3.5 of Molnar [137] (Properties of Explanations) for further reading on this topic.
Doshi-Velez and Kim [52] also provide good descriptions of taxonomies of inter-
pretability evaluation, considering the application, human users, and functional
tasks, which may provide valuable insights into the philosophy of interpretability
for the interested reader.
the simple relationship between their inputs and outputs. The easiest way to
achieve interpretability in a system is to use one of these as the predictive model.
Let us see how by taking a closer look at interpreting a linear regression model.
Linear regression is a linear approach for modelling the relationship between a
scalar response (also known as a dependent variable) and one or more explanatory
variables (also known as independent variables). A linear regression model
predicts the target as a weighted sum of the feature inputs. The linearity of the
learned relationship makes the interpretation easy [137].
Stating this mathematically, given that we have an input vector $\mathbf{x} = (x_1, x_2, \ldots, x_p)$
and want to predict a real-valued output $y$, the linear regression model has the
form

$$f(\mathbf{x}) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j \qquad (3.1)$$
• Binary feature: A feature that takes one of two possible values for each
instance. It is represented by the numerical value 1 if the feature is present
and 0 otherwise. When its value is 1, it changes the estimated outcome by
the feature's weight.

The importance of a feature can be measured by the absolute value of its
t-statistic, i.e., its estimated weight scaled by its standard error:

$$t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \qquad (3.2)$$
The estimated weights depend on the scale of the features and thus analysing the
weights alone may not always be a meaningful approach for feature importance.
To get around the issue of differing scales, we can calculate the effect a feature has
on the output, which is simply the value of the feature multiplied by its weight:
$$\text{effect}_j^{(i)} = \hat{\beta}_j \, x_j^{(i)} \qquad (3.3)$$

where $x_j^{(i)}$ is the value of feature $j$ for instance $i$ and $\hat{\beta}_j$ is the weight estimate
of the feature. The feature effects for an entire dataset are plotted as boxplots to
depict the amount and range of effect each feature has on the output, as shown
in Figure 3.3b. Note that weight and effect may have opposite signs.
Feature C
Feature B
Feature A
Feature C
Feature B
Feature A
Figure 3.3: Visualising weights of a linear model corresponding to three features and
their effects on the output. Effects are calculated by multiplying feature values
by weight for all instances in a dataset. Note that weight and effect may have
opposite signs.
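The following sketch shows what the weight and effect computations described above look like in code, using scikit-learn's LinearRegression on synthetic data; the three feature names are hypothetical placeholders mirroring Figure 3.3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 0.1]   # features on very different scales
y = X @ np.array([2.0, -0.3, 5.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)

# Weights (beta_hat_j): directly interpretable, but scale-dependent.
for name, w in zip(["Feature A", "Feature B", "Feature C"], model.coef_):
    print(f"{name}: weight = {w:+.3f}")

# Effects (Eq. 3.3): weight times feature value, per instance.
effects = X * model.coef_          # shape (n_instances, n_features)
print("effect ranges per feature:",
      np.round(effects.min(axis=0), 2), np.round(effects.max(axis=0), 2))
```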
This is analogous to the high stability property. If inputs are close to each other
and their model outputs are similar, then their explanations should also be close:

$$\mu_S(f, g, r, \mathbf{x}) = \int_{\mathbf{z} \in \mathcal{N}_r} D\big(g(f, \mathbf{x}), g(f, \mathbf{z})\big)\, P_{\mathbf{x}}(\mathbf{z})\, d\mathbf{z} \qquad (3.4)$$
Faithfulness (or fidelity) measures how closely the explanation function $g$ reflects
the model being explained. One way to do this, as given in Bhatt et al. [27], is by
measuring the correlation between the sum of attributions (or importance values
assigned to features) of a subset of features of an input sample $\mathbf{x}$ and the difference in
the output of $f$ when these features are set to a reference baseline. For a
subset of indices $T \subseteq \{1, 2, \ldots, d\}$, $\mathbf{x}_t = \{x_i, i \in T\}$ denotes a sub-vector of input
features that partitions the input, $\mathbf{x} = \mathbf{x}_t \cup \mathbf{x}_c$.
$\mathbf{x}_{[\mathbf{x}_t = \bar{\mathbf{x}}_t]}$ denotes an input where $\mathbf{x}_t$ is set to a reference baseline while $\mathbf{x}_c$ remains
unchanged:

$$\mu_F(f, g, \mathbf{x}) = \mathrm{corr}\left( \sum_{i \in T} g(f, \mathbf{x})_i,\; f(\mathbf{x}) - f\big(\mathbf{x}_{[\mathbf{x}_t = \bar{\mathbf{x}}_t]}\big) \right) \qquad (3.5)$$
When the explanation $g$ is itself a predictive model (a surrogate), its fidelity over a set
of $N$ data points can also be quantified by the coefficient of determination between the
outputs of the black-box model $f$ and those of the surrogate:

$$\mu_F(f, g, \mathbf{x}) = R^2 = 1 - \frac{\sum_{j=1}^{N} \big(f(\mathbf{x})_j - g(\mathbf{x})_j\big)^2}{\sum_{j=1}^{N} \big(f(\mathbf{x})_j - \bar{f}(\mathbf{x})\big)^2} \qquad (3.6)$$
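As a rough illustration of Equation 3.5, the sketch below estimates the faithfulness correlation for a toy linear black-box with a zero baseline; sampling random feature subsets T is one possible way to compute the correlation, and the attribution vector here is simply the exact linear contribution of each feature.

```python
import numpy as np

def faithfulness(f, attributions, x, baseline, n_subsets=200, subset_size=3, seed=0):
    """Correlate summed attributions of masked features with the change in f."""
    rng = np.random.default_rng(seed)
    attr_sums, output_drops = [], []
    for _ in range(n_subsets):
        T = rng.choice(len(x), size=subset_size, replace=False)
        x_masked = x.copy()
        x_masked[T] = baseline[T]              # set x_T to the reference baseline
        attr_sums.append(attributions[T].sum())
        output_drops.append(f(x) - f(x_masked))
    return np.corrcoef(attr_sums, output_drops)[0, 1]

# Toy usage: a linear "black box", whose exact attributions are w * x.
w = np.array([1.0, -2.0, 0.5, 3.0, 0.0, 1.5])
f = lambda v: float(v @ w)
x = np.array([0.2, 1.0, -0.5, 0.3, 2.0, -1.0])
print(faithfulness(f, w * x, x, baseline=np.zeros_like(x)))  # close to 1.0
```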
An explanation that uses too many features to explain a model's output may be
difficult for a human user to understand. It is thus desirable to obtain an "efficient"
explanation – one that provides maximal information about the prediction using
a minimal number of features. This often leads to a trade-off with fidelity: using
all features to explain a prediction may be faithful to the model, but too
complex for a user to understand. Bhatt et al. [27] define complexity using the
fractional contribution distribution:
fractional contribution distribution:
| g ( f , x )i |
P g (i ) = ; Pg = {Pg (1), . . . Pg (d)} (3.7)
∑ j∈[d] | g( f , x) j |
d
µC ( f , g, x) = Ei −ln(Pg ) = − ∑ Pg (i )ln(Pg (i ))
(3.8)
i =1
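Correspondingly, the complexity of Equations 3.7 and 3.8 is just the entropy of the fractional contribution distribution; the sketch below computes it for two hypothetical attribution vectors.

```python
import numpy as np

def complexity(attributions, eps=1e-12):
    """Entropy of the fractional contribution distribution (Eqs. 3.7-3.8)."""
    p = np.abs(attributions) / (np.abs(attributions).sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

sparse = np.array([0.9, 0.05, 0.03, 0.01, 0.01])   # one dominant feature
diffuse = np.array([0.2, 0.2, 0.2, 0.2, 0.2])      # importance spread evenly
print(complexity(sparse), complexity(diffuse))     # lower value = simpler explanation
```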
The criteria discussed above will play a role in Chapter 5 of this thesis, where
we will use these to evaluate explanations derived from LIME.
[Figure content: feature-based, example-based, and knowledge-graph-based explanations; one example reads "... you love the band 13th Floor Elevators that pioneered psychedelic rock in the 60's and we thought its continuation in the 70's may interest you".]
Figure 3.4: Examples of possible explanations for a music recommender system. Image
recreated from Afchar et al. [6].
The task of modelling musical emotion using machine learning has been on the
horizon of music information retrieval since its early days [176] but has received
major renewed interest in recent years owing to the development of end-to-end
deep learning models that ingest unstructured data such as audio spectrograms
and waveforms and learn to predict the given high-level musical attribute directly
[58, 79, 139]. This leads to an open problem of understanding what relations such
a model is learning between such unstructured input data and abstract output
concepts. In addition to the epistemic motivation of understanding these relations
in a musical sense, the need to ensure that such predictions can be trusted when
used in a downstream task prompts one to look within the black-box model using
techniques of explainability.
Some of the most interesting scientific questions lie at the intersection of disci-
plines. One such question, straddling the fields of perception, psychology,
musicology, and computer science, relates to the disentanglement of the effects
of the performer from the effects of the underlying musical composition on per-
ceived emotion and musical expression [8, 68]. Can explainable machine learning
shed some light on the pursuit of this elusive idea that defies traditional analytic
methods and machine learning models? Let us take the first step in this direction
by modelling musical audio data through the lens of perceptually motivated
features in the following chapter.
Part II
4 PERCEIVE: MUSIC EMOTION AND MID-LEVEL FEATURES
voice’, ‘a fast tempo’, ’the melody’, ’the excellent performance’. Other factors
included situational factors (27%) e.g. ‘the weather’, memory factors (24%) e.g.
‘nostalgic recognition’, lyrics (10%), and pre-existing mood (9%). Since factors
such as a listener’s pre-existing mood and memory are not accessible to computer
systems, music emotion recognition systems typically use acoustic features that
are extracted from audio recordings using signal processing methods. Looking
at the listener responses for musical factors again, we note that the responses
relate to intuitive musical features such as ’the singing voice’, and ’the melody’,
which may not have obvious analytical definitions and thus are not accurately
represented (or extracted) using traditional signal processing methods. How, then,
could such human descriptions of musical elements be approximated using a
computer?
In this chapter, we approach music emotion modelling using mid-level perceptual
features. We propose a method to model perceived music emotion from audio
recordings using these features in a way that also provides explanations for the
emotion predictions. Mid-level features are qualities (such as rhythmic complexity,
or perceived major/minor harmonic character) that are musically meaningful
and intuitively recognisable by most listeners, without requiring music-theoretic
knowledge. It has been shown previously that there is considerable consistency in
human perception of these features, that they can be predicted relatively well from
audio recordings, and that they also relate to the perceived emotional qualities of
the music [9]. To incorporate interpretability into a deep learning music emotion
model using these features, we propose a bottleneck architecture, which first
predicts the mid-level features from audio, and consequently predicts the emotion
from these mid-level features using a linear regression model. Interpretability
is introduced in this scheme due to two factors: 1) a small number of musically
meaningful, perceptually motivated features as explanatory variables, and 2) the
linear regression part of the model, which is by construction an interpretable
model (see Section 3.2).
This chapter is organised as follows. We first look at music from a perceptual
standpoint in Section 4.1, where we discuss some of the previous research that
has gone into perceptual features for music. Next, in Section 4.2, we describe our
proposed bottleneck architecture and the three different schemes of training such
a model. Following this, in Section 4.3 we explore the datasets that we would
be using to train the explainable emotion model. We use the Mid-level Features
Dataset [9], and the Soundtracks dataset [54]. In Section 4.4, we describe the
training process in detail including performance metrics and our results. Finally,
in Section 4.5, we look at generating model-level and song-level explanations of
emotion predictions using weight plots and effects plots.
This chapter is broadly based on the following publication:
Auditory perception is not only the passive reception of sensory signals, but it
is also shaped by learning, memory, expectation, and attention. Physical sonic
events are sensed by the ears, and these response signals from the inner ear
are transformed in the brain, thus manifesting as perception. In a musical con-
text, acoustic events are sensed and perceived as musical factors; for instance a
regularly-timed pulse train is perceived as having a certain rhythm [82].
The separation between the physical world of auditory events and the per-
ceptual experience of musical factors, as noted above, has been recognised in
music information retrieval (MIR) research as well. Traditionally, MIR applica-
tions have relied on extracting information from audio (usually in digital format)
using signal processing methods to describe the audio and music content at
the lowest (close-to-signal) level of representation. Features capturing this infor-
mation include time-domain features like amplitude envelope, energy content,
zero-crossing rate, frequency-domain features like spectral flux, spectral centroid,
mel-frequency cepstral coefficients (MFCCs), and the statistical properties of
these features across time. These features were typically engineered through
trial-and-error or intuition and have little relevance to how sound and music is
perceived by humans. This discrepancy has been referred to in the literature as
the semantic gap: the gap between these low-level descriptors and the auditory
concepts that listeners use to relate and interact with music [39]. Figure 4.1 depicts
the different levels of features starting from low-level features at the bottom to
semantic descriptors on the top.
This led to attempts to come up with better features more relevant in the
musical context. Some examples include using spectral envelopes to identify
timbral similarities [16, 121] and capturing useful features from the rhythm of a
song by using periodicity histograms [141] and temporal sequences [179]. More
Figure 4.1: A hierarchy of features, roughly depicting the experience of musical audition
in humans, adapted from Fig. 1 of Friberg et al. [64]. From an auditory signal,
we sense low-level features like pitch and intensity. These are then organ-
ised and interpreted by the brain into what we may refer to as “perceptual
features”, and subsequently processed into higher level aspects like emotion
and understanding, where context, memory, and experience also play a role.
This idea of multi-step processing of higher level aspects has been outlined
previously in Gabrielsson and Lindström [70].
1. Speed (slow – fast): Indicates the general speed of the music disregarding
any deeper analysis such as the tempo, and is easy for both musicians and
non-musicians to relate to [126].
2. Rhythmic Clarity (flowing – firm): Indicates how well the rhythm is accentu-
ated disregarding the actual rhythmic pattern. This would presumably be
similar to pulse clarity as modelled by Lartillot et al. [113].
1 Further details about these features can also be found in Friberg et al. [63].
The concept of perceptual features for emotional expression in music has been
discussed prior to this work. Juslin’s lens model [96] considers proximal cues as a
medium of emotion transmission, which are similar to the perceptual features
of Friberg. Gabrielsson and Lindström [70] also described the idea of multi-level
processing of emotional expression in music. However, Friberg’s work is one of
the first to use a direct determination of these perceptual features using ratings
obtained from human participants.
Friberg conducted a listening experiment with musical stimuli consisting of a
set of ringtones and film music, with the participants rating each of the perceptual
features along a 9-step Likert scale. In a separate experiment, participants were
asked to rate emotions. For all the ratings, inter-rater agreement and correlations
were calculated. The inter-rater correlations were lower for the features that a
priori would be more difficult to rate, like harmonic complexity and rhythmic
complexity. Additionally, and somewhat unexpectedly, most of the feature ratings
showed modest correlations with one another, and some, such as pitch and timbre,
showed a strong correlation (correlation coefficient r = 0.90). The authors mention two probable
causes: covariation in the music examples, or listeners not being able to isolate
each feature as intended.
Nevertheless, an interesting finding from these ratings is that they appear to
hold good predictive power for the emotion dimensions of energy and valence.
The perceptual features dynamics, speed, articulation, and modality together ex-
plained 91% variance in energy, while modality, dynamics, and harmonic complexity
explained 78% variance in valence.
The fact that a handful of features are able to explain a significant amount of
variation in emotion is important for us from an interpretability perspective (recall
the property of comprehensibility from Section 3.1.1 and the complexity metric
from Section 3.4). But how accurately can these perceptual features be predicted
from audio content? Friberg experimented with predicting the perceptual feature
ratings from several low-level features extracted from the audio using tools such
as MIRToolbox [114]. These features included some of the features mentioned
earlier: zero-crossing rate, MFCCs (Mel-Frequency Cepstral Coefficients), spectral
centroid, spectral flux, RMS, event density, silence ratio, pulse clarity, etc. However,
it was found that these features are not able to model the perceptual features
nearly as well as desired. In the best cases, about 70% of the variation in a perceptual
feature was explained by low-level features. This motivates the development of
models targeted specifically toward each perceptual feature.

Noting that perceptual features are not predicted sufficiently well from audio
using low-level features, we turn to deep end-to-end models. These models are
known to learn relevant features from data. The current state of the art in audio
machine learning involves time-frequency representations of audio (spectrograms)
as image-like inputs to convolutional neural networks. We choose to use models
2 The terms “mid-level features” and “perceptual features” have been used interchangeably here.
Going forward, we will just use the term “mid-level features”. In and after Section 4.3, where seven
specific mid-level features are introduced, “mid-level features” will refer particularly to those seven
features, unless mentioned otherwise.
3 Our model architecture along with experimental results were first published in Chowdhury et al.
[45]. The following year, Koh et al. [107] published a paper formally defining “Concept Bottleneck
Models” in the computer vision domain. While the basic idea of the architectures and the training
schemes is essentially identical in both, in this thesis we adopt the naming convention for the
training schemes (Section 4.2.2) from Koh et al. [107], since their names for the training schemes
are more general. One important difference between our architecture and theirs is that in our case,
we assert for the mapping between the mid-level (or concept) layer and the final output layer to be
linear, in order to have this mapping fully interpretable.
Figure 4.2: The mid-level bottleneck model learns to map inputs to mid-level features
on an intermediate layer, and subsequently predicts the final emotion values
using these mid-level feature values. The connection between the mid-level
layer and the emotion output is linear, lending interpretability in terms of
learnt weights.
are derived entirely from the mid-level feature layer, thus relying completely on
the information passing through this bottleneck. In our implementations, the
first part of the model, ĝ, consists of a convolutional (feature-extractor) part, and
an adaptive pooling and linear mapping part denoted as φ̂. The purpose of φ̂
is to map the features extracted by the convolutional model (which may have
a dimensionality not equal to k) to the k-dimensional mid-level space. Further,
we choose f to be a linear model as well, allowing the mid-level-to-emotion
part of the model to be completely transparent. We shall see in Section 4.5 how
explanations can be derived using this linear architecture.
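A minimal PyTorch sketch of this layout is given below; the convolutional backbone is reduced to a toy two-layer stack (the actual VGG-ish and RF-ResNet backbones appear in Table 4.1), but the overall structure follows the description above: a feature extractor, an adaptive-pooling-plus-linear mapping to the k = 7 mid-level features, and a purely linear map from mid-level features to the 8 emotion outputs.

```python
import torch
import torch.nn as nn

class MidLevelBottleneck(nn.Module):
    """Audio -> 7 mid-level features -> 8 emotions, with a linear final map."""

    def __init__(self, n_midlevel=7, n_emotions=8):
        super().__init__()
        # g: convolutional feature extractor (toy stand-in for the real backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
        )
        # phi: adaptive pooling + linear map into the k-dimensional mid-level space
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.to_midlevel = nn.Linear(128, n_midlevel)
        # f: linear, hence fully interpretable, mid-level -> emotion map
        self.to_emotion = nn.Linear(n_midlevel, n_emotions)

    def forward(self, spectrogram):
        h = self.pool(self.backbone(spectrogram)).flatten(1)
        midlevel = self.to_midlevel(h)
        emotion = self.to_emotion(midlevel)
        return midlevel, emotion

# Toy usage: batch of 4 spectrograms of size 149 x 469 (bands x frames).
model = MidLevelBottleneck()
mid, emo = model(torch.randn(4, 1, 149, 469))
print(mid.shape, emo.shape)   # torch.Size([4, 7]) torch.Size([4, 8])
```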
Assuming that we have labelled data for the mid-level and the emotion task, and
that both of these are regression problems, the bottleneck architecture described
above can be trained in three different ways. We have the training data points
$\{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)}, \mathbf{m}^{(i)})\}_{i=1}^{n}$, where $\mathbf{m} \in \mathbb{R}^k$ is a vector of $k$ mid-level features. Let $L_M :
\mathbb{R}^k \times \mathbb{R}^k \mapsto \mathbb{R}_+$ be a loss function that measures the discrepancy between the
predicted and true mid-level feature vectors, and let $L_E : \mathbb{R}^p \times \mathbb{R}^p \mapsto \mathbb{R}_+$ measure
the discrepancy between predicted and true emotion vectors. Then, we have the
following ways to learn the bottleneck model $(\hat{f}, \hat{g})$:
1. The Independent Scheme: The model components $f$ and $g$ are trained independently
using the respective ground-truth data: $\hat{f} = \arg\min_f L_E(f(\mathbf{m}), \mathbf{y})$
and $\hat{g} = \arg\min_g L_M(g(\mathbf{x}), \mathbf{m})$. Note that during training, $f$ is trained
with the ground-truth values of the mid-level features, while at test time
it takes $\hat{g}(\mathbf{x})$ as its input.
2. The Sequential Scheme: The model components $f$ and $g$ are trained sequentially.
While $\hat{g}$ is learned in the same way as above, $\hat{f}$ is learned using the
outputs of the trained $\hat{g}$ as its inputs: $\hat{f} = \arg\min_f L_E(f(\hat{g}(\mathbf{x})), \mathbf{y})$.
Figure 4.3: Three schemes for training the bottleneck model. In the independent scheme,
the two model parts are trained independently from their respective datasets.
In the sequential scheme, f takes the outputs of trained ĝ as its inputs. In the
joint scheme, the entire model is trained as a whole by combining the loss
signals from both outputs.
3. The Joint Scheme: The model components $f$ and $g$ are trained jointly,
similar to a multi-task optimisation process. The losses of the two tasks
are added (with their relative weights controlled by a parameter $\lambda$) and
the entire model is optimised together: $(\hat{f}, \hat{g}) = \arg\min_{f,g} \left[\lambda\, L_M(g(\mathbf{x}), \mathbf{m}) + L_E(f(g(\mathbf{x})), \mathbf{y})\right]$.
(A compact sketch of all three training schemes is given below.)
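The sketch below outlines these schemes in PyTorch with mean-squared-error losses; TinyBottleneck is a toy stand-in (its input is a flat feature vector rather than a spectrogram), the analytic fitting of the linear model f in the independent scheme is omitted, and the value λ = 2.0 anticipates the setting reported later in this chapter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBottleneck(nn.Module):
    """Stand-in for the bottleneck model: returns (mid-level, emotion)."""
    def __init__(self):
        super().__init__()
        self.g = nn.Linear(16, 7)     # pretend feature extractor + phi
        self.f = nn.Linear(7, 8)      # linear mid-level -> emotion map
    def forward(self, x):
        mid = self.g(x)
        return mid, self.f(mid)

def joint_step(model, opt, x, m, y, lam=2.0):
    """One step of argmin [lam * L_M(g(x), m) + L_E(f(g(x)), y)]."""
    midlevel, emotion = model(x)
    loss = lam * F.mse_loss(midlevel, m) + F.mse_loss(emotion, y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def sequential_inputs(model, x):
    """In the sequential scheme, f is fit on g's predictions rather than on
    the ground-truth mid-level ratings used by the independent scheme."""
    with torch.no_grad():
        midlevel, _ = model(x)
    return midlevel

model = TinyBottleneck()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, m, y = torch.randn(32, 16), torch.randn(32, 7), torch.randn(32, 8)
print(joint_step(model, opt, x, m, y))
```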
While the Inception v3 model used by Aljanaki and Soleymani [9] gives decent
performance in modelling mid-level features, it relies on large-scale pre-training
with additional data followed by fine-tuning on the mid-level data. We chose to
instead use a smaller architecture as our backbone to avoid the pre-training step.
We start with the architecture used by Dorfer and Widmer [51] and modify the
layers until we see no further improvement in validation set performance. Effec-
tively, we find that making the model shallower but wider improves performance.
This architecture is what we refer to in this thesis as “VGG-ish”. The layers of this
architecture are shown in Table 4.1a.
Residual neural networks [86] have been the architecture of choice in the computer
vision domain ever since their state-of-the-art performance in the Imagenet [48]
challenge. However, their use in the audio domain has been limited.
Koutini et al. [108] identified that if we reduce the receptive field5 of a typical
ResNet, it improves performance on audio tasks, even beating the VGG-ish
models. We use this architecture as our second backbone model, which we refer
to as the Receptive-Field Regularised ResNet, or "RF-ResNet". The layers of this
architecture are shown in Table 4.1b.
5 The receptive field of a convolutional network is defined as the size of the region in the input that
produces the feature [12].
6 https://osf.io/5aupt/
(a) VGG-ish:
Conv2D (k=5, s=2, p=2) [64]; BatchNorm2D [64] + ReLU
Conv2D (k=3, s=1, p=1) [64]; BatchNorm2D [64] + ReLU
MaxPool2D (k=2) + DropOut (0.2)
Conv2D (k=3, s=1, p=1) [128]; BatchNorm2D [128] + ReLU
Conv2D (k=3, s=1, p=1) [128]; BatchNorm2D [128] + ReLU
MaxPool2D (k=2) + DropOut (0.2)
Conv2D (k=3, s=1, p=1) [256]; BatchNorm2D [256] + ReLU
Conv2D (k=3, s=1, p=1) [256]; BatchNorm2D [256] + ReLU
Conv2D (k=3, s=1, p=1) [384]; BatchNorm2D [384] + ReLU
Conv2D (k=3, s=1, p=1) [512]; BatchNorm2D [512] + ReLU
Conv2D (k=3, s=1, p=1) [256]; BatchNorm2D [256] + ReLU

(b) RF-ResNet:
Conv2D (k=5, s=2, p=1) [128]; BatchNorm2D [128] + ReLU
Conv2D (k=3, s=1, p=1) [128]; BatchNorm2D [128] + ReLU
Conv2D (k=1, s=1) [128]; BatchNorm2D [128] + ReLU
MaxPool2D (k=2, s=2)
Conv2D (k=3, s=1, p=1) [128]; BatchNorm2D [128] + ReLU
Conv2D (k=3, s=1) [128] ×2; BatchNorm2D [128] + ReLU
MaxPool2D (k=2, s=2)
Conv2D (k=3, s=1, p=1) [256]; BatchNorm2D [256] + ReLU
Conv2D (k=3, s=1) [256]; BatchNorm2D [256] + ReLU
Conv2D (k=1, s=1, p=1) [512]; BatchNorm2D [512] + ReLU
Conv2D (k=1, s=1) [512]; BatchNorm2D [512] + ReLU
Table 4.1: Our two backbone model architectures: (a) the VGG-ish model and (b) the
receptive-field regularised residual network (RF-ResNet). The numbers in square
brackets give the number of channels at the output of the corresponding layer.
The RF-ResNet groups its convolutional layers into residual blocks with direct
identity connections between block input and output (not shown). k: kernel size,
s: stride, p: padding.
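As a reading aid for Table 4.1, the snippet below builds only the first block of the VGG-ish column in PyTorch; the remaining rows follow the same Conv2D + BatchNorm2D + ReLU pattern, and the input size merely mirrors the spectrogram dimensions used later.

```python
import torch
import torch.nn as nn

# First VGG-ish block from Table 4.1(a): Conv2D (k=5, s=2, p=2) [64] + BN + ReLU,
# then Conv2D (k=3, s=1, p=1) [64] + BN + ReLU, then MaxPool + Dropout.
first_block = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2), nn.Dropout(0.2),
)
print(first_block(torch.randn(1, 1, 149, 469)).shape)
```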
thesis. Thus, now would be a good opportunity to take a deeper look into this
dataset.
While Friberg et al. studied nine perceptual features, Aljanaki et al. use a reduced
set of seven (described in Table 4.2). They select their feature set from concepts
found recurring in literature [62, 69, 174].
Table 4.2: Perceptual mid-level features as defined in [9], along with questions that were
provided to human raters to help them interpret the concepts. (The ratings
were collected in a pairwise comparison scenario.) In the following, we will
refer to the last one (Modality) as ‘Minorness’, to make the core of the concept
clearer.
Although the names of these concepts are derived from musicology, their use
in the current context is not restricted to their formal definitions found in their
original context. For instance, the concept of articulation is defined formally for a
single note (it can also be extended to a group of notes). However, applying it to a
real-life recording with possibly several instruments and voices is not an easy task.
To ensure common understanding between the raters, a pairwise comparison
based strategy was adopted (described in Section 4.3.1), where participants were
asked to listen to two audio clips and rank them according to the questions listed
in Table 4.2. These questions also make up the “definitions” of the concepts, as
used in the current context. The general principle is to consider the recording as
a whole.
Music Selection
The music (in the form of audio files) in the dataset comes from five sources:
Jamendo (www.jamendo.com), Magnatune (magnatune.com), the Soundtracks
dataset [54], the Bi-modal Music Emotion dataset [128], and the Multi-modal
Music Emotion dataset [143]. Overall, there was a restriction of no more than five
songs from the same artist. From each selected song, a 15-second clip from the
middle of the song was extracted to construct the set of audio samples to be rated
by the human participants.
Comparing two items using a certain criterion is easier for humans than giving
a rating on an absolute scale [127]. Based on this, Aljanaki and Soleymani [9]
used pairwise comparisons (according to the seven mid-level descriptors listed
in Table 4.2) to get rankings for a small subset of the dataset, which was then
used to create an absolute scale on which the whole dataset was then annotated.
The annotators were required to have some musical education and were selected
based on passing a musical test. The ratings range from 1 to 10 and were scaled
by a factor of 0.1 before being used for our experiments. The distributions of the
annotations for the seven mid-level features are plotted in Figure 4.4.

Figure 4.4: Distributions of the annotations for the seven mid-level features: melodiousness, articulation, rhythm complexity, rhythm stability, dissonance, tonal stability, and minorness.
The Soundtracks7 (Stimulus Set 1) dataset, published by Eerola and Vuoskoski [54],
consists of 360 excerpts from 110 movie soundtracks. The excerpts come with
expert ratings for five emotions following the discrete emotion model (happy,
sad, tender, fearful, angry) and three emotions following the dimensional model
(valence, energy, tension). This makes it a suitable dataset for musically conveyed
emotions [54]. The ratings in the dataset range from 1 to 7.83 and were scaled
by a factor of 0.1 before being used for our experiments (see Figure 4.5 for the
distributions). All the songs in this set are also contained in the Mid-level Features
Dataset, and are annotated with the mid-level features, giving us both emotion
and mid-level ratings for these songs.
7 https://www.jyu.fi/hytk/fi/laitokset/mutku/en/research/projects2/past-projects/coe/
materials/emotion/soundtracks
Figure 4.5: Distributions of the emotion annotations in the Soundtracks dataset. (a) Dimensional emotions: valence, energy, tension. (b) Categorical emotions (anger, fear, happy, sad, tender); the magnitudes reflect how intensely an emotion is perceived for a song.
In order to enable comparisons with Aljanaki and Soleymani [9], our main
performance metric is going to be the same as the one used in that paper: Pearson
correlation coefficient (r) between the predicted and actual values of mid-level
features or emotions, as may be the case. Additionally, we also measure the
coefficient of determination (R²-score) and the root mean squared error (RMSE).
The performance metrics are averaged over 8 runs and the mean values are
reported. We would like our models to maximise the average Pearson correlation
coefficient across all mid-level features and across all emotions. An important
requirement for our bottleneck models is that they must not hamper the performance
of the main task (emotion prediction) compared to a non-bottleneck model.
Moreover, the performance of mid-level feature prediction
bottleneck model. Moreover, the performance of mid-level feature prediction
by the bottleneck model should be comparable to that from a non-bottleneck
model, for valid explanations. Thus, we require that the cost of explainability be
reasonably low for both these tasks. The Pearson correlation coefficient and the
Cost of Explainability (CoE) are defined below.
The Pearson correlation coefficient $r$ measures the linear relationship between two
datasets. Given paired data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ consisting of $n$ pairs, $r(x, y)$ is
defined as:

$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $n$ is the sample size, $x_i, y_i$ are the individual sample points indexed by $i$,
and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean (and analogously for $\bar{y}$).
It varies between −1 and +1 with 0 implying no correlation. Correlations of
−1 or +1 imply an exact linear relationship. Positive correlations imply that as x
increases, so does y. Negative correlations imply that as x increases, y decreases.
Because we introduce a bottleneck (viz. the 7 mid-level predictions) as the input
to the subsequent linear layer predicting emotions, our hypothesis is that doing so
should result in a decrease in emotion prediction performance relative to an A2E
model that predicts emotion directly from audio. We calculate this cost as
the difference in performance metrics between the two models for each emotion.
More precisely, we subtract the metric for the bottleneck model from that of the
end-to-end A2E model:

$$\mathrm{CoE} = \mu_{\mathrm{A2E}} - \mu_{\mathrm{A2Mid2E}}$$

where $\mu_{\mathrm{A2E}}$ is the performance metric for an emotion as obtained using the A2E
model, and $\mu_{\mathrm{A2Mid2E}}$ is the performance metric for the emotion as obtained using
the A2Mid2E model. When the performance metric is the Pearson correlation
coefficient or the R²-score, a positive CoE will indicate a reduction in performance
for that emotion caused by introducing the bottleneck, whereas a negative CoE
will indicate an improvement.
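As a small worked example, the snippet below computes r with NumPy and derives a CoE from it; the prediction vectors are made up for illustration only.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two vectors."""
    return float(np.corrcoef(x, y)[0, 1])

y_true = np.array([0.2, 0.5, 0.8, 0.3, 0.9, 0.4])
pred_a2e = np.array([0.25, 0.45, 0.75, 0.35, 0.85, 0.50])      # direct audio-to-emotion model
pred_a2mid2e = np.array([0.30, 0.40, 0.70, 0.40, 0.80, 0.55])  # bottleneck (A2Mid2E) model

r_a2e, r_a2mid2e = pearson_r(y_true, pred_a2e), pearson_r(y_true, pred_a2mid2e)
coe = r_a2e - r_a2mid2e    # positive CoE: the bottleneck costs some performance
print(round(r_a2e, 3), round(r_a2mid2e, 3), round(coe, 3))
```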
4.4.2 Preprocessing
The Mid-level Dataset contains audio clips 15 seconds in length each. In order to
be fed into our convolution-based models, we convert these to magnitude spectro-
grams. Each audio snippet is resampled at 22.05 kHz, with a frame size of 2048
samples and a frame rate of 31.3 frames per second, and amplitude-normalised
before computing the logarithmic-scaled spectrogram. The spectrograms are com-
puted using the LogarithmicFilteredSpectrogramProcessor from the madmom
[29] package, with a frequency resolution of 24 bands per octave, resulting in 149
frequency bands. We experimented with different snippet lengths to be fed into
the model, and found that longer snippets perform better. Therefore, we use the
entire 15 seconds of available audio for each clip, resulting in spectrograms of
size 469 × 149.
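A rough approximation of this preprocessing, written with librosa for illustration, is sketched below; the thesis itself uses madmom's LogarithmicFilteredSpectrogramProcessor, so the plain log-magnitude STFT here (without the 24-bands-per-octave filterbank that yields 149 bands) should be read as a simplification, and the file name is a placeholder.

```python
import numpy as np
import librosa

def log_spectrogram(path, sr=22050, n_fft=2048, fps=31.3):
    """Log-magnitude spectrogram at roughly 31.3 frames per second."""
    y, _ = librosa.load(path, sr=sr)
    y = y / (np.abs(y).max() + 1e-9)               # amplitude normalisation
    hop = int(round(sr / fps))                     # ~704 samples per hop
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # The thesis applies a logarithmic filterbank (24 bands/octave -> 149 bands)
    # via madmom before log compression; here we only log-compress the magnitudes.
    return np.log10(1.0 + mag).T                   # shape: (frames, frequency bins)

# spec = log_spectrogram("clip.wav")               # about 469 frames for 15 s of audio
```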
Figure 4.6: Training curve for mid-level baseline models. The RF-ResNet architecture
trains faster and gives a better performance than the VGG-ish model.
We train our models using the Adam optimiser [105] with a learning rate of 0.001
and a batch size of 8. After hyperparameter optimisation, we arrive at optimal
values of $\beta_1 = 0.73$ and $\beta_2 = 0.918$ for the two moment parameters of the Adam
optimiser, which differ from the default values of 0.9 and 0.999, respectively,
typically recommended in the literature.
Using a learning rate scheduler on top of the Adam optimiser results in signifi-
cant performance gains. In our experiments, we found that cosine annealing with
warm restarts [122] gave the biggest performance improvement. We additionally
use early stopping on the R²-score of the validation set to prevent over-fitting.
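In PyTorch, the optimiser and scheduler settings described above look roughly as follows; the restart period T_0 and the stand-in model are placeholders, since the exact scheduler hyperparameters are not restated here.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(7, 8)                       # stand-in for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.73, 0.918))   # tuned moment parameters
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)  # warm-restart cosine schedule

for epoch in range(3):                              # training loop skeleton
    # ... forward pass, loss, backward, optimizer.step() on each batch ...
    scheduler.step()                                # advance the learning-rate schedule
print(optimizer.param_groups[0]["lr"])
```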
The Mid-level Dataset is split into train-validation-test sets as described in
Aljanaki and Soleymani [9], with 8% of the data held out as the test set with no
common artists with the train or validation sets. The validation set is constructed
with 2% of data samples.
4.4.4 Baselines
Table 4.3: Pearson correlation coefficient between predictions and ground-truth values for
mid-level feature predictions using our models, compared with those reported
by Aljanaki and Soleymani [9].
Since our models are significantly smaller than the Inception v3 architecture, we
expect them not to overfit on the training data, which is a concern in the case of
the Inception v3 model. In our experiments, we found that reducing the size and
complexity of the models improves the performance up to a point, and this
performance is already better than the baseline of Aljanaki and Soleymani [9].
Therefore, when using our smaller and simpler models, pre-training is not necessary.
The metrics obtained using our models are given in Table 4.3, and the training curves
for the two backbone variants are shown in Figure 4.6.
Second, we use our backbone models to model the emotions directly, without
going through a mid-level bottleneck. This forms the non-bottleneck part of the
baseline, against which we will compute the cost of explanation. We train this
model using the audio and emotion annotations from the Soundtracks dataset.
As described earlier in Section 4.2.2, the independent scheme trains the two models
f and g separately, each using its own relevant annotated data. Model g learns
to predict mid-level features using the model training parameters mentioned
above, while f is a simple linear regression model that is optimised analytically.
At test time, however, the trained model fˆ ingests the outputs of the
trained model ĝ. The performance metrics of this scheme are detailed in Table 4.5
(using the VGG-ish backbone) and Table 4.6 (using the RF-ResNet backbone). We observe
that the cost of explanation is largest in this scheme. The loss in emotion prediction
performance can be attributed to the difference between the distribution of the
mid-level feature values in the training data (ground-truth values) and the
distribution of the mid-level feature values at test time (values predicted by ĝ).
In the sequential scheme, instead of training f using the true mid-level feature
annotations as inputs, we use the outputs of the trained model ĝ as the inputs
to f . We expect the performance of the final emotion prediction to be better in
this case because f is trained on the actual distribution of ĝ(x), resulting in a
lower cost of explanation. The performance metrics of this scheme are given in
Table 4.5 (using VGG-ish backbone) and Table 4.6 (using RF-ResNet backbone).
We see that the average emotion prediction performance indeed improves and the cost of
explanation reduces.
Finally, in the joint scheme, we use a multi-task approach to train the entire
network that, ideally, could learn an internal representation useful for both
prediction tasks, while keeping the interpretability of the linear weights. This
network learns to predict mid-level features and emotion ratings jointly, but
still predicts the emotions directly from the mid-level via a linear layer. This is
achieved by the second-to-last layer having exactly the same number of units as there
are mid-level features (7), followed by a linear output layer with 8 outputs. From
this network, we extract two outputs – one from the second-to-last layer (the "mid-level
layer"), and one from the last layer (the "emotion layer"). We compute losses for both
outputs and optimise their sum as the combined loss. The two losses can be
weighted differently using a parameter λ (in our experiments, we
found that λ = 2.0 gives the optimal performance for the two tasks). The results
of the joint training are presented in Table 4.5 (using VGG-ish backbone) and
Table 4.6 (using RF-ResNet backbone).
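The following PyTorch sketch illustrates the joint bottleneck head and the combined loss. The 512-dimensional embedding, the use of MSE losses, and the assumption that λ scales the emotion loss are our own illustrative choices and may differ from the actual implementation.

```python
import torch
from torch import nn

class JointBottleneckHead(nn.Module):
    """Sketch of the bottleneck head: a 7-unit mid-level layer followed by a
    linear 8-way emotion layer, with both outputs returned for the joint loss."""
    def __init__(self, embedding_dim=512):
        super().__init__()
        self.to_midlevel = nn.Linear(embedding_dim, 7)   # "mid-level layer"
        self.to_emotion = nn.Linear(7, 8)                # "emotion layer"

    def forward(self, embedding):
        mid = self.to_midlevel(embedding)
        return mid, self.to_emotion(mid)

head = JointBottleneckHead()
embedding = torch.randn(8, 512)                          # toy backbone output
mid_true, emo_true = torch.rand(8, 7), torch.rand(8, 8)  # toy targets

mid_pred, emo_pred = head(embedding)
lam = 2.0   # loss weighting; we assume here that lambda scales the emotion loss
loss = nn.functional.mse_loss(mid_pred, mid_true) \
     + lam * nn.functional.mse_loss(emo_pred, emo_true)
loss.backward()
```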
4.4.8 Results
First, from Table 4.4 we verify that our cross-validated performance metrics of
modelling emotion ratings using mid-level feature ratings match those of Aljanaki
and Soleymani [9]. Next, we look at the emotion prediction metrics of the
direct A2E model, and compare them to those of the bottleneck models (A2Mid2E) trained
using the three schemes: independent, sequential, and joint.
The results reflect the expected trends. In both Table 4.5 (VGG-ish backbone)
and Table 4.6 (RF-ResNet backbone), we see that the independent scheme results
in maximum cost of explanation, followed by the sequential and the joint, with
the joint training showing the minimum cost. A notable observation is that
the average correlation for emotion prediction using the mid-level annotations
(Table 4.4) is close to the average correlation for emotion prediction directly from
audio (A2E row in Table 4.5 and Table 4.6), which suggests that mid-level features
are able to capture as much information about emotion variation as a direct A2E
model with the full spectrograms as inputs. For both the VGG-ish backbone and
the RF-ResNet backbone models, the metrics for the jointly trained bottleneck
variant are very close to the corresponding direct A2E models.
Table 4.4: Modelling emotions in the Soundtracks dataset using annotated mid-level feature
values. The numbers are Pearson correlation coefficient values obtained using
a linear regression model.

                   valence  energy  tension  anger  fear   happy  sad    tender  avg
Mid2E (Aljanaki)    0.88     0.79    0.84    0.65   0.82   0.81   0.73   0.72    0.78
Mid2E (Ours)        0.88     0.80    0.85    0.68   0.83   0.82   0.75   0.73    0.79
Table 4.5: Emotion prediction performance (Pearson correlation coefficients) and cost of
explanation (CoE) on the Soundtracks dataset, using the VGG-ish backbone.

                   valence  energy  tension  anger  fear   happy  sad    tender  avg
A2E                 0.81     0.79    0.84    0.82   0.81   0.66   0.60   0.75    0.76
A2Mid2E (ind)       0.66     0.69    0.65    0.66   0.67   0.57   0.43   0.52    0.61
A2Mid2E (seq)       0.79     0.74    0.78    0.72   0.77   0.64   0.58   0.67    0.71
A2Mid2E (joint)     0.82     0.78    0.82    0.76   0.79   0.65   0.64   0.72    0.75
CoE (ind)           0.15     0.10    0.19    0.18   0.14   0.09   0.17   0.23    0.15
CoE (seq)           0.02     0.05    0.06    0.10   0.03   0.02   0.02   0.08    0.05
CoE (joint)        -0.02     0.01    0.02    0.06   0.02   0.01  -0.04   0.03    0.01
Regarding the differences between the VGG-ish and the RF-ResNet models, we
see that the RF-ResNet performs slightly better in terms of A2E performance as
well as in terms of overall costs of explanation for all three variants of A2Mid2E.
Going forward, we will use the jointly trained version of the VGG-ish model, in
order to keep the results consistent with Chowdhury et al. [45]. The slight differ-
ence in performance for the two models does not greatly affect the explanation
process.
Table 4.6: Emotion prediction performance (Pearson correlation coefficients) and cost of
explanation (CoE) on the Soundtracks dataset, using the RF-ResNet backbone.

                   valence  energy  tension  anger  fear   happy  sad    tender  avg
A2E                 0.83     0.88    0.82    0.89   0.82   0.65   0.71   0.72    0.79
A2Mid2E (ind)       0.75     0.76    0.71    0.66   0.71   0.68   0.56   0.62    0.68
A2Mid2E (seq)       0.76     0.81    0.75    0.74   0.74   0.72   0.61   0.65    0.73
A2Mid2E (joint)     0.82     0.88    0.84    0.82   0.83   0.69   0.74   0.70    0.79
CoE (ind)           0.08     0.12    0.11    0.23   0.11  -0.03   0.15   0.10    0.11
CoE (seq)           0.07     0.07    0.07    0.15   0.08  -0.07   0.10   0.07    0.04
CoE (joint)         0.01     0.00   -0.02    0.06   0.00  -0.04  -0.03   0.02    0.00
The emotion prediction is then the sum of these effects plus the
intercept (bias term). Figure 4.9 shows the effects for the joint model computed
over the held-out set.
The particular model, in this case, is the jointly trained variant of the VGG-ish
model from above. In the following, all the statistics and explanations will be
based on this model, in order for this presentation to be consistent with the results
published in Chowdhury et al. [45].
First we will show how this can be used to provide model-level explanations
and then we will explain a specific example at the song level.
Before a model is trained, the relationship between features and response variables
can be analysed using correlation analysis. The pairwise correlations between
mid-level and emotion annotations in our data are shown in Figure 4.7a. When
we compare this to the effect plots in Figure 4.9, or to the actual weights learned for
the final linear layer (Figure 4.7b), it can be seen that for some combinations (e.g.,
valence and melodiousness, happy and minorness) positive correlations go along
with positive effect values and negative correlations with negative effect values,
respectively. This is not a general rule, however, and there are several examples
(e.g., tension and dissonance, energy and melody) where it is the other way
around. The explanation for this is simple: correlations only consider one feature
in isolation, while learned feature weights (and thus effects) also depend on the
other features and must hence be interpreted in the overall context. Therefore
it is not sufficient to look at the data in order to understand what a model has
learned.
To get a better understanding, we look at each emotion separately, using the
effects plot given in Figure 4.9. In addition to the trend of the effect (positive
or negative) – which we can also read from the learned weights in Figure 4.7b
(but only because all of our features are positive) – we can also see the spread
(a) Pairwise correlation between mid-level and emotion annotations:

                   valence  energy  tension  anger  fear   happy  sad    tender
minorness           -0.39   -0.26    0.34    0.29   0.42  -0.75   0.41   -0.21
tonal_stability      0.72   -0.24   -0.67   -0.46  -0.66   0.49   0.23    0.52
dissonance          -0.83    0.41    0.79    0.60   0.76  -0.44  -0.41   -0.64
rhythm_stability     0.34    0.13   -0.24   -0.14  -0.31   0.33  -0.02    0.13
rhythm_complexity   -0.39    0.44    0.42    0.30   0.34  -0.08  -0.38   -0.34
articulation        -0.36    0.75    0.46    0.41   0.26   0.11  -0.49   -0.50
melodiousness        0.78   -0.39   -0.73   -0.55  -0.74   0.37   0.46    0.61

(b) Learned weights of the final linear layer fˆ:

                   valence  energy  tension  anger  fear   happy  sad    tender
minorness           -0.01   -0.17    0.47    0.15   0.28  -0.41   0.27   -0.01
tonal_stability      0.19   -0.26    0.21   -0.19   0.24   0.42   0.10    0.37
dissonance          -0.31    0.03    0.05    0.18   0.57  -0.46  -0.14    0.01
rhythm_stability     0.06    0.04   -0.30   -0.21  -0.24  -0.16   0.36    0.08
rhythm_complexity    0.27    0.28    0.38    0.11   0.11   0.37   0.11   -0.03
articulation        -0.04    0.44    0.24    0.25   0.01   0.18  -0.54   -0.20
melodiousness        0.48    0.10   -0.36   -0.14  -0.39  -0.01   0.01    0.47
Figure 4.7: Comparing the correlations in the Soundtracks dataset with the learned
weights of fˆ mapping the mid-level feature space to the emotion space.
of the effect, which tells us more about the actual contribution the feature can
make to the prediction, or how different combinations of features may produce
a certain prediction. Notably, we find many intuitive relationships between the
mid-level features and emotions. For example, we can see that minorness has
a large positive effect on the “sad”, “tension”, and “anger” emotions, and a large
negative effect on “happy”. Another intuitive relationship reflected in the effect
plots is that “tender” receives a large positive contribution from the “melodiousness” and
“tonal stability” features.
Effect plots also permit us to create simple example-based explanations that can be
understood by a human. The feature effects of single examples can be highlighted
in the effects plot in order to analyse them in more detail, and in the context
Figure 4.8: Emotion prediction profiles for the two example songs #153 and #322. These
two examples were chosen as they have similar emotion profiles but different
mid-level profiles. The mid-level feature effects are shown on the next figure
(Figure 4.9 on page 55) as red and blue points.
of all the other predictions. To show an interesting case we picked two songs
with similar emotion profiles but different mid-level profiles. To do so, we computed
the pairwise Euclidean distances between all songs in emotion space (dE) and in mid-level
space (dMid) separately, scaled both to the range [0, 1], and combined them as
dcomb = (1 − dE) + dMid, so that maximising dcomb favours pairs that are close in emotion
space but far apart in mid-level space. We then selected the two songs from the Soundtracks
dataset that maximised dcomb. The samples are shown in Figure 4.9 as a red
square (song #153) and a blue dot (song #322). The reader can listen to the
songs/snippets by downloading them from the Soundtracks dataset page.
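A sketch of this selection procedure is given below, with random data standing in for the real emotion and mid-level annotations, and with the combined score implementing our reading of the criterion (close in emotion space, far apart in mid-level space).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: rows are songs, columns are 8 emotion ratings / 7 mid-level ratings.
rng = np.random.default_rng(0)
emotions, midlevels = rng.random((360, 8)), rng.random((360, 7))

def scale01(d):
    return (d - d.min()) / (d.max() - d.min())

d_e = scale01(squareform(pdist(emotions)))      # pairwise Euclidean distances in [0, 1]
d_mid = scale01(squareform(pdist(midlevels)))

# Reward pairs that are close in emotion space but far apart in mid-level space.
d_comb = (1.0 - d_e) + d_mid
np.fill_diagonal(d_comb, -np.inf)               # ignore self-pairs
i, j = np.unravel_index(np.argmax(d_comb), d_comb.shape)
print(f"selected pair: songs {i} and {j}")
```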
As can be seen from Figure 4.9 and from the emotion prediction profile of the
two songs (see Figure 4.8), both songs have relatively high predicted values for
tension and energy, but apparently for different reasons: song #322 more strongly
relies on “minorness” and “articulation” for achieving its “tense” character; on
the other hand, its rhythmic stability counteracts this more strongly than in the
case of song #153. The higher score on the “energy” emotion scale for #322 seems
to be primarily due to its much more articulated character (which can clearly be
heard: 153 is a saxophone playing a chromatic, harmonically complex line, 322 is
an orchestra playing a strict, staccato passage).
most responsible for the output predicted class. Grad-CAM computes gradients
for a target class and traces these back to the input layer, producing a coarse
localisation map. LIME trains an interpretable surrogate model on perturbations
of a particular input-output pair, and computes importance values (e.g. in terms
of weights of a linear model) for the input features (which in the case of images
could be a segmentation map1 ).
The LIME method is particularly attractive to us due to its simplicity and
usefulness for our particular application. It allows us to examine ĝ in a post-hoc
fashion, without any changes in the model or training procedure. An additional
advantage of using LIME, as we will see later on in this chapter, is that it allows
for different kinds of interpretable representations of the input. This enables us
to choose the input modality that works best in terms of properties relevant for
our use-case. In this chapter, we will examine two different input modalities:
spectrogram components and source-separated audio stems.
Previous work on this topic includes SLIME (Sound-LIME), introduced by Mishra et al.
[135], which applied LIME over an input space composed of rectangular time-frequency
regions of a spectrogram to identify which regions are important for singing-voice
detection. Other recent work on explainability in music information retrieval tasks
includes interpretable music transcription using invertible networks [101],
interpretable music tagging using attention layers [178], and explainability in
recommender systems [7].
Circling back to our current theme of explaining music emotion predictions,
in this chapter, we see how to break down the explanations into two levels –
the first using mid-level features (as in Chapter 4), and the second by using
input spectrogram/audio components – with the added objective of making the
explanations listenable.
Figure 5.1: Schematic of the two-level explanation procedure. A trained bottleneck model
( fˆ ◦ ĝ) is used to obtain emotion predictions fˆ( ĝ(x)) and intermediate mid-
level feature values ĝ(x) for that prediction. The mid-level feature values are
then explained via LIME (Local Interpretable Model-agnostic Explanations)
using an interpretable decomposition of the input.
Explanations that are based on a large number of hand-crafted features tend to have
more complexity and less comprehensibility. For instance, it is hard
for a typical user to form an intuition about
MFCCs, a very popular feature in audio processing. In contrast, if the explanatory
components have the property of being clearly visualised in an image, or even
better, listened to, then we believe it lends the explanations greater usability and
trust. A broad schematic of our proposed two-level explanation method is shown
in Figure 5.1.
Given a trained audio to emotion model with mid-level intermediates (as in
Figure 4.2), we first obtain the mid-level explanations in terms of effects (see
Section 4.5.2) for an emotion prediction. We then choose a mid-level feature to be
explained further (every mid-level feature can be chosen, one at a time). Finally,
we obtain an interpretable decomposition from the input and use LIME to explain
the chosen mid-level feature in terms of the interpretable components. In our
work, we explored obtaining interpretable components from the input using two
methods:

• Spectrogram segmentation: the input spectrogram is segmented into time-frequency
regions using an image segmentation algorithm, and each segment serves as an
interpretable component.

• Source separation: the input audio is decomposed into instrument stems using
pre-trained source separation
models. We separate the input audio into five tracks – piano, bass, drums,
vocals, and other. This is a very intuitive decomposition of the input since
humans (to be precise, humans who have had prior exposure to the style of
music in question) naturally are able to perceive different characteristics of
different instruments in the music they listen to.
These decomposition methods are described in detail in Section 5.4 and Sec-
tion 5.5. Before delving into the decomposition methods, let us first understand
how LIME works.
We now give a brief description of LIME, since we use this algorithm to obtain
the second-level explanations of our two-level explanation scheme.
Let the model under analysis be denoted by ĝ : R^d → R, which takes the
input x ∈ R^d (spectrograms with d pixels) and produces the prediction ĝ(x) (mid-
level features). Our aim is to explain the model prediction. We use x′ ∈ {0, 1}^d′
to denote a binary vector representing the interpretable version of x (typically,
d′ ≪ d). How this interpretable representation is derived from the original
input depends on the application, input type, and the intended use case (the
two methods that we propose for decomposing a music audio input into its
interpretable representations are given in Section 5.4 and Section 5.5).
Now, the aim of LIME is to find a surrogate model h ∈ H, where H is a class of
potentially interpretable models and {0, 1}^d′ is the domain of h. For example, H
could be a space of linear models or decision trees – models that are interpretable
by design. However, not all models in this space will be useful candidates for
interpretability. One of the desirable qualities of an explanation is that it should be
easy to understand by human users. A linear model with hundreds or thousands
of contributing features will not offer any more insight to a human than the
original black-box model. Therefore, a measure of complexity µC(g, h, x) of an
explanation h ∈ H is introduced at this point. The complexity for a linear model
could be defined as the number of non-zero weights; for a decision tree, it could
be defined as the depth of the tree.
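The following is a generic sketch of this surrogate-fitting idea (perturb a binary representation, query the model, fit a proximity-weighted linear model); the kernel, sample count, and Ridge regressor are illustrative choices, not the exact LIME configuration used here.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_fn, reconstruct_fn, n_components,
                 n_samples=1000, kernel_width=0.25, seed=0):
    """Generic LIME-style surrogate fit (a sketch, not the exact configuration used here).

    predict_fn(x)     -> scalar prediction of the model being explained
    reconstruct_fn(z) -> input with only the components switched on in z kept
    """
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, n_components))   # binary masks z'
    Z[0] = 1                                                  # the unperturbed sample
    preds = np.array([predict_fn(reconstruct_fn(z)) for z in Z])
    # proximity kernel: masks closer to the all-ones vector receive larger weight
    dist = 1.0 - Z.mean(axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return surrogate.coef_                                    # per-component importance

# toy usage: the "model" simply sums the active components of a 10-dimensional input
coefs = lime_explain(lambda x: float(x.sum()), lambda z: z.astype(float), n_components=10)
```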
5.4 explanations via spectrogram segmentation
Let us denote a trained A2Mid2E model as FE := fˆ ◦ ĝ, which takes the spectro-
gram x ∈ Rd and produces two outputs: the mid-level feature predictions ĝ(x)
and the emotion predictions fˆ( ĝ(x)). We need LIME to find a surrogate model
ĥ ∈ H for ĝ. We restrict H to be the class of linear regression models.
We use the Felzenszwalb algorithm to segment the input x and obtain the binary
representation x′ ∈ {0, 1}^d′ of the segmented input, where d′ is the number of
segments. Note that this results in a sample identical to the original input only if
all the segments are turned on (i.e. x′ = [1, 1, ..., 1]). All other samples will be
perturbations of the original sample. In our experiments, we find that the most
satisfactory results are obtained by using the Python package skimage3 with the
parameters scale = 25 and min_size = 40. We generate 50,000 perturbations of
the input spectrogram and train a linear model h on the dataset Z generated by
the input (zi ) and output ( ĝ(zi )) pairs resulting from these perturbations.
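A minimal sketch of the segmentation and perturbation step is shown below, using skimage's felzenszwalb function with the parameters quoted above; the spectrogram is a random stand-in, and the real pipeline draws 50,000 such perturbations and passes each through ĝ.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

spec = np.random.rand(469, 149).astype(np.float32)       # stand-in for a real spectrogram

segments = felzenszwalb(spec, scale=25, min_size=40)     # integer segment label per pixel
d_prime = int(segments.max()) + 1                        # number of interpretable components
print(f"{d_prime} segments")

def apply_mask(spec, segments, z):
    """Keep only the segments switched on in the binary vector z (others are zeroed)."""
    return spec * np.isin(segments, np.flatnonzero(z))

z = np.random.default_rng(0).integers(0, 2, size=d_prime)   # one perturbation z'
perturbed = apply_mask(spec, segments, z)
```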
The number of most important segments to be visualised in the final explana-
tion is a controllable parameter that the user can choose. In our case, we select it
automatically by thresholding on the p-value-to-weight ratio. For our experiments,
we observed a ratio of 10^-6 to work well, which selects about 30 to 60 features
from a total of about 600.
The final outputs of the Audio-to-Mid explanation process are two spectrograms
that show the image segments with positive and negative weights, respectively –
in other words, those aspects of the spectrogram that most strongly contributed
to the prediction, in a positive or negative way. The other parts of the spectrogram
are hidden. That is, we find the spectrogram masks x′pos, x′neg ∈ {0, 1}^d′, with
m and n non-zero elements respectively, where m and n are obtained from
the thresholding-based selection of features mentioned above. For the
positive explanation, segments with positive weights are chosen, and vice versa for
the negative explanation.
3 https://scikit-image.org/docs/dev/api/skimage.segmentation.html
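The thresholding step could be sketched as follows, assuming the surrogate is fitted with ordinary least squares so that per-segment p-values are available; we read the criterion as selecting segments with p/|w| below the threshold, which may differ in detail from the actual implementation.

```python
import numpy as np
import statsmodels.api as sm

# Toy surrogate-training data: binary masks Z and the explained model's output for
# each perturbed spectrogram (here a noisy linear function of the masks).
rng = np.random.default_rng(0)
Z = rng.integers(0, 2, size=(5000, 600)).astype(float)
y = Z @ rng.normal(scale=0.05, size=600) + rng.normal(scale=0.01, size=5000)

ols = sm.OLS(y, sm.add_constant(Z)).fit()
weights, pvalues = ols.params[1:], ols.pvalues[1:]        # drop the intercept

ratio = pvalues / np.abs(weights)                         # p-value-to-weight ratio
positive = np.flatnonzero((ratio < 1e-6) & (weights > 0)) # segments for the positive explanation
negative = np.flatnonzero((ratio < 1e-6) & (weights < 0)) # segments for the negative explanation
```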
(a) A sample spectrogram (top left) with different segmentation algorithms applied. Visually,
Felzenszwalb appears to capture meaningful segments in the spectrogram.
(b) Detail of the Felzenszwalb segmentation on the sample spectrogram. Some musically relevant
segments are indicated.
Figure 5.3: Spectrogram segments with positive and negative effect for the prediction of
articulation. A modified spectrogram with weighted segments according to
the explanations is reconstructed for auralisation.
Figure 5.4: Explanations for “rhythm stability” and “melodiousness” on a test sample
that is constructed by concatenating two different musical pieces. The first
piece on its own had a higher predicted value of rhythm stability than
the second (0.75 vs 0.09), and the second piece on its own had a higher predicted
value of melodiousness than the first (0.56 vs −0.04).
Figure 5.5: Explanations for “articulation” and “dissonance” on a test sample that is
constructed by concatenating two different musical pieces. The second piece
had higher predicted values of both articulation and dissonance than
the first (0.36 vs 0.02 and 0.51 vs 0.33, respectively).
In the first example (Figure 5.4), the second piece is a solo classical guitar piece with held notes (noticeable as horizontal lines on the
spectrogram), and gives a high prediction for melodiousness. We observe that the
vertical spectrogram components are indeed highlighted in the explanation for
rhythm stability and the horizontal components are highlighted in the explanation
for melodiousness. In the second example, shown in Figure 5.5, the first song
is a choir performance with lower articulation and dissonance than the second
song, which is an up-tempo multi-instrument jazz piece with drums and piano.
The explanation for articulation highlights the note onsets in the second song.
However, the explanation for dissonance is not as clear in this case.
Next, the explanations can be auralised: we use the Griffin-Lim algorithm
[78] to invert the magnitude spectrograms and generate the corresponding audio
waveform, which can then be listened to. The positive/negative explanation
spectrograms can be auralised individually, but in order to improve the quality
and hear them in context, we merge them with the original spectrogram by amplifying
the spectrogram elements corresponding to the positive explanation and
attenuating the elements corresponding to the negative explanation. This gives us
our final listenable explanation for a mid-level feature prediction. Some examples
can be listened to online.
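As an illustration of the inversion step, the sketch below applies Griffin-Lim to a plain magnitude STFT using librosa; the log-filtered spectrograms used in this chapter would first need to be mapped back to a linear frequency scale, and the masks here are placeholders for the explanation segments.

```python
import numpy as np
import librosa

# Synthetic test signal and its magnitude STFT (hop of 704 samples ~ 31.3 fps at 22.05 kHz).
y = librosa.chirp(fmin=110, fmax=880, sr=22050, duration=5.0)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=704))

# Placeholder masks standing in for the positive / negative explanation segments.
mask_pos = np.zeros_like(S, dtype=bool); mask_pos[100:200, :] = True
mask_neg = np.zeros_like(S, dtype=bool); mask_neg[300:400, :] = True

# Amplify positively contributing regions, attenuate negative ones, then invert.
S_mod = S * (1.0 + 0.5 * mask_pos) * (1.0 - 0.5 * mask_neg)
audio = librosa.griffinlim(S_mod, n_iter=32, hop_length=704)
```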
As we can see, the explanation method described in this section has the po-
tential to increase our trust in the trained mid-level feature model (because the
explanations of mid-level feature predictions correspond with human expectation,
and are not arbitrary). However, there are some drawbacks. Firstly, the spectro-
gram decomposition is based on an image segmentation algorithm (as opposed to
a musically relevant decomposition) and does not always capture useful musical
features. Secondly, since the Griffin-Lim algorithm only gives an approximate
inversion of the magnitude spectrogram, the reconstruction quality is not high.
Thus, we next look at another way to decompose the input audio, which is more
musically inspired, although we lose some fine-grained detail.
5.5 explanations using sound sources

The audioLIME method [84] is based on the LIME framework described pre-
viously in Section 5.3 and extends its definition of locality for musical data by
using separated sound sources as the interpretable representation. This gives
audioLIME the ability to train on interpretable and listenable features. The key
insight of audioLIME is that interpretability with respect to audio data should
really mean listenability.
In order to generate the interpretable representation, the original input audio is
decomposed into its sources using one of the several available source separation
packages6 (we use Spleeter [88]). The source separation problem is formulated as
estimating a set of C sources, {S1, ..., SC}, when only given access to the mixture
M of which the sources are constituents. We note that this definition, as well
as audioLIME, is agnostic to the input type (waveform or spectrogram) of the
audio. We use these C estimated sources of an input audio as our interpretable
components, i.e. x′ ∈ {0, 1}^C is the interpretable input representation. In our case,
C = {piano, drums, vocals, bass, other}. As in the case of the spectrogram segments
of Section 5.4, a perturbed input z′ to the model will have some of the components
turned off. For example, z′ = (0, 1, 0, 1, 0) results in a mixture only containing
estimates of the drums and the bass tracks. The relation of this approach to
the notion of locality as used in LIME lies in the fact that samples perturbed in
this way will in general still be perceptually similar to the original input (i.e.,
recognised by a human as referring to the same audio piece). This system is
shown in Figure 5.6.
In this case, since |C| = 5, the maximum possible number of perturbations for an input is
2^5 = 32, which is small enough to use the whole set of perturbations to obtain
the dataset Z for training the surrogate linear model h using LIME.
5 https://github.com/CPJKU/audioLIME
6 https://source-separation.github.io/tutorial/intro/open src projects.html
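The perturbation step can be sketched as follows, with random arrays standing in for the separated stems; all 2^5 on/off combinations are enumerated and remixed before being passed to the mid-level model.

```python
import itertools
import numpy as np

# Hypothetical separated stems (e.g. from Spleeter), all of the same length.
rng = np.random.default_rng(0)
stems = {name: rng.standard_normal(22050 * 15) * 0.01
         for name in ["piano", "drums", "vocals", "bass", "other"]}
names = list(stems)

# All 2^5 = 32 on/off combinations of the five sources.
perturbations = list(itertools.product([0, 1], repeat=len(names)))

def remix(z):
    """Mix only the stems switched on in the binary vector z."""
    if not any(z):
        return np.zeros_like(stems[names[0]])
    return sum(stems[name] for name, on in zip(names, z) if on)

Z = np.array(perturbations)                  # interpretable inputs z'
mixes = [remix(z) for z in perturbations]    # audio passed to the mid-level model
```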
Figure 5.6: AudioLIME schematic. The input audio is deconstructed into its component
instrument stems using an off-the-shelf source separation algorithm (we
use Spleeter [88]). The components are then permuted and mixed to give
us perturbed samples in the neighbourhood of the original sample. The
perturbed samples are then passed through the LIME pipeline to train a local
interpretable surrogate model. Image adapted from Haunschmid et al. [84]
(Section 5.6). There, we are going to encounter the DEAM and PMEmo datasets,
which are emotion datasets containing audio samples and arousal/valence an-
notations. For the analysis in the present section, we train the A2Mid2E model
with the RF-ResNet backbone using either DEAM (notated as “D”), or PMEmo (notated as “P”), or both
(notated as “P+D”) for the emotion labels, and the usual Mid-level Features
dataset for the mid-level feature labels. We focus the analysis on explanations of
“rhythm stability”, as this is relevant for Section 5.6 as well. The fidelity and com-
plexity measures are computed for explanations generated on held-out samples
from either the DEAM (D), PMEmo (P), or both (P+D) datasets.
We can see in Figure 5.7a that the fidelity score (coefficient of determination,
or R2 -score) is relatively high across all combinations of train and test sets. The
median score is 0.86 across all explanations (median taken across all samples
and all mid-level features). This means that for 50% of the explanations, more
than 86% of the variation in the dependent variable (mid-level prediction) can be
predicted using the independent variables (instrument stems). Next, we look at
complexity. Figure 5.7b shows the computed complexities, compared to a random
baseline. For all train and test set combinations, the majority of explanations are
significantly less complex than the random baseline.
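The two measures can be computed as sketched below; this follows our reading of the definitions in the caption of Figure 5.7 (R2 between global and surrogate predictions for fidelity, normalised entropy of the attribution magnitudes for complexity) and may differ in detail from the exact implementation.

```python
import numpy as np
from sklearn.metrics import r2_score

def fidelity(global_preds, surrogate_preds):
    """R^2 between the explained model's predictions and the surrogate's."""
    return r2_score(global_preds, surrogate_preds)

def complexity(weights, eps=1e-12):
    """Entropy of the normalised attribution magnitudes, divided by the entropy
    of a uniform distribution over the same number of components."""
    p = np.abs(weights) / (np.abs(weights).sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return entropy / np.log(len(weights))

print(fidelity([0.3, 0.5, 0.7], [0.31, 0.48, 0.72]))
print(complexity(np.array([0.9, 0.05, 0.03, 0.02])))   # low value = concentrated, simple
```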
For auralisation, since the input components are listenable by themselves, we
do not require a separate deconvolution step to render the listenable waveforms.
One can simply play and listen to the sound source (which was extracted from
the original mix via the source separation algorithm) with the maximum weight.
The quality of the resulting audio sources depends on the source separation algorithm.
Figure 5.7: Figure 5.7a shows the computed fidelity (coefficient of determination, or R2 -
score, between the predictions by the global model and the local model) scores
for the evaluated explanations. Figure 5.7b shows the complexity (entropy of a
distribution over the feature attribution weights, normalised by the entropy of
a uniform distribution) scores for the evaluated explanations. The green region
shows the standard deviation of complexities for 1000 random “explanations”,
with the black line being the mean. The tuples indicate the train set and the
test set, e.g. “(P+D, P)” means that the model was trained on the combined
PMEmo + DEAM dataset, and tested on held-out samples from the PMEmo
dataset.
In the present case, Spleeter is able to separate the five sources (vocals, piano, drums,
bass, and other), resulting in five high-quality stems for each explanation.
5.6 model debugging: tracing back model bias to sound sources

In this section, we demonstrate how our two-level explanation method can be used to
trace a model's bias back to a particular genre, and how modifying the training data
to balance that genre’s representation leads to changes in the model predictions as
well as the model explanations in a predictable fashion. To begin with, let us
familiarise ourselves with the datasets involved.
5.6.1 Datasets
As before, we have the Mid-level Features dataset [9] from which we obtain the
training audio and annotations for the mid-level part of our A2Mid2E model. For
the emotion part, we use the DEAM and PMEmo datasets, both of which contain
more samples than the Soundtracks dataset, which was used in Chapter 4 for
training the emotion part. The DEAM and PMEmo datasets contain audio and
ratings for arousal and valence.
• DEAM: Database for Emotional Analysis in Music: The DEAM dataset [10] is a
dataset of dynamic and static valence and arousal annotations. It contains
1,802 songs (58 full-length songs and 1,744 excerpts of 45 seconds) from
a variety of Western popular music genres (rock, pop, electronic, country,
jazz, etc). In our experiments, we use the static emotion annotations, which
are continuous values between 0 and 10.
• PMEmo: Popular Music with Emotional Annotation: The PMEmo dataset [184]
consists of 794 chorus clips from three different well-known music charts.
The songs were annotated by 457 annotators with both dynamic and static valence
and arousal annotations. In our experiments, we use the static labels, which are
continuous values between 0 and 1.
5.6.2 Setup
Our experiment consists of training our models on the above two datasets. While
both datasets have arousal and valence annotations corresponding to audio clips,
the genre distributions of the two datasets are very different7 (see Figure 5.8,
which highlights the difference in the proportion of hiphop songs across the
datasets, including the Mid-level Features dataset). Our aim is to check for bias
in a particular training scenario, and use explanations to verify change in model
behaviour upon changing the training scenario. The hypothesis is that a model
trained on the DEAM dataset, whose genre distribution is very different from the
PMEmo dataset, will exhibit some kind of bias when tested on the PMEmo dataset,
and the hope is that the mid-level based and sound-source based explanations
will help us understand biases in the musical components that ultimately lead
to the bias in the emotion predictions. For this experiment, the annotations from
both emotion datasets are scaled to be between 0 and 1, so that annotations
from the two sources can be combined when required. The test set is a fixed
but randomly chosen subset of the PMEmo dataset.
7 To estimate the genre distribution of the datasets, we use a pre-trained music tagger model [149] to
predict genre tags for all the tracks in the three datasets (PMEmo, DEAM, and Mid-level Features
Dataset), since we do not have genre metadata for these datasets.
Table 5.1: Performance of explainable bottleneck models (A2Mid2E) compared with the
end-to-end counterparts (A2E) on different train/test dataset scenarios. All
models use the RF-ResNet backbone. For the A2Mid2E models, the Mid-level
Features dataset is used to train the mid-level part. R2 refers to the average
coefficient of determination, and r refers to the average Pearson correlation
coefficient. The cost of explanation (CoE) is also calculated: a positive cost
means that the bottleneck model (A2Mid2E) performs worse compared to
the end-to-end model (A2E), while a negative cost implies that the bottleneck
model performs better compared to the end-to-end model.
First, we train the A2Mid2E model with the RF-ResNet backbone on the two datasets and compare
its performance with that of the corresponding end-to-end A2E models (first
four rows in Table 5.1) to establish a baseline and verify that both the mid-level
features and emotions are being learnt. We also compute the cost of explanation
(CoE) on the R2 -scores for these training scenarios. A positive cost of explanation
means that the bottleneck model (A2Mid2E) performs worse compared to the end-
to-end model (A2E). A lower cost of explanation is desired as that indicates that
the mid-level bottleneck does not adversely affect the actual emotion prediction
performance. We observe that for both dataset scenarios – (D, D) and (P, P) – the
CoE is low. In fact, for the (D, D) case, the bottleneck model actually improves
performance, resulting in negative CoEs for both arousal and valence8.
8 This may be because training with additional data (for training the mid-level layer) improves the
overall model performance in this case, given that the DEAM dataset and the Mid-level Dataset
share a common source for a portion of the data (see Appendix a).
Figure 5.9: Fraction of hiphop songs in quantiles vs. the mean valence error of each
quantile over the PMEmo dataset (with the model trained on DEAM).
When we take the A2Mid2E model trained only on DEAM and use it to predict
arousal and valence for the entire PMEmo dataset, we observe that the error in
valence shows a pattern – overestimations of valence primarily occur in hiphop
songs, as shown in Figure 5.9. We can attribute the relatively poor performance
on hiphop songs to the discrepancy between the training and test sets
in terms of genre composition. In Figure 5.8, we can see that PMEmo has a large
percentage of hiphop songs whereas both the DEAM and Mid-level datasets have
a small percentage. Since our model has not seen enough hiphop songs during
training, it is to be expected that it does not perform well when it encounters
hiphop at test time. However, a pertinent next question is – what is it about
hiphop songs that makes our model overestimate their valence?
(a) Trained on DEAM
Figure 5.10: Relative effects of the mid-level features for valence prediction for two models
trained on different datasets (only DEAM or DEAM+PMEmo), and tested
on the same fixed subset of hiphop songs from the PMEmo dataset.
Once we have selected a mid-level feature as having the most positive relative
effect on valence, we would like to understand what musical constituents in
the input can explain that mid-level feature. To do this, we use audioLIME and
generate source based explanations for rhythm stability. The sources available in
the current implementation of audioLIME are vocals, drums, bass, piano, and
other. From the PMEmo dataset, we take the top-50 valence errors in songs tagged
as “hiphop”, and compute the explaining source for rhythm stability. We do the
same for songs tagged as “pop”. Looking at Figure 5.11a, we see that vocals are
a major contributing source for the rhythm stability predictions for the hiphop
songs. Compare this to the results for pop songs (Figure 5.11b), where drums are
(not surprisingly) the dominant contributing source of rhythm stability, although
vocals still seem to be important.
Figure 5.11: Explaining sources for rhythm stability in the songs with the top-50 valence
errors, tagged as (a) “hiphop” and (b) “pop”.
Bringing together our two types of explanations, we can reason that the high
valence predictions for hiphop songs are due to an overestimation of rhythm stability,
which, in this case, can be attributed to the vocals. While there is a lot of diversity
in the style of rapping (the form of vocal delivery predominant in hiphop), it
has been noted that rappers typically use stressed syllables and vocal onsets to
match the vocals with the underlying rhythmic pulse [4, 138]. These rhythmic
characteristics of vocal delivery (that constitute “flow”, and may add metrical
layers on top of the beat) contribute strongly to the rhythmic feel of a song. The
positive or negative emotion of hiphop songs is mostly contained in the lyrics
– the style of vocal performance does not necessarily express or correlate with
this aspect of emotion. Therefore, it makes sense that a model which has seen
few examples of hiphop during training should wrongly associate the prominent
rhythmic vocals of hiphop with high rhythm stability and, in turn, with high valence.
A model that has been trained with hiphop songs included, we expect, would
place less importance on rhythm stability for the prediction of valence, even if
the vocals might still contribute significantly to rhythm stability. Thus, we expect
the relative effect of rhythm stability for valence to decrease in such a model.
This is exactly what we observe on a model trained with the combined
PMEmo+DEAM dataset (remember that the PMEmo dataset contains a higher
proportion of hiphop songs). The average relative effects are shown in Figure 5.10b
and we can see that the relative effect of rhythm stability has decreased while
those of minorness, melody, and tonal stability have increased. Thus, the model
changed in a way that was in line with what we expected from the analysis of
our two-level explanation method.
Looking at mean overestimations (Figure 5.12) in valence for hiphop and other
genres for models trained on DEAM and PMEmo+DEAM shows that valence
overestimations of hiphop songs have decreased substantially, without changing
the valence overestimations on other genres. The overall test set performance
improves (as expected) for the model trained on the PMEmo+DEAM train set.
Figure 5.12: Mean valence overestimations for two models trained on different datasets,
but tested on the same fixed subset of the PMEmo dataset.
The model trained only on DEAM and tested on the PMEmo test set gives R2 -
scores of 0.44 for arousal and 0.25 for valence, while the model trained on the
PMEmo+DEAM combined train set gives R2 -scores of 0.64 for arousal and 0.47
for valence (see Table 5.1).
5.7 conclusion
In this chapter, we proposed a method to explain a mid-level feature model
using an interpretable decomposition of the input. In the context of this thesis,
this method is intended to be used in combination with the bottleneck-based
explanations of music emotion predictions (of Chapter 4), together making a
two-level hierarchical explanation pipeline. However, this method is generic, and
could possibly be used in a standalone fashion as well.
Considering the pipeline: first, explanations of music emotion predictions are
derived using mid-level features as explanatory variables. These mid-level predic-
tions are then further explained using components from the actual input, selected
using a post-hoc explanation method. We use LIME and audioLIME to do this.
LIME (and audioLIME) trains an interpretable surrogate model using a dataset of
perturbed samples of the current sample to be explained. This surrogate model
approximates the local behaviour of the mid-level predictor in the neighbourhood
of that particular sample. LIME is used when the input is decomposed into spec-
trogram segments, and audioLIME is used when the input audio is decomposed
into instrument stems using a music source separation algorithm.
We also demonstrated a potential application of this method as a tool for model
debugging and verifying model behaviour. The explanations provided a way to
qualitatively verify the expected change in model behaviour upon switching from an
unbalanced/skewed dataset to a more balanced one.
6
TRANSFER: MID-LEVEL FEATURES FOR PIANO MUSIC VIA DOMAIN ADAPTATION
1 Musical score, or sheet music, is a handwritten or printed form of musical notation that uses
musical symbols to indicate the pitches, rhythms, or chords of a song or instrumental musical
piece.
Mid-level features are attractive for this purpose because 1) they are effective
predictors of musical emotion (as we saw in Section 4.4, and will see again in
Chapter 7), and 2) these features are few in number and have intuitive musical
relevance, making them easy to interpret.
However, since the mid-level features are learned from data, the training data
distribution impacts the generalisation of the model. We saw an example of this in
Chapter 5 where lack of examples from the hip-hop genre led to a music emotion
model overestimating the valence of songs from this genre in the test set. What
can we expect when a mid-level model trained on the Mid-level Features dataset
[9] is used to predict the mid-level features for the solo piano music that we want
to study? How can we transfer a model from the training data domain to the solo
piano domain? We will answer these questions in the present chapter.
The rest of this chapter is broadly based on the following publication:
Figure 6.1: Genre distribution of Mid-level Dataset according to genre tags predicted
using the pre-trained tagging model “musicnn” [149].
The work presented in this chapter is motivated by the need to reduce the
aforementioned domain discrepancy with the goal of ultimately enabling transfer
of the mid-level feature model to solo piano recordings. We employ several
techniques to achieve this. First, in Section 6.4.1 we see how the receptive-field
regularised model (previously encountered in Section 4.2.3) results in improved
generalisation compared to a VGG-ish model. Next, we use a domain adaptation
(DA) approach to reduce the discrepancy between the representations learned by
the model for the source and target domains, thus improving the performance on
the target domain. As we will see in Section 6.4.2, we need to use an unsupervised
domain adaptation approach, since we do not have a large set of labelled examples
of solo piano recordings to learn a supervised transfer scheme from. Finally, we
refine our domain adaptation pipeline by introducing an “ensembled self-training”
procedure, i.e., we use an ensemble of domain-adapted teacher models to train
a student model that performs better on the target domain than any of the
individual teacher models separately.
To put our domain adaptation pipeline to the test, we apply it to transfer a
mid-level model to audio from the Con Espressione Game2 , which is a part of a
large project3 aimed at studying the elusive concept of expressivity in music with
computational and, specifically, machine learning methods [177]. The data from
this game relates to personal descriptions of perceived expressive qualities in
performances of the same pieces by different pianists. Can the mid-level features
be used to learn and model such subjective descriptions of piano performances?
In Section 6.5.1, we find that a domain-adapted mid-level feature model indeed
improves in performance at the task of modelling perceived expressivity dimensions.
2 http://con-espressione.cp.jku.at/short/
3 https://www.jku.at/en/institute-of-computational-perception/research/projects/
con-espressione/
What kind of domain shift can we expect in our mid-level feature learning
case? We will take a closer look at our mid-level feature data distribution in
Section 6.3 to answer this question, and we will see that the shift between piano
and non-piano subsets of the data can be considered as a covariate shift.
Most domain adaptation strategies typically aim at learning a model from the
labelled source data that can be generalised to a target domain by minimising
the difference between the domain distributions. Given a source domain S(x, y) =
{(x_i, y_i)}_i ∼ p_s(x, y) and a target domain T(x, y) = {(x_j, y_j)}_j ∼ p_t(x, y), the
difference between the domain distributions can be measured using one
of several metrics such as the Kullback-Leibler divergence [112], Maximum Mean
Discrepancy (MMD) [77], or the Wasserstein metric [115]. In unsupervised domain
adaptation, where the labels are not available in the target domain, y_j is unknown.
The domain adaptation literature is quite rich and several approaches have
been proposed for both supervised and unsupervised scenarios for structured as
well as unstructured data. Giving an exhaustive overview of these methods is out
of scope for this thesis, however the interested reader is encouraged to refer to
some of the following works. Ben-David et al. [22] provides a good introduction to
the theoretical aspects of domain adaptation, which can be followed by Ben-David
et al. [24] and Ben-David et al. [23] as further reading. Zhang [185] provides a
comprehensive survey of unsupervised domain adaptation for visual recognition.
Domain generalisation and domain adaptation are both discussed in detail in
Wang et al. [172]. For domain adaptation on audio-related data, there exists some
work in the area of acoustic scene classification [1, 73], speech recognition [102],
and speaker verification [44]. For now, it might be worth delving slightly deeper
into one class of methods – adversarial domain adaptation for learning invariant
representations – since we will use this method for transferring our mid-level
model to the solo piano domain.
e_T(h) ≤ e_S(h) + d(p_s, p_t) + λ*        (6.1)
where λ* = inf_{h∈H} (e_S(h) + e_T(h)) is the optimal joint error achievable on both
domains, and d(p_s, p_t) is a discrepancy measure between the source and target
distributions. In other words, the above equation states that the target risk is
bounded by three terms: the source risk (the first term), the distance between the
marginal data distributions of the source and target domains (the second term
in the bound), and the optimal joint error achievable on both domains (the third
term in the bound). The interpretation of the bound is as follows: if there exists a
hypothesis that works well on both domains, then in order to minimise the target
risk, one should choose a hypothesis that minimises the source risk while at the
same time aligning the source and target data distributions.
To measure the alignment between two domains S and T, it is crucial to
empirically compute the distance d(S, T) between them. To this end, Ben-David
et al. [23] proposed several theoretical distance measures for distributions (the
H∆H-divergence and the A-distance). The H∆H-divergence can be estimated empirically by
computing the model-induced divergence. To do this, we calculate the distance
between the induced source and target data distributions on the representation
space Z^{h_1} formed by a model h_1. This allows one to estimate the domain
divergence from unlabelled data as well.
The upshot of Equation 6.1 is that in order to minimise the risk on the target
domain, we would like to learn a parametrised feature transformation h_1 : X → Z
such that the induced source and target distributions (on Z) are close, as measured
by the H-divergence4, while at the same time ensuring that the learnt feature
4 Zhao et al. [186] suggest that learning an invariant representation and achieving a small source
error is not enough to guarantee target generalisation in a classification domain adaptation task.
They propose additional bounds that translate to sufficient and necessary conditions for the success
of adaptation. However for our present context, we continue with this method since we achieve
successful domain adaptation in practice.
transformations are useful for the actual prediction task. The transformation h_1
is called an invariant representation w.r.t. H if d_H(D_S^{h_1}, D_T^{h_1}) = 0, where D_S^{h_1} and
D_T^{h_1} are the induced source and target distributions, respectively. Depending on
the application, one may also seek to find a hypothesis (on the representation
space Z^{h_1}) that achieves a small empirical error on the source domain.
The above discussion suggests that for effective domain transfer to be achieved,
predictions must be made based on features that cannot discriminate between the
training (source) and test (target) domains. How can we force a model to learn
such domain invariant features? It turns out that inspiration can be taken from
the dynamics of training Generative Adversarial Networks (GANs) [75].
GANs are deep learning based models that learn to generate realistic samples
of the data they are trained on by learning the distribution of the training data.
They are composed of two sub-networks – a generator G and a discriminator D –
that compete with each other in a two-player game. The objective of the generator
is to map a random input (“noise”) to a point in the underlying distribution of the
data while the objective of the discriminator is to predict whether the input to it
comes from the actual training data, or is generated by the generator. This training
setup creates an adversarial minimax training dynamic with the discriminator
being the generator’s “adversary”. One consequence of this adversarial training
is that the generator is forced to align the distribution of its output to that of the
underlying training data, thus maximising the discriminator’s error.
The success of adversarial learning as a powerful method of learning and
aligning distributions has motivated researchers to apply it in the context of
domain adaptation by using it to learn invariant representations. The idea is that
adversarial training can be used to minimise the distribution discrepancy between
the source and target domains to obtain transferable and domain invariant
features.
One specific method, proposed in Ganin and Lempitsky [71] and analysed
in detail in Ganin et al. [72], introduces an additional discriminator branch
to a neural network, which is connected to the main network via a gradient
reversal layer. The reversal layer essentially reverses the training signal from
the discriminator, hence forcing the part of the network feeding into it to learn
features that confuse the discriminator, while the discriminator itself gets better
at distinguishing between the two domains. As the training progresses, the
approach promotes the emergence of features that are (i) discriminative for the
main learning task on the source domain and (ii) indiscriminate with respect to
the shift between the domains. This technique does not require any labelled data
from the target domain and hence is an unsupervised domain adaptation method.
We will describe this method in more detail in Section 6.4.2.
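A minimal PyTorch sketch of such a gradient reversal layer and discriminator branch is given below; the layer sizes and the λ scaling are illustrative and not taken from the thesis implementation.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    """Discriminator branch attached to the feature extractor via gradient reversal."""
    def __init__(self, dim=512, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, features):
        return self.classifier(GradReverse.apply(features, self.lam))

# usage sketch: domain_logits = DomainDiscriminator()(backbone_features)
# the domain loss then pushes the backbone towards domain-invariant features
```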
Figure 6.2: Distribution of annotations in the Mid-level Features Dataset for two domains:
piano and non-piano.
While we have a greatly unequal number of samples from the two subsets (only
194 in the piano set compared to 4806 in the non-piano set), it might still be
instructive to plot the label histograms for the mid-level features corresponding
to the two subsets. This will let us visualise the approximate prior shift (or label
shift, or target shift) present in the Mid-level Features dataset between piano
and non-piano instances. The label distributions are shown in Figure 6.2. The
Kullback–Leibler divergences of the piano set label distributions from the non-
piano set label distributions, KL( p||n), are also calculated. We can see that there is
minimal distribution mismatch in the ratings of melodiousness, rhythm complex-
ity, tonal stability, and minorness. Articulation, rhythm stability, and dissonance
show more apparent distribution shifts, however, there is still substantial overlap
between the distributions.
As a quick visual inspection of the input data, it is helpful to look at the spectrograms
of samples picked at random from the piano and non-piano subsets of
the Mid-level Features dataset. Figure 6.3 shows this. We can see that the spectrograms
coming from the piano subset are visibly distinct from those of the non-piano subset.
Some easily distinguishable characteristics are the absence of
high-frequency content and of “vertical” lines (typically indicative of percussive
sounds) in the solo piano music, along with more stable “horizontal” lines
(typically indicative of pitched sounds).
It might be more useful to look at the input data distributions, as we did
for label distributions. Recall that covariate shift means that the marginal input
distributions of the two domains are different (p_s(x) ≠ p_t(x)). However, note
that in our case, the dimensionality of the input space is huge (equal to the total
number of pixels in each spectrogram image). How do we plot and visualise
the marginal distributions in this case? One idea is to transform the input data
samples to obtain embeddings and then to project these embeddings on a two-
dimensional plane using a distance-based projection such as t-SNE [124]. To
do this, we train a mid-level feature model using the RF-ResNet model from
Chapter 4 on the source domain so that the model learns transformations relevant
to predicting mid-level features from spectrograms. Embeddings of size 512 from
the second-to-last layer of this model are then extracted for all the piano samples
and a random selection of the non-piano subset of the Mid-level Features dataset.
Samples taken from the MAESTRO dataset [85] (see Section a.4 in Appendix a
for a brief description of the dataset) that contains solo piano recordings are also
transformed in this way and combined with the embeddings from the Mid-level
dataset. This matrix of embeddings is then projected on a 2-D space using t-SNE,
shown in Figure 6.4.
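A sketch of this embedding-projection step is given below, with random vectors standing in for the 512-dimensional embeddings extracted from the trained mid-level model.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-ins for the embeddings of piano, non-piano, and MAESTRO samples.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 1.0, (194, 512)),     # piano subset
                        rng.normal(0.5, 1.0, (500, 512)),     # non-piano (random subset)
                        rng.normal(0.2, 1.0, (300, 512))])    # MAESTRO
labels = ["piano"] * 194 + ["non-piano"] * 500 + ["maestro"] * 300

coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)
# coords can now be scattered in 2-D, coloured by the labels list
```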
Looking at the distribution of the embeddings projected with t-SNE validates
our suspicion that solo piano music indeed forms a cluster that is distinct and
shifted from the cluster formed by non-piano samples. This points to the presence
of covariate shift between piano and non-piano samples.
Figure 6.3: Spectrograms for (a) non-piano recordings and (b) solo piano recordings,
shown for visual inspection. The difference in spectrogram features for the
two domains is apparent, for instance piano samples lack very high frequency
content, percussive “vertical” spectral elements, and vocal formants.
Methods such as importance weighting for adapting to label shift require labelled
instances [110]. Instead, we will treat our present problem as a covariate shift
problem without target labels, and explore an unsupervised domain adaptation
approach.
We bridge the domain gap in three steps. In the first step, we verify the
importance of using regularisation to improve model generalisation. Models that
generalise well to minority classes could be expected to perform well for shifted
Figure 6.4: t-SNE plot of samples drawn from the piano and non-piano subsets of the
Mid-level Features dataset. Samples drawn from the MAESTRO dataset are
also shown. All the samples are passed through a trained mid-level features
model (trained on the source domain) and the embeddings of size 512 are
extracted. The t-SNE is then applied to this matrix of embeddings.
domains (as long as the shift is not severe). Therefore, using such a model for any
subsequent domain adaptation strategies can potentially improve our results in
terms of target domain performance.
The next step is the main domain adaptation step. We choose an unsupervised
domain adaptation approach to reduce the sensitivity of our model to covariate
shift by learning a feature space invariant to domain shift.
The third step is a refinement of the unsupervised domain adaptation process to
further boost performance using a self-training method. These steps are described
in detail in the following sub-sections.
In Chapter 4, we have already seen that the receptive-field regularised ResNet
(RF-ResNet) predicts mid-level feature values better than a VGG-ish model. Here,
we describe the model in detail and compare its performance with that of the
VGG-ish model, with a view towards improving performance on the target domain.
Residual Neural Networks (or “ResNets”) were introduced by He et al. [87]
to address the vanishing gradient problem of very deep neural networks. The
vanishing gradient problem refers to how the magnitudes of error gradients di-
minish across layers when backpropagating during the training process, resulting
in progressively slower fitting of the parameters the further they are from the
output layer. ResNets overcome this problem by introducing skip connections
between layers through which the gradients can flow without getting diminished.
ResNets have shown great promise in large-scale image recognition tasks. However, previous research has shown that vanilla ResNets perform comparatively poorly in the audio domain [108], and thus, until recently, most deep networks used in the audio domain were built on the VGG-ish architecture [51]. One of the reasons for this, as identified by Koutini et al. [108], is that the
deeper a convolutional model is, the larger its receptive field on the input plane.
Receptive field refers to the total effective area of the input image that is “seen”
by the output neurons, and is affected by factors such as the filter size, stride,
and dilation of all the layers that precede the output. This is in contrast to a fully
connected architecture, where each neuron is affected by the whole input. The
maximum receptive field of a model employing convolutional layers is given by
the following equation:
$$ S_n = S_{n-1}\, s_n, \qquad RF_n = RF_{n-1} + (k_n - 1)\, S_{n-1} \qquad (6.2) $$

where s_n, k_n are the stride and kernel size, respectively, of layer n, and S_n, RF_n are the cumulative stride and receptive field, respectively, of a unit from layer n
to the network input. While this gives us the maximum receptive field, it only bounds the effective receptive field (what the output "actually sees"), which can be smaller and can be estimated using the gradient-based method in [123]. Since
ResNets can be made much deeper than VGG-ish models without compromising
the training process, this also increases their maximum receptive field, resulting
in a greater possibility of overfitting, particularly when training data is in limited
quantity.
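As a quick illustration, the recurrence of Equation (6.2) can be evaluated with a few lines of Python; the layer specification in the example below is made up and does not correspond to any particular model in this thesis.

```python
def max_receptive_field(layers):
    """Maximum receptive field of a stack of convolutional layers, following
    the recurrence of Equation (6.2).

    `layers` is a list of (kernel_size, stride) tuples, ordered input -> output.
    """
    S, RF = 1, 1  # cumulative stride and receptive field before the first layer
    for k, s in layers:
        RF = RF + (k - 1) * S  # growth contributed by this layer's kernel
        S = S * s              # update the cumulative stride
    return RF

# Example: three 3x3 convolutions, the middle one with stride 2
print(max_receptive_field([(3, 1), (3, 2), (3, 1)]))  # -> 9
```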
Therefore, as a first step towards improving out-of-domain generalisation of
mid-level feature prediction, we evaluate the performance of a Receptive-Field
Regularised ResNet (RF-ResNet). Our baseline will be the performance of the
VGG-ish model from Chapter 4 for the same task. The regularisation of the
receptive field in the RF-ResNet is achieved, following Koutini et al. [108], primarily by limiting the network depth and the kernel sizes of the later convolutional stages (cf. the RF-ResNet variants in Figure 6.7).
In this step, we move from domain generalisation to domain adaptation for our
specific target domain. Unsupervised adaptation for covariate shift has been
researched extensively in the machine learning and statistics literature. Going
through the whole host of possible methods [185] is beyond the scope of this
thesis, and thus we select the most practically suitable method for our use case,
based on preliminary trial experiments.
We adopt the reverse-gradient method introduced in Ganin and Lempitsky [71],
which achieves domain invariance by adversarially training a domain discrimi-
nator attached to the network being adapted, using a gradient reversal layer. The
procedure requires a large unlabelled dataset of the target domain in addition to
the labelled source data. The discriminator tries to learn discriminative features of
the two domains but due to the gradient reversal layer between it and the feature
extracting part of the network, the model learns to extract domain-invariant
features from the inputs.
We now provide a brief formal description of this procedure, paraphrased
from Section 3 of Ganin and Lempitsky’s paper [71]. We have input samples
x ∈ X and corresponding outputs y ∈ Y coming from either of two domains: the source domain S = {(x_i, y_i)}_i ∼ p_s(x, y) and the target domain T = {(x_j, y_j)}_j ∼ p_t(x, y), both defined on X × Y. The distributions p_s and p_t are unknown and are assumed to be similar but separated by a domain shift. Additionally, we also have the domain label d_i for each input x_i, which is a binary variable that indicates whether x_i comes from the source distribution (x_i ∼ p_s(x) if d_i = 0) or from the target distribution (x_i ∼ p_t(x) if d_i = 1). During training, the ground truth values (y_i) of samples coming from only the source dataset are known, while the domain indicator values (d_i) of samples coming from both the source and target datasets are known (by definition). During test time, we want to predict
the task values (mid-level features in our case) for the samples coming from the
target domain.
Figure 6.5: Unsupervised domain adaptation using reverse gradient method. Schematic
adapted from Ganin and Lempitsky [71].
We seek parameters θ_f of the feature extractor that maximise the loss of the domain classifier (so that the extracted features become domain-invariant), while simultaneously seeking parameters θ_d of the domain classifier that minimise the loss of the domain classifier. The system thus forms an adversarial training scheme.
Together with the task prediction loss, this procedure can be expressed as an
optimisation of the functional:
$$ E(\theta_f, \theta_y, \theta_d) \;=\; \sum_{\substack{i=1 \\ d_i = 0}}^{N} L_y^i(\theta_f, \theta_y) \;-\; \lambda \sum_{i=1}^{N} L_d^i(\theta_f, \theta_d) \qquad (6.3) $$
where L_y^i is the loss function for label prediction and L_d^i is the loss function for the domain classification, both evaluated at the i-th training example. We seek the parameters θ̂_f, θ̂_y, θ̂_d that form a saddle point of this functional:

$$ (\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f,\, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d) $$
It can be shown that, in order to achieve this, θ_f needs to be updated with the gradient ∂L_y^i/∂θ_f − λ·∂L_d^i/∂θ_f, while θ_y and θ_d are updated with their usual gradients ∂L_y^i/∂θ_y and ∂L_d^i/∂θ_d, respectively.
How does one optimise the feature extractor with the task regressor and domain
classifier derivatives combined in this fashion? This is done by introducing a
parameter-free gradient reversal layer between the domain classifier and the feature
extractor that simply multiplies the gradient flowing from the domain classifier
by a negative factor (−λ) during the backward pass, resulting in the combined
derivative as required. The gradient reversal layer can be implemented easily in
any stochastic gradient descent framework. After training, only the task predictor
branch of the network is used to generate predictions for the test dataset. For
a more in-depth analysis and this method’s relation to the H ∆H-distance [22],
the reader is encouraged to refer to Ganin and Lempitsky [71]. A diagrammatic
representation of the architecture is given in Figure 6.5.
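A gradient reversal layer is simple to implement in any autograd framework. The following is a minimal PyTorch sketch of the idea, not necessarily the exact code used in our experiments:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage inside a forward pass (feature_extractor, task_head, domain_head are modules):
#   features = feature_extractor(spectrogram)
#   task_out = task_head(features)                           # mid-level predictions
#   domain_out = domain_head(grad_reverse(features, lambd))  # domain classification
```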
In the third and final step on our path to extract mid-level features from piano
music, we further refine our (already) domain-adapted model using a self-training scheme intended to reduce the variability of the models from the previous step. To do this, we train multiple domain-adapted models using the
unsupervised DA method described in Step 2 and use these as teacher models to
assign pseudo-labels to an unlabelled piano dataset. Before the pseudo-labelling
step, we select the best performing teacher models with the validation set. Even
though the validation set contains data from the source domain, this step ensures
that models with relatively lower variance are used as teachers. This helps filter
out the particularly poorly adapted models from the previous step, which may
occur due to the inherently less stable nature of adversarial training methods [42].
After selecting a number of teacher models (in our experiments, we used four),
we label a randomly selected subset of our unlabelled dataset using predictions
aggregated by taking the average. This pseudo-labelled dataset is combined with
the original labelled source dataset to train the student model. We observed
that the performance on the test set, which comes from the target domain (the
experimental setup is explained in the next section), increased until the pseudo-
labelled dataset was about 10% of the labelled source dataset in size, after which
it saturated.
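A sketch of this pseudo-labelling step is shown below, assuming the teacher models are PyTorch modules that map a batch of spectrograms to the seven mid-level features; all names are illustrative.

```python
import torch

@torch.no_grad()
def pseudo_label(teacher_models, spectrogram_batches):
    """Assign pseudo-labels to unlabelled target-domain clips by averaging the
    mid-level predictions of the selected domain-adapted teacher models."""
    for model in teacher_models:
        model.eval()
    labelled = []
    for spec in spectrogram_batches:                                    # tensors of spectrograms
        preds = torch.stack([model(spec) for model in teacher_models])  # (n_teachers, batch, 7)
        labelled.append((spec, preds.mean(dim=0)))                      # average over teachers
    return labelled
```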
The teacher-student scheme allows the collective “knowledge” of an ensemble
of adapted networks to be distilled into a single student network. The idea of
knowledge distillation, which was originally introduced for model compression
in Hinton et al. [90], has been used for domain adaptation in a supervised setting
previously in Asami et al. [15]. The distillation process functions as a regulariser
resulting in a student model with better generalisability than any of the individual
6.5 experiments and results
• In Step 2, the recordings from the MAESTRO dataset are split into 15-second
segments and a random subset with the number of samples equal to that in
the (“non-piano”/source) training set is sampled on each run. The model
is trained using the backpropagation method of domain adaptation as
described previously. Better convergence is obtained by gradually ramping
up, over 20 epochs, the amount of reversed gradient that passes to the
feature extractor from the discriminator branch.
• In Step 3, a random subset of 500 (10% of the size of the Mid-level Features
dataset) recordings is sampled from the MAESTRO dataset to be pseudo-
labelled by the teacher models trained in Step 2. We use 4 teacher models
to pseudo-label the unlabelled piano recordings (the predictions from the
teacher models are averaged for each recording and each mid-level feature). The pseudo-labelled samples are then combined with the original source dataset, and the final student model (RF-ResNet DA+ST) is trained with this combined dataset.

[Scatter plot: solo piano test set performance (Pearson correlation coefficient) against the number of parameters (×10^6) for the VGG-ish, ResNet, and RF-ResNet variants.]
Figure 6.7: Performance of different model architectures. The tuples for the RF-ResNet models are to be read as (number of layers, kernel size of stage 2 blocks, kernel size of stage 3 blocks). We select the (12, 1, 1) variant for further domain adaptation steps.
To quantify how well the domain gap is closed in the learned representation space, we measure the discrepancy between the mean embeddings of source and target samples:

$$ D(S', T'; \phi) \;=\; \left\| \frac{1}{m} \sum_{x \in S'} \phi(x) \;-\; \frac{1}{n} \sum_{x \in T'} \phi(x) \right\|_2 \qquad (6.6) $$

where φ denotes the embedding function and S′, T′ are sets of m source and n target samples, respectively.
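For illustration, the discrepancy of Equation (6.6) can be computed directly from two matrices of embeddings (a minimal NumPy sketch; variable names are ours):

```python
import numpy as np

def embedding_discrepancy(source_emb, target_emb):
    """Euclidean distance between the mean embeddings of source and target samples (Eq. 6.6).

    source_emb: (m, d) array of embeddings phi(x) for source samples
    target_emb: (n, d) array of embeddings phi(x) for target samples
    """
    return float(np.linalg.norm(source_emb.mean(axis=0) - target_emb.mean(axis=0), ord=2))
```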
Figure 6.8: Summary of the performance of mid-level feature models on non-piano and
piano test sets, with progressive steps of domain generalisation (RF-ResNet),
adaptation (DA), and self-training refinement (DA+ST).
Figure 6.9: Performances for each mid-level feature, compared for the piano and non-
piano test sets, for a domain adapted and refined RF-ResNet model (DA+ST)
and an RF-ResNet model without any domain adaptation (no DA).
Figure 6.10 shows that domain adaptation keeps the discrepancy between the two domains low from the very beginning of training, meaning that the model learns invariant feature transformations from
early on in the training process. Computing the final discrepancies for the VGG-
ish model and after domain adaptation steps, we observe that the discrepancy
decreases for each step (Figure 6.11), justifying our three-step approach and
explaining the improvement in performance. In Figure 6.12, embeddings of piano
and non-piano samples from the representation space of a domain adapted model,
projected using t-SNE on a 2-D plane, are shown. We can see that in this case,
samples from both domains are mapped to overlapping regions. Compare this to
the case without domain adaptation, shown earlier in Figure 6.4.
Figure 6.10: Mean discrepancy between piano and non-piano domains over a training run
(averaged across multiple runs; shaded areas indicate standard deviation).
We see that for the entire duration of training, domain adaptation keeps the
discrepancy between the two domains lower than the run without domain
adaptation.
We now attempt to model the expressive character of piano recordings available from the Con Espressione Dataset [37] using mid-level features predicted by our models. In the Con Espressione
Game, participants listened to extracts from recordings of selected solo piano
pieces (by composers such as Bach, Mozart, Beethoven, Schumann, Liszt, Brahms)
by a variety of different famous pianists (for details, see [37]) and were asked to
describe, in free-text format, the expressive character of each performance. Typical
characterisations that came up were adjectives like “cold”, “playful”, “dynamic”,
“passionate”, “gentle”, “romantic”, “mechanical”, “delicate”, etc. From these
textual descriptors, the authors obtained, by statistical analysis of the occurrence
matrix of the descriptors, four underlying continuous expressive dimensions
along which the performances can be placed. These are the (numeric) target
dimensions that we wish to predict via the route of mid-level features predicted
from the audio recordings.
We investigate whether our domain-adapted models can indeed predict better
mid-level features for modelling the expressive descriptor embeddings of the
Con Espressione dataset. We do this by predicting the average mid-level features
(averaged over the temporal axis) for each performance using our models and
training a simple linear regression model on these features to fit the four em-
bedding dimensions. Even though this is a very abstract task – for a variety of reasons: the noisy and varied nature of the human descriptions; the weak nature of the numeric dimensions gained from these; the complex and subjective nature of expressive music performance – it can be seen (Table 6.1) that the features predicted using domain-adapted models give comparatively better R²-scores for
all four dimensions.
In Table 6.2, we take a closer look at Dimension 1 – the one that came out most clearly in the statistical analysis of the user responses and was characterised by descriptions ranging from "hectic" and "agitated" on one end to "calm" and "tender" on the other [37] (and also the dimension that is best predicted by our models).

Figure 6.11: Mean discrepancy between piano and non-piano domains for the different domain adaptation steps. Vertical bars indicate standard deviation over eight runs.

Table 6.2: Pearson correlation (r) for mid-level features with the first description embedding dimension, with (right) and without (left) domain adaptation. Features with p < 0.05 and |r| > 0.20 are selected. This dimension has positive loadings for words like "hectic", "irregular", and negative loadings for words like "sad", "gentle", "tender".

Looking at the individual mid-level features, we find that, first
of all, the predicted features that show a strong correlation with this dimension
do indeed make sense: one would expect articulated ways of playing (e.g., with
strong staccato) and rhythmically complex or uneven playing to be associated with the "hectic" and "agitated" end of this dimension.
Figure 6.12: Embeddings of piano and non-piano samples from the representation space
of a domain adapted model, projected using t-SNE on a 2-D plane.
6.6 conclusion
We began this chapter with the problem of domain shift between piano and non-piano music recordings in the Mid-level Features dataset. We are interested in closing this domain gap in our mid-level feature models because we ultimately want to study emotional variation in piano performances, with the hope of capturing subtle expressive differences between performances and, in the process, disentangling the musical factors underlying such emotional variation. Before applying a mid-level feature model to solo piano music, it is therefore necessary to ensure that the model works as expected on this kind of material.
7.1 the data: bach's well-tempered clavier
1 Actually, attack rate as computed by B&S is also informed by the average tempo of the performance;
thus, it is not strictly a score-only feature.
2 A Prelude is a short piece of music, typically serving as an introduction to succeeding and more
complex parts of the musical work. A fugue is a piece of music composed using two or more voices
playing a theme in counterpoint and recurring through the course of the composition.
7.1 the data: bach’s well-tempered clavier 103
Bach's Well-Tempered Clavier (WTC) is widely regarded as one of his most important works. We choose to use the first set (book I) in our experiments
to study the variation of emotional expression across different performances.
The Well-Tempered Clavier is ideally suited for systematic and controlled studies
of this kind, as it comprises a stylistically coherent set of keyboard pieces from a
particular period, evenly distributed over all keys and major/minor modes, with
a pair of two pieces (a prelude, followed by a fugue) in each of the 24 possible
keys, for a total of 48 pieces. Each piece has its own distinctive musical character,
and despite being written in a rather strict style and not meant to be played in 'romantic' ways, the lack of composer-specified tempi offers pianists (and pianists do take) considerable liberty in the choice of tempo. For example, there are pieces in
our set of recordings that one pianist plays more than twice (!) as fast as another.
This set of pieces is also not overt in its intended emotion, leading to performers
taking greater interpretative freedom in choosing ornamentation and distinctive
style [21].
For a broad set of diverse performances, we selected six recordings of the
complete WTC Book 1, by six famous and highly respected pianists, all of whom
can be considered Bach specialists to various degrees. The recordings are listed in
Table 7.1.
In accordance with B&S, we only use the first 8 bars of each recording for
the annotation process and our experiments. These were cut out manually. We
collected the arousal and valence annotations for the 288 excerpts by recruiting
participants in a listening and rating exercise. The participants of our annotation
exercise were students of a course at a university, without a specifically musical
background. Each participant heard a subset of the recordings (all 48 pieces as
played by one pianist) and was asked to rate the valence on a scale of −5 to +5
(increments of 1; a total of eleven levels) and the arousal on a scale of 0 to 100
(increments of 10; a total of eleven levels). They could listen to a recording as
Figure 7.1: Distribution of arousal and valence ratings for all 48 pieces. The spread is
across the 6 performances for each piece. For better comparison, the ratings
were standardised to zero mean and unit variance before plotting.
many times as they liked. Each recording was rated by 29 participants. In total,
we collected 8,352 valence-arousal annotation pairs.
For the purposes of the experiments to be described here, we take the mean
arousal and mean valence ratings for each recording, and these values serve as
our ground-truth values for all following experiments. The distributions (over the
6 performances) of these mean ratings for each piece are visualised as boxplots in
Figure 7.1. In Figure 7.2, the mean arousal and valence annotation for each of the
288 recordings is plotted on the arousal-valence plane.
Figure 7.2: All annotations (mean values) for the 288 recording excerpts of the 48 pieces
in Bach’s Well-Tempered Clavier Book I. We observe a Pearson correlation
coefficient between the arousal and valence annotations of 0.55.
We compare four sets of features. The first set consists of low-level audio features. The second set (score features) is computed from the music notation and not from the audio content; these features are thus independent of the performer. The third set of features are our mid-level
features, learnt using the deep architecture explained earlier in Section 4.2, and
domain adapted for piano music using the unsupervised domain adaptation and
self-training refinement of Chapter 6. To have a fair comparison with the deep
learning based mid-level features, for the fourth set of features, we use features
extracted from an identical deep model trained end-to-end on the DEAM [10]
dataset to predict arousal and valence. Details of the four feature sets follow.
These consist of hand-crafted musical features (such as onset rate, tempo, pitch
salience) as well as generic audio descriptors (such as spectral centroid, loudness).
Taken together, they reflect several musical characteristics such as tone colour,
dynamics, and rhythm. A brief description of all low-level features that we use
is given in Table 7.2. We use Essentia [32] and Librosa [129] for extracting these.
The audio is sampled at 44.1kHz and the spectra computed (when required)
with a frame size of 1024 samples and a hop size of 512 samples. Each feature is
aggregated over the entire duration of an audio clip by computing the mean and
standard deviation over all the frames of the clip (a 'clip' being an 8 bar initial segment from a recording).

Dynamic Complexity: The average absolute deviation from the global loudness level estimate in dB.
Loudness: Mean loudness of the signal computed from the signal amplitude.
Spectral Centroid: The weighted mean frequency in the signal, with frequency magnitudes as the weights.
Spectral Rolloff: The frequency under which 85% of the total energy of the spectrum is contained.
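To make this concrete, the following sketch shows how a couple of the frame-wise descriptors could be extracted and aggregated with librosa; the file name and the subset of features are illustrative, and the pipeline used for Table 7.2 (which also relies on Essentia) may differ in detail.

```python
import librosa
import numpy as np

def lowlevel_features(path):
    """Extract a few frame-wise low-level descriptors and aggregate them
    over the clip by mean and standard deviation."""
    y, sr = librosa.load(path, sr=44100)
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
    centroid = librosa.feature.spectral_centroid(S=S, sr=sr)[0]
    rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr, roll_percent=0.85)[0]
    return {
        "spectral_centroid_mean": float(centroid.mean()),
        "spectral_centroid_std": float(centroid.std()),
        "spectral_rolloff_mean": float(rolloff.mean()),
        "spectral_rolloff_std": float(rolloff.std()),
    }
```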
The following set of features was computed directly from the musical score (i.e.,
sheet music) of the pieces instead of the audio files. The unit of score time, “beat”,
is defined by the time signature of the piece (e.g., 4/4 means that there are 4 beats per bar, each a quarter note long). The score information and the audio files were
linked using automatic score-to-performance alignment. Table 7.3 describes the
score features in detail.
As described in Chapter 4, we learn the seven mid-level features from the Mid-
level Dataset [9] using a receptive-field regularised residual neural network
(RF-ResNet) model [108]. Since we intend to use this model to extract features
from solo piano recordings (a genre that is not covered by the original training
data), we use the domain-adaptive training approach described in Chapter 6 to
Inter Onset Interval: The time interval between consecutive notes per beat.
Duration: Two features describing the empirical mean and standard deviation of the notated duration per beat in the snippet.
Onset Density: The number of note onsets per beat. A chord constitutes a single onset.
Key Strength: This feature represents how well the tonality indicated by the "Mode" feature fits the snippet.
transfer the features to the domain of solo piano recordings. We use an input
audio length of 30 seconds, padded or cropped as required.
To compare the mid-level features with another deep neural network based
feature extractor, we train a model with the same architecture (RF-ResNet) and
training strategy on the DEAM dataset [10] to predict arousal and valence from
spectrogram inputs. Since this model is trained to predict arousal and valence, it
is expected to learn representations suitable for this task. As with the mid-level
model, we perform unsupervised domain adaptation for solo piano audio while
training this model as well.
Features are extracted from the penultimate layer of the model, which gives us
512 features. Since these are too many features to use for our dataset containing
only 288 data points, we perform dimensionality reduction using PCA (Principal
Component Analysis), to obtain 9 components explaining at least 98% of the vari-
ance. These 9 features are named pca_x, with x being the principal component number.
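A sketch of this reduction step with scikit-learn is given below; the random matrix merely stands in for the actual 288 × 512 embedding matrix, on which this procedure yields the 9 components mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(288, 512))   # stand-in for the extracted DEAMResNet embeddings

# Keep the smallest number of principal components explaining at least 98% of the variance.
pca = PCA(n_components=0.98, svd_solver="full")
reduced = pca.fit_transform(embeddings)
print(reduced.shape, pca.explained_variance_ratio_.sum())
```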
1. How well can each feature set fit the arousal and valence ratings? How
do these feature sets compare to the ones used by B&S? (Section 7.3.1 and
Section 7.3.2)
2. In each feature set, which features are the most important? (Section 7.3.3)
3. Which feature set best explains variation of arousal and valence between
pieces? (Section 7.3.4)
4. Which feature set best explains variation of arousal and valence between
different performances of the same piece? (Section 7.3.5)
As a starting point, we take the data used by B&S in Experiment 3 of their paper
– Gulda’s performances rated on valence and arousal. We perform regression
with our feature sets and compare with the values obtained by B&S using their
features Attack Rate, Pitch Height, and Mode. The results are summarised in
Table 7.4.
We can see that all three audio-based features (Low-level features, Mid-level
features, and DEAMResNet features) perform sufficiently well for both arousal
and valence to motivate further analysis.
(a) Regression metrics. Modelling the emotion ratings of all 288 excerpts using each feature set.
(b) Cross-validation metrics. R̃2 for different cross-validation splits, with a linear regression model
using each feature set.
Table 7.5: Evaluation using goodness of fit measures of the four feature sets on our data,
on the full dataset (a), or via cross-validation (b). Refer to the description in
Section 7.3.2.
in a fold, for a total of 6 folds), and leave-one-out (one recording is the test sample
per fold, for a total of 288 folds). This is summarised in Table 7.5b.
We see that Mid-level features show good generalisation for arousal and are
robust to different kinds of splits. They also show balanced performance between
arousal and valence for all splits. The good performance of the Score features on
the valence dimension (V), here and in the previous experiment, is mostly due
to the Mode feature; there is a substantial correlation in the annotations between
major/minor mode and positive/negative valence.
Recall from Section 3.2 that one way to measure feature importance is to use the t-value (or t-statistic) of the weights corresponding to the features. The t-statistic is defined as the estimated weight divided by its standard error:

$$ t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \qquad (7.1) $$
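These t-values (and the p-values used for filtering) can be read directly off a fitted ordinary least squares model, for instance with statsmodels; X and y below are placeholders for the standardised feature matrix and the emotion ratings.

```python
import pandas as pd
import statsmodels.api as sm

def feature_importance(X: pd.DataFrame, y):
    """Fit a linear regression and rank features by |t-value| (Eq. 7.1),
    keeping only those with p < 0.05."""
    model = sm.OLS(y, sm.add_constant(X)).fit()
    stats = pd.DataFrame({"t": model.tvalues, "p": model.pvalues}).drop(index="const")
    stats = stats[stats["p"] < 0.05]
    return stats.reindex(stats["t"].abs().sort_values(ascending=False).index)
```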
(a) T-values for Arousal (b) T-values for Valence
Figure 7.3: Feature importance for audio features using T-statistic. Only features with
p < 0.05 are shown.
We see that the top-4 and top-2 features in arousal and valence, respectively, are
mid-level features. These features also make obvious musical sense – modality is
often correlated with valence (positive or negative emotional quality), and rhythm
and articulation with arousal (intensity or energy of the emotion).
Taking a closer look at Figure 7.1, we notice (as we expect) that each piece has a
distinct emotion of itself – in terms of its own arousal and valence – which gets
modified by performers, leading to the spread we see in the arousal and valence
across performers. We can see that the spreads (variances) are not large enough
to rule out the apparent effect that “piece id”, considered as a variable, has on
the emotion of a recording. In other words, the emotion of a recording is not
independent of the piece id. The linear mixed effects model [147] is normally used
for such non-independent or multi-category data.
Mixed effect models incorporate two kinds of effects for modelling the depen-
dent variable. These effects are called fixed and random effects. Fixed effects are
those variables that represent the overall population-level trends of the data. The
fixed effect parameters of the model do not change across experiments or groups.
In contrast, random effects parameters change according to some grouping factor
(e.g. participants or items). Random effects are clusters of dependent data points
in which the component observations come from the same higher-level group
(e.g., an individual participant or item) and are included in mixed-effects models
to account for the fact that the behaviour of particular participants or items may
differ from the average trend [35].
In our case of modelling piece-wise emotion variation, the linear mixed effect
model consists of the piece id as a random effect intercept, which models part
of the residual remaining unexplained by the features we are evaluating (fixed
effects). A feature set that models piece-wise variation better than another set
would naturally have a lesser residual variation to be explained by the random
effect. We therefore look at which feature set has the least fraction of residual variance explained by the random effect of piece id, defined as:

$$ E_{\text{random}} \;=\; \frac{\mathrm{Var}_{\text{random}}}{\mathrm{Var}_{\text{random}} + \mathrm{Var}_{\text{residual}}} \qquad (7.2) $$

where Var_random is the variance of the random effect intercept and Var_residual is the variance of the residual that remains after mixed effects modelling.

Table 7.6: Explaining piece-wise variation using the four feature sets. The fraction of residual variance explained by the random effect of "piece id" (defined in Section 7.3.4) is reported here. Lower means better explained.
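A sketch of how E_random could be computed with the mixed-effects implementation in statsmodels is shown below; the DataFrame df, the feature column names, and the piece_id column are assumptions about how the data might be organised, not necessarily the exact setup used here.

```python
import statsmodels.formula.api as smf

def piece_random_effect_fraction(df, feature_cols, target="arousal"):
    """Fit a linear mixed effects model with the features as fixed effects and a
    random intercept per piece, and return E_random as defined in Eq. (7.2)."""
    formula = f"{target} ~ " + " + ".join(feature_cols)
    result = smf.mixedlm(formula, df, groups=df["piece_id"]).fit()
    var_random = float(result.cov_re.iloc[0, 0])   # variance of the random intercept
    var_residual = float(result.scale)             # residual variance
    return var_random / (var_random + var_residual)
```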
We see from Table 7.6 that the DEAMResNet emotion features best explain
piece-wise variation in arousal, followed closely by Mid-level features. For valence,
the performance of all three audio-based feature sets is close, with Mid-level features performing best; however, score features outperform them by a large margin.
This is again due to the relationship between mode and valence, and mode co-
varying tightly with the piece ids.
Table 7.7: Explaining performance-wise variation using the three audio-based feature
sets. FVU: Fraction of Variance Unexplained. r: Pearson correlation coefficient.
Figure 7.4 shows some example pieces for which the emotion dimensions predicted from mid-level features follow the ratings closely, even for performances that deviate from the average (e.g. the arousal of Gulda's performance of the Prelude in A major and Tureck's performance of the Fugue in E minor).
Figure 7.4: Some example pieces with high emotion variability between performances
which are modelled particularly well using mid-level features.
Figure 7.5 shows two examples of pieces where one performance has a vastly
different emotional character than the others – in the first example, Gould even
produces a negative valence effect (mostly through tempo and articulation) in
the E-flat major prelude, which the others play in a much more flowing fash-
ion. A challenge for any model would thus be to predict the emotion of such
idiosyncratic performances, not having seen them during training.
We therefore create a test set by picking out the outlier performance for each
piece in arousal-valence space using the elliptic envelope method [158]. This gives
us a split of 240 training and 48 test samples (the outliers). We train a linear
regression model using each of our feature sets and report the performance on
the outlier test set in Figure 7.6. We see again that Mid-level features outperform
the others, for both emotion dimensions. We take this as another piece of evidence
for the ability of the mid-level features to capture performance-specific aspects.
The surprisingly good performance of score features for valence can be attributed
to the fact that for most pieces, the outlier points are separated mostly in the
arousal dimension – the spread of valence is rather small (though not always,
see the Gould case in Figure 7.5) – and the score feature “mode” is an important
predictor of valence (see earlier sections).
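One way such an outlier split could be implemented is sketched below, applying scikit-learn's EllipticEnvelope per piece; the exact parameters used for the split reported here are not reproduced.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

def outlier_split(av, piece_ids):
    """For each piece, flag the single performance that is most atypical in
    arousal-valence space.

    av: (n_recordings, 2) array of mean arousal/valence ratings
    piece_ids: array of length n_recordings with the piece id of each recording
    """
    test_idx = []
    for pid in np.unique(piece_ids):
        idx = np.where(piece_ids == pid)[0]
        env = EllipticEnvelope(support_fraction=1.0, random_state=0).fit(av[idx])
        dists = env.mahalanobis(av[idx])       # larger distance = more outlying
        test_idx.append(idx[np.argmax(dists)])
    test_idx = np.array(test_idx)
    train_idx = np.setdiff1d(np.arange(len(av)), test_idx)
    return train_idx, test_idx
```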
In this section, we look at some details about the dataset and see how we can demonstrate mid-level based emotion prediction on this dataset.
Dataset Description
As the first demonstration of the mid-level based emotion prediction for piano
performances, we train a multiple linear regression model using the Bach WTC
dataset described previously in this chapter (with mid-level features as inputs and arousal-valence as outputs), and predict the emotion values for the performances of the Con Espressione Dataset. The predictions are then visualised on the arousal-valence plane (Figure 7.7), showing the expressive diversity in the performances.

3 See Appendix a for more details on this dataset.

Figure 7.7: Predicted emotions using mid-level feature predictions for the performances from the Con Espressione Dataset (Section 7.4.2). Performances of the same piece are marked with the same colour.
An interesting observation in this plot is Glenn Gould’s performance of the
Mozart piece. Gould is known to have very unconventional interpretations and
a distinctive style in several performances, and his performance of the Mozart
piece indeed sticks out as an outlier in the arousal-valence space.
Whether Gould's performance is really "rhythmically complex" in any objective musical sense (and more so than the other performances of this piece in the dataset) is debatable. The human annotations in the mid-level feature dataset should not be expected to reflect musicological
concepts, but rather general impressions that may well be influenced by other
factors than the one implied by the feature name. This also holds, in particular,
for the “minorness” feature – see below.
According to Figure 7.8, the high valence of Gould’s performance can be
attributed to low minorness (resulting in high positive effect on valence, since
minorness has a negative mean effect on valence). Again, it is important to
remember that the minorness feature does not purely relate to the mode of the
audio recording, but also other factors that are perceived to be associated with
"minor sounding" songs. In this sense, Gould's performance might be perceived as less "minor" than the other performances because of the abnormally high tempo at which Gould plays this piece, compared to the others.
(b) Effects, centred on the mean across the five performances for each mid-level feature
Figure 7.8: Effects (a) of mid-level features on prediction of valence and arousal for the
five performances of Mozart’s Piano Sonata No. 16 (K545), 2nd movement.
The diamonds represent the effects of the five performers. The box plots
represent the distribution of the effects for the training dataset (Bach WTC).
On the centred plot (b), the spread of the effects for the different performances
is more clearly visible.
Table 7.8: For five versions of Mozart’s Piano Sonata No. 16 (K545), 2nd movement, the
performance descriptor words are predicted by fitting mid-level features to
the PCA dimensions of the occurrence matrix of the Con Espressione dataset.
These are compared with actual answers that participants entered in response
to the respective performances.
Using the PCA dimensions described above as the dependent variables, and
mid-level feature predictions as independent variables, we can train a simple
regression model mapping mid-level feature values to the description space. We
then use this model to predict the positions of the Mozart performances in the
PCA dimension space, and find the nearest words in the space from the dataset
as the descriptive words predicted for the performances. A visualisation of all
the words in the training dataset (199 words derived from a total of ∼1500 after filtering for minimum number of occurrences and entropy across performances, as described in Cancino-Chacón et al. [37]), mapped onto the first two PCA
dimensions is shown in Figure 7.9. Also plotted are the predicted positions (solid
coloured diamonds) and the ground truth positions (lighter, smaller diamonds –
obtained by computing the centroid of all ground truth word positions for each
performance) of the Mozart performances. The nearest words to the predicted
points are also highlighted4 . In Table 7.8 these predicted descriptive words are
compared to some of the answers entered by participants of the CEG in response
to the respective performances (for this table, the human answers were selected
randomly, but single-word answers were excluded).
4 In Figure 7.9, the visualised positions of the words may be shifted slightly from their actual
positions on the plane, in order to avoid too much overlapped text. We use the Python package
adjustText (https://github.com/Phlya/adjustText) to do this.
Figure 7.9: Visualisation of the performance descriptor words present in the Con Espressione Dataset, projected on the first two PCA dimensions of the
word occurrence matrix. The solid coloured diamonds are predicted positions in this space of the five performances for Mozart’s Piano
Sonata No. 16 (K545), 2nd movement. The words nearest to these are highlighted in same colours. The smaller, fainter diamonds are the
ground-truth positions, meaning the centroids across the descriptor words for a performance.
8.1 augmenting the mid-level feature space
1. Melodiousness
2. Articulation
3. Rhythmic complexity
4. Rhythmic stability
5. Dissonance
6. Tonal stability
7. Modality (Minorness)
While we have shown in the earlier chapters of this thesis that this set of
seven mid-level features captures variation in music emotion surprisingly well,
two obvious features important for emotional expression – perceptual speed and
dynamics – are conspicuously missing from it [34]. Musical cues such as attack
rate and dynamics have been shown in previous experiments to contribute
significantly to emotional expression [53]. Our hypothesis is that augmenting
the mid-level feature space with these two additional features should improve
explainable emotion modelling significantly. In this section, we demonstrate the
efficacy of adding (analogues of) perceptual speed and dynamics to improve
modelling of musical emotion. These two features will be modelled in a more
direct way, based on our musical intuition rather than on empirical user perception
data.
• Perceptual Speed
Recall from Chapter 4, Section 4.1.2 that Friberg et al. [64] used the following
definition of perceptual speed in their work on perceptual features:
“Indicates the general speed of the music disregarding any deeper
analysis such as the tempo, and is easy for both musicians and
non-musicians to relate to.”
Note the distinction between perceptual speed and tempo. While tempo is
typically computed as the occurrence rate of the most prominent metrical
level (beat), perceptual speed is influenced by lower level or higher level
metrical levels as well – factors such as note density (onsets per second)
seem to be important [56]. Madison and Paulin [126] find that there is a
non-linear relationship between rated perceptual speed and tempo. In actual
music (not a metronome track), a high tempo seems to be counteracted
by a lower relative event density and vice versa, resulting in a sigmoid
shape on the perceptual speed vs tempo plot, with shallower slopes for the
extreme tempo ranges compared to the middle range. They find that event
density (number of sound events per unit time) contributes substantially to
perceptual speed.
• Perceived Dynamics
Perceived dynamics refers to the perceived force or energy expended by
musicians on their instruments while performing, as inferred by a listener.
Going back again to Section 4.1.2, let us recall how dynamics was defined:
“Indicates the played dynamic level disregarding listening volume.
It is presumably related to the estimated effort of the player.”
Note the distinction between dynamics and volume – dynamics does not
only refer to the sound intensity level. As Elowsson and Friberg [57] point
out, loudness and timbre are closely related to perceived dynamics, and
spectral properties of most musical instruments change in a complex way
with performed dynamics.
While learning perceptual speed in a data-driven fashion like the other mid-level
features would be ideal, most works (such as [56] and [64]) on perceptual speed
have used small, privately collected datasets for their experiments. Training large-
scale models on such small datasets is not feasible; moreover, privately collected
datasets are often not available. Therefore, taking advantage of the observation that perceptual speed is significantly correlated with event density [56, 126], we approximate perceptual speed by computing onset density.
‘Onset’ refers to the beginning of a musical note or other sonic event. It is
related to (but different from) the concept of transient: all musical notes have an
onset, but do not necessarily include an initial transient. Onset detection is the
task of identifying and extracting onsets from audio. Onset density (analogous to
event density) is simply the number of onsets per unit time.
We experiment with two different onset density extraction methods. The first
is the SuperFlux method of onset detection [30], which extracts an onset strength
curve by computing the frame-wise difference of the magnitude spectrogram
(spectral flux) followed by a vibrato suppression stage. The onsets are detected
by applying a peak picking function on the onset strength curve. This is a purely
signal processing based method.
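As a rough illustration, an onset-density feature can be computed with librosa's spectral-flux-based onset detector standing in for SuperFlux; the parameters are defaults and not necessarily those used in our experiments.

```python
import librosa

def onset_density(path, sr=44100):
    """Approximate perceptual speed as onset density (onsets per second)."""
    y, sr = librosa.load(path, sr=sr)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")  # onset times in seconds
    return len(onsets) / librosa.get_duration(y=y, sr=sr)
```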
The second method is applicable for our specific context of solo piano music.
The idea is to use a piano transcription algorithm to predict the played notes
and the note onset times from an audio recording, and to obtain the onset curve
by summing over the pitch dimension; Figure 8.1 illustrates this process.
[Figure 8.1: transcription-based onset extraction. The audio is transcribed (RNNPianoNoteProcessor), the note activations are summed over the pitch dimension to obtain an onset curve, and peak picking on this curve yields the onsets.]
For perceived dynamics, again, we do not have any annotated public dataset, to
the best of our knowledge. Elowsson and Friberg [57] use a pipeline of a large
number of handcrafted low-level features to approximate performed dynamics.
In our case of solo piano music, we find that the RMS (Root-Mean-Squared)
amplitude of the audio signal is a good candidate feature that is able to capture a
significant variation in emotion, and is easy to understand (from an interpretabil-
ity perspective). We use this feature as an approximation to performed dynamics
(estimated effort of the player) for solo piano music because the relationship
between note velocity (the force with which a keyboard key is pressed) and
loudness can be assumed to be monotonic for the piano [5].
We use Librosa's RMS function [129] to compute this feature. For an input audio signal x, the RMS amplitude is given as

$$ \mathrm{RMS}_k = \sqrt{\operatorname{mean}\!\big(w_\tau(x)_k^2\big)}, \qquad k = 1 \ldots N, $$

where w_τ(·)_k is a rectangular windowing function which partitions the input sequence into frames and returns the k-th frame of length τ, and N is the total number of frames.
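A corresponding sketch for the dynamics proxy, using librosa's RMS function with illustrative frame and hop sizes:

```python
import librosa
import numpy as np

def mean_rms(path, frame_length=1024, hop_length=512):
    """Clip-level RMS amplitude used as a proxy for performed dynamics."""
    y, _ = librosa.load(path, sr=44100)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    return float(np.mean(rms))
```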
Table 8.1: Performance of the different feature sets on modelling arousal and valence of
the Bach WTC Book 1 dataset.
For our present case, we consider the two newly added features (onset density
and RMS amplitude) as a part of the “mid-level feature set”. While technically
these two features are computed using low-level algorithms instead of being
learned from data, we still consider them under the ambit of “mid-level” for the
purposes of this chapter, since we treat them as approximations of perceived
speed and perceived dynamics. To distinguish between the original set of seven
mid-level features, and the new augmented feature set of nine features, we will
call them (7)-mid-level features and (9)-mid-level features, respectively, in this
chapter.
To evaluate the effect of adding the additional features to our original set of
mid-level features, we use the Bach Well-Tempered Clavier Book 1 (WTC) dataset
from Chapter 7. Remember that the dataset contains recordings of the first eight
bars of all 48 pieces of the WTC Book 1 performed by 6 different pianists, for
a total of 288 recordings. We perform the regression based evaluation as done
previously in Section 7.3.2. First, we predict the original (7)-mid-level features
for the 288 Bach recordings using a domain adapted RF-ResNet model, as was
done in Section 7.3.2. We then compute the mean onset densities and mean RMS
amplitudes for each of the recordings using the approximations mentioned above,
giving us the (9)-mid-level feature set for the Bach data. The effectiveness of
this feature set in modelling emotion is evaluated by fitting a multiple linear
regression model with nine inputs and two outputs (for arousal and valence). We
look at the adjusted R²-score as the metric. This is compared to the case where
only the original seven (7)-mid-level features are used, and where only the two
newly added features are used. The results are tabulated in Table 8.1.
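The kind of evaluation reported in Table 8.1 can be sketched as follows; X (a 288 × 9 feature matrix) and Y (a 288 × 2 arousal/valence matrix) are placeholders for the actual data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a linear model with n_features predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

def evaluate_feature_set(X, Y):
    """Fit one multiple linear regression with two outputs (arousal, valence)
    and report the adjusted R^2 for each output."""
    reg = LinearRegression().fit(X, Y)
    r2 = r2_score(Y, reg.predict(X), multioutput="raw_values")
    n, p = X.shape
    return {name: adjusted_r2(r, n, p) for name, r in zip(["arousal", "valence"], r2)}
```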
Firstly, we note that while onset density and RMS amplitude alone cannot
predict valence to a good extent, using just the onset density and the RMS
amplitude for arousal prediction gives a better fit than the original (7)-mid-level
feature set. For both arousal and valence, using the combined (9)-mid-level feature
set gives the best result.
We also look at the absolute value of the t-statistic, shown in Figure 8.2, to
evaluate the relative feature importance values. We see that for arousal, onset
density is the most important feature, followed by RMS amplitude. Among the
original (7)-mid-level features, the top-3 are rhythm stability, rhythm complexity,
and melodiousness, which were also the top-3 features in Section 7.3.2.
(a) T-values for Arousal (b) T-values for Valence
Figure 8.2: Feature importance for the original set of seven mid-level features, and the newly added features in this chapter.

8.2 decoding and visualising intended emotion
For this demonstration, we use a video recording of Jacob Collier performing "Danny Boy" with several different intended emotions (cf. Figure 8.4). Since Jacob plays and modifies the song continuously in real time, we wish to
predict the emotions in a dynamic fashion. This will let us visualise how the
model reacts to changes in playing style as the intended emotion changes. In
other words, we wish to perform dynamic emotion recognition, instead of static
emotion recognition. For dynamic emotion recognition, the audio is split into
windows and emotions are predicted for each window. The smaller the window
size, the quicker the model outputs react to changes in the performance.
We will build our emotion model in a manner similar to the model that we
used for analysis in Section 8.1. To recall the full pipeline, the steps are:

1. Train a domain-adapted mid-level feature model (RF-ResNet) to predict the seven mid-level features from input spectrograms, as in Chapter 6.

2. Use this model to predict (the original seven) mid-level feature values for
the 288 recordings in the Bach WTC Book 1 dataset.
3. Compute onset density and RMS amplitude for the 288 recordings in the
Bach WTC Book 1 dataset.
4. Train a mid-level to emotion model on the Bach WTC Book 1 dataset, using
(9)-mid-level features as inputs and the arousal/valence annotations as
outputs (we use Multiple Linear Regression (MLR) with nine inputs and
two outputs).
However, recall that the mid-level models in previous chapters were trained
on input spectrograms of length 15 seconds, which is too long of a window
for the model to react to quick changes in emotion in our present audio. We
therefore experiment with training mid-level feature models with smaller input
audio lengths. The training performance with different input lengths is shown
in Figure 8.3. We choose a 5-second window for our final model as a reasonable
trade-off between prediction accuracy and window size. This model is then used
for step 1 above. The rest of the steps remain the same.
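A minimal sketch of the resulting dynamic prediction loop is given below; midlevel_model and emotion_regressor stand for the trained 5-second mid-level model (plus the onset-density and RMS computations) and the mid-level-to-emotion regressor, respectively.

```python
import numpy as np

def dynamic_emotion_trace(y, sr, midlevel_model, emotion_regressor, win_s=5.0, hop_s=1.0):
    """Slide a window over the audio, predict the (9)-mid-level features for each
    window, and map them to (arousal, valence) with the trained linear model."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    trace = []
    for start in range(0, max(1, len(y) - win + 1), hop):
        feats = np.asarray(midlevel_model(y[start:start + win]))    # shape (9,)
        trace.append(emotion_regressor.predict(feats[None, :])[0])  # (arousal, valence)
    return np.array(trace)
```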
Figure 8.3: Mid-level feature model performance with respect to input audio length.
Six frames from the video together with the predictions are shown in Figure 8.5
(and continued in Figure 8.6). We can see that the predicted emotions match
closely with the intended emotions (“Jacob’s Emotions”). Note that the frames
shown here are captured at times when the predicted emotion and intended
emotion come closest visually (for those emotions that are present on Russell’s
circumplex, such as “happy”, “sad”, “anger” and “serene”). The full trace of the
prediction point is shown in Figure 8.4b.
We also obtain static emotions – the audio sections corresponding to each
of the seven emotions are cut out and used as individual input audio files for
the model. In this case, we use our standard input length (15-second) mid-level
feature model, with the input audio being looped if it is less than 15 seconds,
and the predictions for successive windows with a 5-second hop averaged if it is
more than 15 seconds. These static emotion predictions are shown on Figure 8.4a,
where the predicted points are annotated with the intended emotion for each.
The visualisation experiment presented in this section serves as an interesting
proof-of-concept for further, more rigorous, experiments on decoding intended
emotions using computer systems. We can see that a simple linear regression
model, with a handful of mid-level features as inputs (7 original plus 2 new),
trained on a small dataset of 288 Bach WTC examples, is able to predict the
intended emotions for a markedly different set of performances in a fairly satis-
factory manner. This points to the robustness of the (9)-mid-level features and of
our (7)-mid-level feature model, and to the impressive capacity of these features
to reflect encoded music emotion. The full demonstration video is available online.
Figure 8.4: Static and dynamic emotion prediction for Jacob Collier’s performance of
Danny Boy according to seven emotions: “neutral”, “happy”, “sad”, “angry”,
“mysterious”, “triumphant”, and “serene”.
Figure 8.5: Screenshots during different times of Jacob Collier’s performance video,
overlaid with the corresponding predicted emotions.
Figure 8.6: Screenshots during different times of Jacob Collier’s performance video,
overlaid with the corresponding predicted emotions.
9
CONCLUSION AND FUTURE WORK
9.1 conclusion
In this thesis, we set out with the goal of investigating the problem of music
emotion recognition (from audio recordings) through the lens of interpretability
(or explainability) by using perceptually relevant musical features. To this end,
we first proposed a bottleneck model that is trained using perceptual mid-level
features and music emotion labels (Section 4.2). We trained a deep model to
predict mid-level features – “melodiousness”, “articulation”, “rhythm stability”,
“rhythm complexity”, “dissonance”, “tonal stability”, and “minorness” – as an
intermediate layer (the bottleneck), from which the final emotion values were
then predicted. The mid-level features as well as the emotions were learned
from human annotated datasets. The mid-level to emotion model was made
explainable by virtue of the interpretability of the features themselves, and
by using a linear model that predicted emotion from mid-level features. We
explained the predictions in terms of the learned weights, as well as the effect
of each mid-level feature on the output value of a particular emotion prediction
(Section 4.5).
Next, we introduced two approaches to explain the part of the model between the audio (spectrogram) inputs and the mid-level bottleneck layer. The first was to explain mid-level feature predictions by training a
surrogate linear model using LIME (Local Interpretable Model-agnostic Expla-
nations) and using this to indicate important patches in the input spectrogram
(Section 5.4), which could also be transformed back to (low-quality) audio. The
second approach used audioLIME to explain mid-level predictions using an
interpretable decomposition of the input audio into its musical sources (the audio
track is split into five instrument components: vocals, piano, bass, drums, and
other) (Section 5.5).
Equipped with mid-level features for predicting and explaining music emotion,
we then turned to modelling emotional variation in piano performances. In order
to maintain model validity for solo piano music, we proposed an unsupervised
domain adaptation and refinement pipeline to transfer mid-level feature models
to the piano domain. We used the well-known “unsupervised domain adaptation
using backpropagation” approach to learn domain invariant feature spaces, and
introduced a self-training based refinement stage to further improve performance
on piano music (Chapter 6).
APPENDIX
a
DATASETS USED IN THIS THESIS
Several datasets have been used across the different chapters of this thesis. The
present appendix serves as a quick reference for all those datasets at one place.
A summary of the datasets is given below (Table a.1), and details about each dataset are provided in the following sections, in the order of their appearance in the thesis.
Dataset | Audio | Annotations used | Used in
Mid-level Features [9] | Yes | Ratings for 7 mid-level features | Chapters 4, 5, 6, 7 and 8
Soundtracks [54] | Yes | Ratings for 8 emotions | Chapter 4
PMEmo [184] | Yes | Ratings for arousal and valence | Chapter 5
MAESTRO [85] | Yes | None | Chapter 6
DEAM [10] | Yes | Ratings for arousal and valence | Chapter 7
Con Espressione [37] | Yes* | Free-text descriptions | Chapters 6, 7 and 8
Table a.1: Summary of the datasets used in this thesis. The annotations column mentions
which type of annotations were used in the thesis for each dataset. *The audio
files for the Con Espressione dataset are not distributed in the released version
of the dataset due to copyright reasons.
Music Emotion dataset [143]. No more than five songs from the same artist
were allowed to be present in the dataset. The ratings for the seven mid-level
perceptual features, as defined in [9], were collected through crowd-sourcing.
To help the human participants interpret the mid-level concepts, the mid-level
features were described in the form of questions, as reproduced below (the ratings
were collected in a pairwise comparison scenario).
3. Rhythm Stability: Imagine marching along with the music. Which is easier
to march along with?
5. Dissonance: Which excerpt has noisier timbre? Has more dissonant inter-
vals (tritones, seconds, etc.)?
6. Tonal Stability: Where is it easier to determine the tonic and key? In which
excerpt are there more modulations?
To obtain ratings, Aljanaki and Soleymani [9] first used pairwise comparisons
to derive rankings for a small subset of the dataset; these rankings were then
used to construct an absolute scale on which the whole dataset was annotated. The
annotators were required to have some musical education and were selected
based on passing a musical test. The ratings range from 1 to 10.
Section 2.3.2 “Schimmack and Grob model of emotion”). All the excerpts were
rated by 116 non-musicians on all eight perceived emotions (anger, fear, sadness,
happiness, tenderness, valence, energy, and tension) on a scale of 1-9.
1 https://y.qq.com/n/yqq/toplist/108.html
2 https://y.qq.com/n/yqq/toplist/123.html
3 https://y.qq.com/n/yqq/toplist/107.html
a.4 the maestro dataset
Performances: 1184
Compositions (approx.): 430
Total audio duration: 172.3 hours
URL: https://magenta.tensorflow.org/datasets/maestro
4 https://piano-e-competition.com/
a.6 the con espressione dataset
Piece | Performances | Performers
Bach Prelude No.1 in C, BWV 846 (WTC I) | 7 | Gieseking, Gould, Grimaud, Kempff, Richter, Stadtfeld, MIDI
Mozart Piano Sonata K.545 C major, 2nd mvt. | 5 | Gould, Gulda, Pires, Uchida, MIDI
Beethoven Piano Sonata Op.27 No.2 C# minor, 1st mvt. | 6 | Casadesus, Lazić, Lim, Gulda, Schiff, Schirmer
Schumann Arabeske Op.18 C major (excerpt 1) | 4 | Rubinstein, Schiff, Vorraber, Horowitz
Schumann Arabeske Op.18 C major (excerpt 2) | 4 | Rubinstein, Schiff, Vorraber, Horowitz
Schumann Kreisleriana Op.16; 3. Sehr aufgeregt (excerpt 1) | 5 | Argerich, Brendel, Horowitz, Vogt, Vorraber
Schumann Kreisleriana Op.16; 3. Sehr aufgeregt (excerpt 2) | 5 | Argerich, Brendel, Horowitz, Vogt, Vorraber
Liszt Bagatelle sans tonalité, S.216a | 4 | Bavouzet, Brendel, Katsaris, Gardon
Brahms 4 Klavierstücke Op.119, 2. Intermezzo E minor | 5 | Angelich, Ax, Serkin, Kempff, Vogt
Table a.2: Performances used in the Con Espressione Game, as described in Cancino-Chacón et al. [37].
Number of excerpts: 45
Total audio duration: 1.0 hour
Number of responses: 1515
Total terms: 3166
Unique terms: 1415
URL: https://cpjku.github.io/con_espressione_game_ismir2020/
5 con-espressione.cp.jku.at
BIBLIOGRAPHY
[1] Jakob Abeßer and Meinard Müller. “Towards Audio Domain Adaptation
for Acoustic Scene Classification using Disentanglement Learning.” In:
arXiv preprint arXiv:2110.13586 (2021).
[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal
Fua, and Sabine Süsstrunk. “SLIC superpixels compared to state-of-the-art
superpixel methods.” In: IEEE transactions on pattern analysis and machine
intelligence 34.11 (2012), pp. 2274–2282.
[3] Amina Adadi and Mohammed Berrada. “Peeking Inside the Black-Box:
A Survey on Explainable Artificial Intelligence (XAI).” In: IEEE Access 6
(2018), pp. 52138–52160.
[4] Kyle Adams. “On the metrical techniques of flow in rap music.” In: Music
Theory Online 15.5 (2009).
[5] Alexander Adli, Zensho Nakao, and Yanunori Nagata. “Calculating the
expected sound intensity level of solo piano sound in MIDI file.” In: SCIS
& ISIS 2006. Japan Society for Fuzzy Theory and Intelligent
Informatics. 2006, pp. 731–736.
[6] Darius Afchar, Alessandro B Melchiorre, Markus Schedl, Romain Hen-
nequin, Elena V Epure, and Manuel Moussallam. “Explainability in Music
Recommender Systems.” In: arXiv preprint arXiv:2201.10528 (2022).
[7] Darius Afchar, Alessandro B. Melchiorre, Markus Schedl, Romain Hen-
nequin, Elena V. Epure, and Manuel Moussallam. “Explainability in Music
Recommender Systems.” In: ArXiv abs/2201.10528 (2022).
[8] Jessica Akkermans, Renee Schapiro, Daniel Müllensiefen, Kelly Jakubowski,
Daniel Shanahan, David Baker, Veronika Busch, Kai Lothwesen, Paul
Elvers, Timo Fischinger, et al. “Decoding emotions in expressive music
performances: A multi-lab replication and extension study.” In: Cognition
and Emotion 33.6 (2019), pp. 1099–1118.
[9] Anna Aljanaki and Mohammad Soleymani. “A Data-driven Approach to
Mid-level Perceptual Musical Feature Modeling.” In: Proceedings of the 19th
International Society for Music Information Retrieval Conference, ISMIR 2018,
Paris, France. 2018, pp. 615–621.
[10] Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. “Developing
a Benchmark for Emotional Analysis of Music.” In: PloS one 12.3 (2017).
[11] Pedro Álvarez, A Guiu, José Ramón Beltrán, J Garcı́a de Quirós, and
Sandra Baldassarri. “DJ-Running: An Emotion-based System for Recom-
mending Spotify Songs to Runners.” In: icSPORTS. 2019, pp. 55–63.
[12] André Araujo, Wade Norris, and Jack Sim. Computing Receptive Fields
of Convolutional Neural Networks. 2019. doi: 10.23915/distill.00021. url:
https://distill.pub/2019/computing-receptive-fields.
[13] Hussain-Abdulah Arjmand, Jesper Hohagen, Bryan Paton, and Nikki S
Rickard. “Emotional responses to music: Shifts in frontal brain asymmetry
mark periods of musical change.” In: Frontiers in psychology 8 (2017),
p. 2044.
[14] Alejandro Barredo Arrieta et al. “Explainable Artificial Intelligence (XAI):
Concepts, Taxonomies, Opportunities and Challenges toward Responsible
AI.” In: ArXiv abs/1910.10045 (2020).
[15] Taichi Asami, Ryo Masumura, Yoshikazu Yamaguchi, Hirokazu Masataki,
and Yushi Aono. “Domain Adaptation of DNN Acoustic Models using
Knowledge Distillation.” In: 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5185–5189.
[16] Jean-Julien Aucouturier, Francois Pachet, et al. “Music similarity measures:
What’s the use?” In: ISMIR. 2002, pp. 13–17.
[17] Laura-Lee Balkwill, William Forde Thompson, and Rie Matsunaga. “Recognition of emotion in Japanese, Western, and Hindustani music by Japanese listeners.” In: Japanese Psychological Research 46.4 (2004), pp. 337–349.
[18] Eugene Y Bann and Joanna J Bryson. “The conceptualisation of emotion
qualia: Semantic clustering of emotional tweets.” In: Computational models
of cognitive processes: Proceedings of the 13th neural computation and psychology
workshop. World Scientific. 2014, pp. 249–263.
[19] Mathieu Barthet, György Fazekas, and Mark Sandler. “Music emotion
recognition: From content-to context-based models.” In: International sym-
posium on computer music modeling and retrieval. Springer. 2012, pp. 228–
252.
[20] Aimee Battcock and Michael Schutz. “Acoustically expressing affect.” In:
Music Perception: An Interdisciplinary Journal 37.1 (2019), pp. 66–91.
[21] Aimee Battcock and Michael Schutz. “Individualized interpretation: Ex-
ploring structural and interpretive effects on evaluations of emotional
content in Bach’s Well Tempered Clavier.” In: Journal of New Music Research
50.5 (2021), pp. 447–468.
[22] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando
Pereira, and Jennifer Wortman Vaughan. “A theory of learning from
different domains.” In: Machine learning 79.1 (2010), pp. 151–175.
[23] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. “Anal-
ysis of representations for domain adaptation.” In: Advances in neural
information processing systems 19 (2006).
[24] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. “Impossibility theo-
rems for domain adaptation.” In: Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics. JMLR Workshop and Con-
ference Proceedings. 2010, pp. 129–136.
[49] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun
Ting, Karthikeyan Shanmugam, and Payel Das. “Explanations based on
the missing: Towards contrastive explanations with pertinent negatives.”
In: Advances in neural information processing systems 31 (2018).
[50] Karen van Dijk. AI Song Contest. 2020. url: https://www.vprobroadcast.
com/titles/ai-songcontest.html (visited on 09/29/2022).
[51] Matthias Dorfer and Gerhard Widmer. “Training general-purpose audio
tagging networks with noisy labels and iterative self-verification.” In:
Proceedings of the Detection and Classification of Acoustic Scenes and Events
2018 Workshop (DCASE2018). 2018, pp. 178–182.
[52] Finale Doshi-Velez and Been Kim. “Towards a rigorous science of inter-
pretable machine learning.” In: arXiv preprint arXiv:1702.08608 (2017).
[53] Tuomas Eerola, Anders Friberg, and Roberto Bresin. “Emotional expres-
sion in music: contribution, linearity, and additivity of primary musical
cues.” In: Frontiers in psychology 4 (2013), p. 487.
[54] Tuomas Eerola and Jonna K. Vuoskoski. “A comparison of the discrete and dimensional models of emotion in music.” In: Psychology of Music 39.1 (2011), pp. 18–49. doi: 10.1177/0305735610362821.
[55] Paul Ekman and Wallace V Friesen. “Constants across cultures in the face
and emotion.” In: Journal of personality and social psychology 17.2 (1971),
p. 124.
[56] Anders Elowsson and Anders Friberg. “Modelling perception of speed in
music audio.” In: Proceedings of the Sound and Music Computing Conference.
Citeseer. 2013, pp. 735–741.
[57] Anders Elowsson and Anders Friberg. “Predicting the perception of per-
formed dynamics in music audio with ensemble learning.” In: The Journal
of the Acoustical Society of America 141.3 (2017), pp. 2224–2242.
[58] Mehmet Bilal Er and Ibrahim Berkan Aydilek. “Music emotion recognition
by using chroma spectrogram and deep visual features.” In: International
Journal of Computational Intelligence Systems 12.2 (2019), pp. 1622–1634.
[59] Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R Arabnia.
“A brief review of domain adaptation.” In: Advances in Data Science and
Information Engineering (2021), pp. 877–894.
[60] Pedro F Felzenszwalb and Daniel P Huttenlocher. “Efficient graph-based
image segmentation.” In: International journal of computer vision 59.2 (2004),
pp. 167–181.
[61] Francesco Foscarin, Katharina Hoedt, Verena Praher, Arthur Flexer, and
Gerhard Widmer. “Concept-Based Techniques for ‘Musicologist-friendly’
Explanations in a Deep Music Classifier.” In: arXiv preprint arXiv:2208.12485
(2022).
[87] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual
Learning for Image Recognition.” In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2016, pp. 770–778.
[88] Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam.
“Spleeter: a Fast and Efficient Music Source Separation Tool with Pre-
trained Models.” In: Journal of Open Source Software 5.50 (2020). Deezer
Research, p. 2154. doi: 10.21105/joss.02154.
[89] Kate Hevner. “Experimental studies of the elements of expression in
music.” In: The American journal of psychology 48.2 (1936), pp. 246–268.
[90] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge
in a Neural Network.” In: arXiv preprint arXiv:1503.02531 (2015).
[91] Andre Holzapfel, Bob Sturm, and Mark Coeckelbergh. “Ethical dimen-
sions of music information retrieval technology.” In: Transactions of the
International Society for Music Information Retrieval 1.1 (2018), pp. 44–55.
[92] Cheng-Zhi Anna Huang, Hendrik Vincent Koops, Ed Newton-Rex, Monica
Dinculescu, and Carrie J Cai. “AI song contest: Human-AI co-creation in
songwriting.” In: arXiv preprint arXiv:2010.05388 (2020).
[93] Moyuan Huang, Wenge Rong, Tom Arjannikov, Nan Jiang, and Zhang
Xiong. “Bi-modal deep boltzmann machine based musical emotion classi-
fication.” In: International Conference on Artificial Neural Networks. Springer.
2016, pp. 199–207.
[94] Colin Humphries, Merav Sabri, Kimberly Lewis, and Einat Liebenthal. “Hi-
erarchical organization of speech perception in human auditory cortex.”
In: Frontiers in neuroscience 8 (2014), p. 406.
[95] David Huron. “Perceptual and cognitive applications in music information
retrieval.” In: Perception 10.1 (2000), pp. 83–92.
[96] Patrik N Juslin. “Emotional communication in music performance: A
functionalist perspective and some data.” In: Music perception 14.4 (1997),
pp. 383–418.
[97] Patrik N Juslin. “Emotional reactions to music.” In: The Oxford handbook of
music psychology (2016), pp. 197–213.
[98] Patrik N Juslin. Musical emotions explained: Unlocking the secrets of musical
affect. Oxford University Press, USA, 2019.
[99] Patrik N Juslin, Simon Liljeström, Daniel Västfjäll, and Lars-Olov Lundqvist.
“How does music evoke emotions? Exploring the underlying mechanisms.”
In: (2010).
[100] Patrik N Juslin and Daniel Västfjäll. “Emotional responses to music: The
need to consider underlying mechanisms.” In: Behavioral and brain sciences
31.5 (2008), pp. 559–575.
[101] Rainer Kelz and Gerhard Widmer. “Towards Interpretable Polyphonic
Transcription with Invertible Neural Networks.” In: Proceedings of the 20th
International Society for Music Information Retrieval Conference, ISMIR 2019,
Delft, The Netherlands, November 4-8, 2019. 2019, pp. 376–383.
[102] Sameer Khurana, Niko Moritz, Takaaki Hori, and Jonathan Le Roux.
“Unsupervised domain adaptation for speech recognition via uncertainty
driven self-training.” In: ICASSP 2021-2021 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 6553–6557.
[103] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. “Examples are not
enough, learn to criticize! criticism for interpretability.” In: Advances in
neural information processing systems 29 (2016).
[104] Youngmoo E Kim, Erik M Schmidt, Raymond Migneco, Brandon G Morton,
Patrick Richardson, Jeffrey Scott, Jacquelin A Speck, and Douglas Turnbull.
“Music emotion recognition: A state of the art review.” In: Proceedings of the
11th International Society for Music Information Retrieval Conference, ISMIR
2010. Vol. 86. 2010, pp. 937–952.
[105] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic
Optimization.” In: 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
2015.
[106] Pang Wei Koh and Percy Liang. “Understanding black-box predictions via
influence functions.” In: International conference on machine learning. PMLR.
2017, pp. 1885–1894.
[107] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma
Pierson, Been Kim, and Percy Liang. “Concept bottleneck models.” In:
International Conference on Machine Learning. PMLR. 2020, pp. 5338–5348.
[108] Khaled Koutini, Hamid Eghbal-Zadeh, Matthias Dorfer, and Gerhard
Widmer. “The Receptive Field as a Regularizer in Deep Convolutional
Neural Networks for Acoustic Scene Classification.” In: 2019 27th European
signal processing conference (EUSIPCO). IEEE. 2019, pp. 1–5.
[109] Khaled Koutini, Hamid Eghbal-zadeh, and Gerhard Widmer. “Receptive
field regularization techniques for audio classification and tagging with
deep convolutional neural networks.” In: IEEE/ACM Transactions on Audio,
Speech, and Language Processing 29 (2021), pp. 1987–2000.
[110] Wouter M Kouw and Marco Loog. “An introduction to domain adaptation
and transfer learning.” In: arXiv preprint arXiv:1812.11806 (2018).
[111] Carol L Krumhansl. Cognitive foundations of musical pitch. Oxford University
Press, 2001.
[112] Solomon Kullback and Richard A Leibler. “On information and suffi-
ciency.” In: The annals of mathematical statistics 22.1 (1951), pp. 79–86.
[113] Olivier Lartillot, Tuomas Eerola, Petri Toiviainen, and Jose Fornari. “Multi-
Feature Modeling of Pulse Clarity: Design, Validation and Optimization.”
In: ISMIR. Citeseer. 2008, pp. 521–526.
[114] Olivier Lartillot, Petri Toiviainen, and Tuomas Eerola. “A matlab toolbox
for music information retrieval.” In: Data analysis, machine learning and
applications. Springer, 2008, pp. 261–268.
[115] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht.
“Sliced wasserstein discrepancy for unsupervised domain adaptation.”
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2019, pp. 10285–10295.
[116] Jing Li, Hongfei Lin, and Lijuan Zhou. “Emotion tag based music re-
trieval algorithm.” In: Asia Information Retrieval Symposium. Springer. 2010,
pp. 599–609.
[117] Tao Li and Mitsunori Ogihara. “Detecting emotion in music.” In: (2003).
[118] Dan Liu, Lie Lu, and Hong-Jiang Zhang. “Automatic mood detection from
acoustic music data.” In: (2003).
[119] Xiaofeng Liu, Chaehwa Yoo, Fangxu Xing, Hyejin Oh, Georges El Fakhri,
Je-Won Kang, Jonghye Woo, et al. “Deep unsupervised domain adaptation:
a review of recent advances and perspectives.” In: APSIPA Transactions on
Signal and Information Processing 11.1 (2022).
[120] Xin Liu, Qingcai Chen, Xiangping Wu, Yan Liu, and Yang Liu. “CNN
based music emotion classification.” In: arXiv preprint arXiv:1704.05665
(2017).
[121] Beth Logan and Ariel Salomon. “A Music Similarity Function Based on
Signal Analysis.” In: ICME. 2001, pp. 22–25.
[122] Ilya Loshchilov and Frank Hutter. “SGDR: Stochastic Gradient Descent
with Warm Restarts.” In: International Conference on Learning Representations.
2017. url: https://openreview.net/forum?id=Skq89Scxx.
[123] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. “Understanding
the effective receptive field in deep convolutional neural networks.” In:
Advances in neural information processing systems 29 (2016).
[124] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using
t-SNE.” In: Journal of machine learning research 9.11 (2008).
[125] Karl F MacDorman, Stuart Ough, and Chin-Chang Ho. “Automatic emotion prediction of song excerpts: Index construction, algorithm design, and empirical comparison.” In: Journal of New Music Research 36.4 (2007), pp. 281–299.
[126] Guy Madison and Johan Paulin. “Ratings of speed in real music as a
function of both original and manipulated beat tempo.” In: The Journal of
the Acoustical Society of America 128.5 (2010), pp. 3032–3040.
[127] Jens Madsen, Bjørn Sand Jensen, and Jan Larsen. “Predictive modeling
of expressed emotions in music using pairwise comparisons.” In: Interna-
tional Symposium on Computer Music Modeling and Retrieval. Springer. 2012,
pp. 253–277.
[128] Ricardo Malheiro, Renato Panda, Paulo JS Gomes, and Rui Pedro Paiva.
“Bi-Modal Music Emotion Recognition: Novel Lyrical Features and Dataset.”
In: 9th International Workshop on Music and Machine Learning–MML 2016–in
conjunction with the European Conference on Machine Learning and Principles
and Practice of Knowledge Discovery in Databases–ECML/PKDD 2016. 2016.
[129] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar,
Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis
in python.” In: Proceedings of the 14th python in science conference. Vol. 8.
2015, pp. 18–25.
[130] Gary J McKeown and Ian Sneddon. “Modeling continuous self-report
measures of perceived emotion using generalized additive mixed models.”
In: Psychological methods 19.1 (2014), p. 155.
[131] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and
Aram Galstyan. “A survey on bias and fairness in machine learning.” In:
ACM Computing Surveys (CSUR) 54.6 (2021), pp. 1–35.
[132] Albert Mehrabian. “Basic dimensions for a general psychological theory:
Implications for personality, social, environmental, and developmental
studies.” In: (1980).
[133] Tim Miller. “Explanation in artificial intelligence: Insights from the social
sciences.” In: Artificial intelligence 267 (2019), pp. 1–38.
[134] Luca Mion and Giovanni De Poli. “Score-independent audio features for
description of music expression.” In: IEEE Transactions on Audio, Speech,
and Language Processing 16.2 (2008), pp. 458–466.
[135] Saumitra Mishra, Bob L. Sturm, and Simon Dixon. “Local Interpretable
Model-Agnostic Explanations for Music Content Analysis.” In: Proceedings
of the 18th International Society for Music Information Retrieval Conference,
ISMIR 2017, Suzhou, China, October 23-27, 2017. 2017, pp. 537–543.
[136] Christine Mohn, Heike Argstatter, and Friedrich-Wilhelm Wilker. “Percep-
tion of six basic emotions in music.” In: Psychology of Music 39.4 (2011),
pp. 503–517.
[137] Christoph Molnar. Interpretable Machine Learning. A Guide for Making Black
Box Models Explainable. https://christophm.github.io/interpretable-ml-
book/. 2019.
[138] Mitchell Ohriner. “Lyric, rhythm, and non-alignment in the second verse
of Kendrick Lamar’s “Momma”.” In: Music Theory Online 25.1 (2019).
[139] Richard Orjesek, Roman Jarina, Michal Chmulik, and Michal Kuba. “DNN
Based Music Emotion Recognition from Raw Audio Signal.” In: 29th
International Conference Radioelektronika 2019 (RADIOELEKTRONIKA). IEEE.
2019, pp. 1–4.
[140] Andrew Ortony and Terence J Turner. “What’s basic about basic emo-
tions?” In: Psychological review 97.3 (1990), p. 315.
[141] Elias Pampalk, Simon Dixon, and Gerhard Widmer. “Exploring music
collections by browsing different views.” In: Computer Music Journal 28.2
(2004), pp. 49–62.
[142] Renato Eduardo Silva Panda. “Emotion-based analysis and classification
of audio music.” PhD thesis. Universidade de Coimbra, 2019.
[143] Renato Eduardo Silva Panda, Ricardo Malheiro, Bruno Rocha, António
Pedro Oliveira, and Rui Pedro Paiva. “Multi-modal music emotion recog-
nition: A new dataset, methodology and comparative analysis.” In: 10th In-
ternational Symposium on Computer Music Multidisciplinary Research (CMMR
2013). 2013, pp. 570–582.
[144] Renato Panda, Ricardo Manuel Malheiro, and Rui Pedro Paiva. “Audio
Features for Music Emotion Recognition: a Survey.” In: IEEE Transactions
on Affective Computing (2020), pp. 1–1. doi: 10.1109/TAFFC.2020.3032373.
[145] Renato Panda, Ricardo Malheiro, and Rui Pedro Paiva. “Novel audio
features for music emotion recognition.” In: IEEE Transactions on Affective
Computing 11.4 (2018), pp. 614–626.
[146] Alessia Pannese, Marc-André Rappaz, and Didier Grandjean. “Metaphor
and music emotion: Ancient views and future directions.” In: Consciousness
and Cognition 44 (2016), pp. 61–71.
[147] Jose Pinheiro, Douglas Bates, Saikat DebRoy, Deepayan Sarkar, and R Core
Team. nlme: Linear and Nonlinear Mixed Effects Models. R package version
3.1-152. 2021. url: https://CRAN.R-project.org/package=nlme.
[148] Robert Plutchik. “The nature of emotions: Human emotions have deep
evolutionary roots, a fact that may explain their complexity and provide
tools for clinical practice.” In: American scientist 89.4 (2001), pp. 344–350.
[149] Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas
F. Ehmann, and Xavier Serra. “End-to-end Learning for Music Audio
Tagging at Scale.” In: 19th International Society for Music Information Retrieval
Conference (ISMIR2018). Paris, 2018.
[150] Jonathan Posner, James A Russell, Andrew Gerber, Daniel Gorman, Tiziano
Colibazzi, Shan Yu, Zhishun Wang, Alayar Kangarlu, Hongtu Zhu, and
Bradley S Peterson. “The neurophysiological bases of emotion: An fMRI
study of the affective circumplex using emotion-denoting words.” In:
Human brain mapping 30.3 (2009), pp. 883–895.
[151] Jonathan Posner, James A Russell, and Bradley S Peterson. “The circum-
plex model of affect: An integrative approach to affective neuroscience,
cognitive development, and psychopathology.” In: Development and psy-
chopathology 17.3 (2005), pp. 715–734.
[152] Romila Pradhan, Jiongli Zhu, Boris Glavic, and Babak Salimi. “Inter-
pretable Data-Based Explanations for Fairness Debugging.” In: arXiv
preprint arXiv:2112.09745 (2021).
[153] Verena Praher, Katharina Prinz, Arthur Flexer, and Gerhard Widmer. “On
the Veracity of Local, Model-agnostic Explanations in Audio Classification:
Targeted Investigations with Adversarial Examples.” In: arXiv preprint
arXiv:2107.09045 (2021).
[154] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark
Chen. “Hierarchical text-conditional image generation with clip latents.”
In: arXiv preprint arXiv:2204.06125 (2022).
[155] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. “”Why Should I
Trust You?”: Explaining the Predictions of Any Classifier.” In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM, 2016,
pp. 1135–1144. doi: 10.1145/2939672.2939778.
[156] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. “”Why Should I
Trust You?”: Explaining the Predictions of Any Classifier.” In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM, 2016,
pp. 1135–1144. doi: 10.1145/2939672.2939778.
[157] Peter J Richerson, Robert Boyd, and Joseph Henrich. “Gene-culture coevo-
lution in the age of genomics.” In: Proceedings of the National Academy of
Sciences 107.Supplement 2 (2010), pp. 8985–8992.
[158] Peter J Rousseeuw and Katrien Van Driessen. “A fast algorithm for the
minimum covariance determinant estimator.” In: Technometrics 41.3 (1999),
pp. 212–223.
[159] James A Russell. “A circumplex model of affect.” In: Journal of personality
and social psychology 39.6 (1980), p. 1161.
[160] James A Russell. “Core affect and the psychological construction of emo-
tion.” In: Psychological review 110.1 (2003), p. 145.
[161] James A Russell and Beverly Fehr. “Fuzzy concepts in a fuzzy hierarchy:
varieties of anger.” In: Journal of personality and social psychology 67.2 (1994),
p. 186.
[162] Ulrich Schimmack and Alexander Grob. “Dimensional models of core
affect: A quantitative comparison by means of structural equation model-
ing.” In: European Journal of Personality 14.4 (2000), pp. 325–345.
[163] Erik M Schmidt and Youngmoo E Kim. “Learning emotion-based acoustic
features with deep belief networks.” In: 2011 IEEE workshop on applications
of signal processing to audio and acoustics (Waspaa). IEEE. 2011, pp. 65–68.
[164] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna
Vedantam, Devi Parikh, and Dhruv Batra. “Grad-cam: Visual explanations
from deep networks via gradient-based localization.” In: Proceedings of the
IEEE international conference on computer vision. 2017, pp. 618–626.
[165] Yading Song, Simon Dixon, Marcus T Pearce, and Andrea R Halpern.
“Perceived and induced emotion responses to popular music: Categorical
and dimensional models.” In: Music Perception: An Interdisciplinary Journal
33.4 (2016), pp. 472–492.
[166] Erik Strumbelj and Igor Kononenko. “An efficient explanation of individ-
ual classifications using game theory.” In: The Journal of Machine Learning
Research 11 (2010), pp. 1–18.
[167] Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. “Unsupervised Do-
main Adaptation through Self-Supervision.” In: arXiv preprint arXiv:1909.11825
(2019).
[168] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbig-
niew Wojna. “Rethinking the inception architecture for computer vision.”
In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 2818–2826.
[169] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and
Zbigniew Wojna. “Rethinking the Inception Architecture for Computer
Vision.” In: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (2016), pp. 2818–2826.
[170] Robert E Thayer. The biopsychology of mood and arousal. Oxford University
Press, 1990.
[171] WIRED.com. Jacob Collier Plays the Same Song In 18 Increasingly Complex
Emotions — WIRED. 2020. url: https://www.youtube.com/watch?v=
EWHpdmDHrn8 (visited on 09/07/2022).
[172] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng,
and Tao Qin. “Generalizing to unseen domains: A survey on domain
generalization.” In: arXiv preprint arXiv:2103.03097 (2021).
[173] David Watson, Lee Anna Clark, and Auke Tellegen. “Development and
validation of brief measures of positive and negative affect: the PANAS
scales.” In: Journal of personality and social psychology 54.6 (1988), p. 1063.
[174] Lage Wedin. “A multidimensional study of perceptual-emotional qualities
in music.” In: Scandinavian journal of psychology 13.1 (1972), pp. 241–257.
[175] Felix Weninger, Florian Eyben, and Björn Schuller. “On-line continuous-
time music mood regression with deep recurrent neural networks.” In:
2014 IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE. 2014, pp. 5412–5416.
[176] Gerhard Widmer. “Applications of machine learning to music research:
Empirical investigations into the phenomenon of musical expression.”
In: Machine Learning, Data Mining and Knowledge Discovery: Methods and
Applications. Wiley & Sons, Chichester (UK) (1998).
[177] Gerhard Widmer. “Getting closer to the essence of music: The Con Espres-
sione Manifesto.” In: ACM Transactions on Intelligent Systems and Technology
(TIST) 8.2 (2017), p. 19.
[178] Minz Won, Sanghyuk Chun, and Xavier Serra. “Toward Interpretable
Music Tagging with Self-Attention.” In: CoRR abs/1906.04972 (2019). arXiv:
1906.04972.
[179] Cheng Yang. Content-based music retrieval on acoustic data. Stanford Univer-
sity, 2003.
[180] Dan Yang and Won-Sook Lee. “Disambiguating Music Emotion Using
Software Agents.” In: ISMIR. Vol. 4. 2004, pp. 218–223.
[181] Li-Chia Yang and Alexander Lerch. “On the evaluation of generative mod-
els in music.” In: Neural Computing and Applications 32.9 (2020), pp. 4773–
4784.
[182] Yi-Hsuan Yang, Yu-Ching Lin, Ya-Fan Su, and Homer H Chen. “Music
emotion classification: A regression approach.” In: 2007 IEEE International
Conference on Multimedia and Expo. IEEE. 2007, pp. 208–211.
[183] Marcel Zentner, Didier Grandjean, and Klaus R Scherer. “Emotions evoked
by the sound of music: characterization, classification, and measurement.”
In: Emotion 8.4 (2008), p. 494.
[184] Kejun Zhang, Hui Zhang, Simeng Li, Changyuan Yang, and Lingyun Sun.
“The PMEmo Dataset for Music Emotion Recognition.” In: Proceedings of
the 2018 ACM on International Conference on Multimedia Retrieval. ICMR ’18.
Yokohama, Japan: ACM, 2018, pp. 135–142. isbn: 978-1-4503-5046-4. doi:
10.1145/3206025.3206037.
[185] Youshan Zhang. “A Survey of Unsupervised Domain Adaptation for
Visual Recognition.” In: arXiv preprint arXiv:2112.06745 (2021).
[186] Han Zhao, Remi Tachet Des Combes, Kun Zhang, and Geoffrey Gor-
don. “On learning invariant representations for domain adaptation.” In:
International Conference on Machine Learning. PMLR. 2019, pp. 7523–7532.