Dissertation Kamehkhosh
Dissertation
von
Iman Kamehkhosh
Dortmund
2017
Date of the oral examination: 26.02.2018
Reviewers:
Prof. Dr. Dietmar Jannach
Prof. Dr. Günter Rudolph
Abstract
Technological advances in the music industry have dramatically changed how people
access and listen to music. Today, online music stores and streaming services
offer easy and immediate means to buy or listen to a huge number of songs. One
traditional way to find interesting items when such a vast number of choices is
available is to ask others for recommendations. Music providers correspondingly
employ music recommender systems as a software solution to the problem of music
overload, providing a better user experience for their customers. At the same
time, an enhanced user experience can lead to higher customer retention and higher
business value for music providers.
and sequence-aware algorithms. Moreover, a number of challenges, such as
personalizing next-track music recommendations and generating recommendations that
are coherent with the user's listening history, are discussed. Furthermore, some
common approaches in the literature to determine relevant quality criteria for
next-track music recommendations and to evaluate the quality of such
recommendations are presented.
The second part of the thesis contains a selection of the author's publications on
next-track music recommendation.
Contents
1 Introduction
1.1 Motivation
1.2 The Music Recommendation Problem
1.2.1 Characterization of the Music Recommendation Problem
1.2.2 Music Recommendation Scenarios
1.2.3 Particularities and Challenges of Music Recommendation
1.3 Next-Track Music Recommendation
1.4 Research Questions
1.5 Outline of the Thesis
1.6 Publications
1.6.1 Analyzing the Characteristics of Shared Playlists for Music Recommendation
1.6.2 Beyond “Hitting the Hits” – Generating Coherent Music Playlist Continuations with the Right Tracks
1.6.3 Biases in Automated Music Playlist Generation: A Comparison of Next-Track Recommending Techniques
1.6.4 Leveraging Multi-Dimensional User Models for Personalized Next-Track Music Recommendation
1.6.5 User Perception of Next-Track Music Recommendations
1.6.6 A Comparison of Frequent Pattern Techniques and a Deep Learning Method for Session-Based Recommendation
3 Evaluation of Next-Track Recommendations
3.1 How to Determine Quality Criteria for Next-Track Recommendations
3.1.1 Analyzing the Characteristics of User Playlists
3.1.2 Conducting User Studies
3.2 Evaluation Approaches
3.2.1 Log Analysis
3.2.2 Objective Measures
3.2.3 Comparison with Hand-Crafted Playlists
3.2.4 User Studies
4 Conclusion
4.1 Summary
4.2 Perspectives
Bibliography
List of Figures
Publications
Analyzing the Characteristics of Shared Playlists for Music Recommendation
Beyond “Hitting the Hits” – Generating Coherent Music Playlist Continuations with the Right Tracks
Biases in Automated Music Playlist Generation: A Comparison of Next-Track Recommending Techniques
Leveraging Multi-Dimensional User Models for Personalized Next-Track Music Recommendation
User Perception of Next-Track Music Recommendations
A Comparison of Frequent Pattern Techniques and a Deep Learning Method for Session-Based Recommendation
1 Introduction
1.1 Motivation
No one knows how the story of music began but there is evidence of our caveman
ancestors making flutes and whistles out of animal bones. Through its ongoing
progress, music has become a massive global phenomenon. Today, it is hard for us
to imagine a time – in the days before music could be recorded – when people could
go weeks without hearing any music at all [Goo13].
The invention of recording and playback devices in the late 19th century changed
music listening from a “live-only” event in concert halls or churches to a more
intimate experience. Portable cassette players marked another turning point in
how people listened to music by making music mobile. With the creation of the
compact disc and the invention of the MP3 format, music entered the digital era.
The launch of the first websites for downloading and sharing music, e.g., eMusic1
and Napster2, changed how people accessed music yet again.
Almost a century after the first radio music was broadcast in 1906, Last.fm3 launched
the first ad-funded Internet radio platform offering personalized music. In recent
years, music streaming has become the dominant way of consuming music and the most
profitable revenue source in the music industry: in the first half of 2017, 75% of
music consumers used streaming services, and 62% of the U.S. music industry
revenues came from streaming [Fri16].
1 https://www.emusic.com/
2 https://www.napster.com/
3 https://www.last.fm/
A remarkable impact of digitalization and the Internet on music is the ease of
immediate access to a huge number of songs. Major music streaming services, e.g.,
Spotify4 , and online music stores like iTunes5 have over 30 million songs [Pre17],
adding thousands of new songs every month. All this music can be accessed anytime
through an online application or an app on a mobile device. Besides its potential
for discovering new songs and artists, this vast amount of data can easily lead to
information anxiety [Wur90] for music consumers and make it difficult for them to
come to a decision.
One of the first music recommender systems was an email-based service called
Ringo [Sha+95]. Users first rated a list of artists, stating how much they liked
listening to them. Based on these ratings, users could ask Ringo to suggest new
artists or albums that they would like or dislike, along with a prediction of how
much they would like each one. Technological progress in the music domain together
with changes in our listening habits has, however, opened new opportunities for
other recommendation scenarios. For instance, the discovery feature of Spotify
provides users with personalized recommendations through weekly playlists and a
playlist of newly released tracks that might be interesting to them. Furthermore,
non-personalized recommendations of trending tracks and curated playlists are
common on most music platforms.
4 https://www.spotify.com
5 https://www.apple.com/itunes/
6 https://www.pandora.com/
7 https://www.deezer.com
8 https://play.google.com/music/listen
9 https://www.apple.com/music/
The remainder of this chapter is mainly dedicated to the music recommendation
problem in general (Section 1.2) and a short description of next-track music rec-
ommendation as a specific form of music recommendation (Section 1.3). Next, the
research questions that are addressed in this thesis are discussed (Section 1.4). At
the end of this chapter, an outline of the remainder of this thesis (Section 1.5) and a
list of the publications that are included in it are presented (Section 1.6).
1.2 The Music Recommendation Problem

1.2.1 Characterization of the Music Recommendation Problem

Like in the general field of recommender systems [Voz+03], the basic entities of a
music recommender system are (1) the user, i.e., the music listener or consumer who
interacts with a streaming service, a music player, or an online music store, and
(2) the item, i.e., the music item, such as a track, artist, or playlist.
Figure 1.1 illustrates the components of a music recommender system. Schedl et al.
[Sch+17c] categorize the input of a recommender system into two groups: user
inputs and item inputs. The user inputs consist of (i) the listener's background,
like her demographic data, music experience, and preferences, (ii) the listener's
intent, e.g., changing mood or finding motivation, and (iii) the listener's
context, like her mood, the time of the day, or her current social environment.
Schedl et al. [Sch+17c] also introduce three components for the item inputs:
(i) the content of music, i.e., the musical characteristics of a track, like its
rhythm, timbre, or melody, (ii) the purpose of music, i.e., the intention of the
author of the music, which could be political, social, etc., and (iii) the context
of music, which can be determined, for example, through cover artwork, video
clips, or social tags.
The goal of music recommender systems is then to predict the preferences of a user
for music items, using the input data, and to generate recommendations based on
these predictions. The generated music recommendations could either concern novel
tracks, artists, or albums that are new to the user, or items that the user
already knows but might have forgotten about, or that might match her current
context or listening intention.
[Figure 1.1: Components of a music recommender system. User inputs (listener
background, listener intent, listener context) and item inputs (music content,
music purpose, music context) feed the music recommendation algorithm, whose
output consists of predictions and recommendations.]
1.2.2 Music Recommendation Scenarios
Recommend similar items. A list of tracks similar to the currently playing track or
of artists similar to the user's favorite artist can also be found on many music
services. The similarity of tracks is often determined through audio content
analysis of features such as tempo, timbre, or pitch and is mainly addressed in
the Music Information Retrieval (MIR) literature, see, e.g., [Cas+08] or [Mül15].
For artist similarity, collaborative filtering approaches and text analysis of
user-generated content and lyrics have been applied in the literature [Kne+13].
• recommending a curated playlist based, e.g., on the time of the day, day of the
week, or different moods and activities that can be either non-personalized
like the “Genres & Moods” playlists of Spotify or personalized like “My Mixes”
of Apple Music;
Create radio stations. Making music recommendations for radio stations is another
scenario in this domain that can, again, be either personalized or non-personalized.
In contrast to the playlist-recommendation scenario, in which recommendations
are presented “in batch”, i.e., as a whole list, radio station recommendations are
sequential and usually presented one after the other [Sch+17c]. One application
area of such recommendations is broadcast radio, which often uses playlists made
by disc jockeys containing popular tracks and targeting certain audiences [Eke+00].
In this case, users have no interaction with the system and cannot influence the
recommendations. A more recent application area for such recommendations is
virtual radio stations, in which a virtually endless playlist is created given a
seed track or artist [Cli06; Moe+10; Jyl+12]. The process of creating virtual
radio stations can be personalized based on the user's music taste and immediate
feedback, e.g., “thumbs-up/down” and “skip” actions.
1.2.3 Particularities and Challenges of Music Recommendation

In principle, some of the scenarios discussed in the previous section can be
addressed with approaches from other recommendation domains like e-commerce. For
instance, collaborative filtering approaches can be utilized to generate a list of
relevant tracks for a user based on her previously liked tracks. Or, session-based
recommendation techniques from e-commerce can be applied to generate radio
stations given the user's recently played tracks. However, there are specific
challenges or aspects that are at least more relevant in music recommendation
scenarios. In this section, some particularities and challenges of the music
recommendation problem are discussed, which are partly adapted from [Sch+17c].
The available music catalog is comparably large. While movie streaming services
typically have up to tens of thousands of movies, the number of tracks on Spotify,
for instance, is more than 30 million, and new items are constantly added to the
catalog. The main challenge in this context, especially for academic experiments,
is the scalability of the recommendation approaches.
[Figure 1.2: The next-track recommendation process. Given the user's recent
listening history (seed tracks), tracks are selected from an available pool of
tracks using a background knowledge database to produce next-track
recommendations.]
Figure 1.2 illustrates the next-track recommendation process. Bonnin et al. [Bon+14]
define “automatic playlist generation” as selecting a sequence of tracks that
fulfill the target characteristics of the playlist from an available pool of
tracks using a background knowledge database. This definition can be adopted for
the next-track recommendation process in both of the above-mentioned scenarios.
As will be discussed later in this thesis, additional information such as general user
preferences or contextual and emotional information about users can be utilized in
the recommendation process to improve the quality of next-track recommendations.
1.4 Research Questions

In this thesis, a number of research questions are considered regarding topics
that have not yet been fully investigated in the research field of next-track
music recommendation. Details of how these questions have been developed and why
it is important to seek answers to them will be discussed in the following
chapters.
2. How can we combine patterns in the users’ listening behavior with metadata
features (e.g., artist, genre, release year), audio features (e.g., tempo or
loudness), and personal user preferences to achieve higher recommendation
accuracy? How can we utilize these input signals to optimize the selection of
next tracks to create recommendations that are more coherent with the user’s
listening history?
3. How can we extract long-term preference signals like favorite tracks, favorite
artists, or favorite topics from the users’ long-term listening behavior for
personalizing the next-track music recommendations? How can these signals
be effectively combined with the user’s short-term interests?
4. To what extent do the objective quality measures that are largely used in offline
experiments to evaluate the quality of next-track recommendations correlate
with the quality perception of music listeners?
In the context of this thesis, a number of novel algorithms and approaches were
developed to answer the above-mentioned research questions. The effectiveness
and usefulness of the proposed algorithms and approaches were explored through
several offline and online experiments.
track co-occurrence patterns in publicly shared playlists, music and metadata
features as well as personal user preferences. In the second phase, we optimize
the set of next tracks to be played with respect to the user’s individual tendency
in different dimensions, e.g., artist diversity. Technically, we re-rank the tracks
selected in the previous phase in a way that the resulting recommendation list
matches the characteristics of the user’s recent listening history.
4. We design and conduct an online user study (N=277) to assess to what extent
   the outcomes of offline experiments in the music domain correlate with the
   users' quality perception. Based on the insights obtained from our offline
   experiments, we state four research questions regarding (1) the suitability of
   manually created playlists for the evaluation of next-track recommending
   techniques, (2) the effect of incorporating additional signals, e.g., musical
   features or metadata, into the recommendation process on the users' quality
   perception, (3) the users' perception of popular recommendations, and (4) the
   effect of familiar recommendations on the subjective quality perception of users.
1.5 Outline of the Thesis

The rest of this thesis is structured as follows. Chapter 2 reviews next-track
recommendation algorithms from the research literature. These algorithms are
categorized into the four general groups of content-based filtering, collaborative
filtering, co-occurrence-based, and sequence-aware algorithms. Afterwards, the
results of a multi-dimensional comparison of a number of these algorithms from
[Kam+17a] are presented. Next, with respect to the challenges of next-track music
recommendation, different approaches for personalizing next-track music
recommendations based on the users' long-term listening preferences from [Jan+17a]
are presented. Finally, the algorithmic approaches from the literature for
balancing the trade-offs between accuracy and other quality criteria like artist
diversity are reviewed. In this context, the approach proposed in [Jan+15a] to
generate next-track recommendations optimized in terms of different quality
dimensions is presented.
Chapter 4 concludes the first part of this thesis by summarizing the discussed topics
and by presenting future perspectives for the next-track music recommendation
research. The second part of this thesis includes six of the author’s publications that
are listed in the next section.
1.6 Publications
The individual contributions of the author to the included publications in this thesis
are as follows. The complete list of the author’s publications can be found in the
appendix.
1.6.1 Analyzing the Characteristics of Shared Playlists for Music Recommendation

Dietmar Jannach, Iman Kamehkhosh, and Geoffray Bonnin. “Analyzing the
Characteristics of Shared Playlists for Music Recommendation”. In: Proceedings of
the 6th Workshop on Recommender Systems and the Social Web at ACM RecSys. 2014

This paper was joint work with Dietmar Jannach and Geoffray Bonnin. The author
of this thesis was involved in collecting the required music data, designing and
implementing the experiments, as well as evaluating the results.
1.6.2 Beyond “Hitting the Hits” – Generating Coherent Music Playlist
Continuations with the Right Tracks
Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. “Beyond ‘Hitting the Hits’:
Generating Coherent Music Playlist Continuations with the Right Tracks”.
In: Proceedings of the 9th ACM Conference on Recommender Systems. RecSys ’15.
2015, pp. 187–194
This work was written together with Dietmar Jannach and Lukas Lerche. The
proposed “recommendation-optimization” approach was designed and developed by
the author of this thesis in collaboration with the other authors of the paper. The
author of this thesis was responsible for conducting the experiments and evaluating
the results and wrote parts of the text.
This study was a joint effort with Dietmar Jannach and Geoffray Bonnin. The
author of this thesis contributed to the data collection, experimental design and
implementation, and analyzing the results. He also wrote parts of the text.
The paper was written with Dietmar Jannach and Lukas Lerche. The author of this
thesis contributed to all parts of the paper and wrote parts of the text. The first
version of this paper was presented at a workshop [Kam+16].
1.6.5 User Perception of Next-Track Music Recommendations
This paper was written together with Dietmar Jannach. The author of this thesis
contributed to all parts of the paper (including the design of the experiment, the
implementation of the application, and the evaluation of the collected data) and
wrote the major part of the text.
The paper is the result of a joint work with Dietmar Jannach and Malte Ludewig.
The experiments and the evaluation of the results were performed by the author of
this thesis who also wrote the major part of the text.
2 Next-Track Recommendation Algorithms
A variety of algorithmic approaches have been proposed in the literature for the
next-track recommendation task, also known as the playlist continuation problem.
In this chapter, these approaches are organized into four categories:
content-based filtering, collaborative filtering, co-occurrence-based, and
sequence-aware. After a brief review of the research literature on next-track
music recommendation algorithms and a presentation of the results of a
multi-faceted comparison of a number of these approaches, two key challenges in
this context, along with our proposed approaches to deal with them, are introduced
at the end of this chapter. These challenges relate to personalizing next-track
recommendations and to balancing the possible trade-offs between accuracy and
other quality factors, e.g., diversity.
10 For a detailed discussion on different types of content information see [Bog13].
2.1.1 Audio-Based Approaches

Extracting and processing audio content such as pitch, loudness [Blu+99], chord
changes [Tza02], and mel-frequency cepstral coefficients (MFCC) [Tza+02; Bog+10]
from a music file, using, e.g., machine learning approaches [Sch+11], is the main
focus of music information retrieval (MIR) research, see, e.g., [Cas+08; Mül15],
or [Wei+16].
More recent approaches based on audio content utilize deep learning techniques
[Ben09] for both the feature extraction task [Hum+12; Die+14] and the recom-
mendation problem [Oor+13; Wan+14]. For instance, Humphrey et al. [Hum+12]
reviewed deep architectures and feature learning as alternative approaches for
traditional feature engineering in content-based MIR tasks. Moreover, Dieleman
et al. [Die+14] investigated the capability of convolutional neural networks (CNNs)
[LeC+98] to learn features from raw audio for the tag prediction task. Their results
showed that CNNs are able to autonomously discover frequency decompositions as
well as phase and translation-invariant features.
One of the earliest deep learning content-based approaches for music recommenda-
tion was proposed by Oord et al. [Oor+13]. They compared a traditional approach
using a “bag-of-words” representation of the audio signals with deep convolutional
neural networks in predicting latent factors from music audio. They evaluated the
predictions by using them for music recommendation and concluded that using
CNNs can lead to novel music recommendations and reduce the popularity bias. In
a similar work, Wang et al. [Wan+14] introduced a content-based recommendation
model using a probabilistic graphical model and a deep belief network (DBN)
[Hin+06], which unifies the feature learning and recommendation phases. Their
experiments on The Echo Nest Taste Profile Subset [McF+12b] showed that the
proposed deep learning method outperforms traditional content-based approaches in
terms of predictive performance in both cold- and warm-start situations.
In contrast to audio content, metadata is not extracted from the music file. In fact,
metadata-based approaches rely primarily on information, such as artist, album and
genre [Bog+11; Aiz+12], release year [VG+05], or lyrics [Coe+13] of the tracks.
Matrix factorization (MF) methods [Pan+08; Kor+09], which aim to find latent
features that determine how a user interacts with an item, have been developed to
alleviate the data sparsity problem of CF. In the music domain, for instance, Spotify
uses a distributed MF method based on listening logs for its discovery feature [Joh14;
Joh+15]. In this particular implementation of matrix factorization, the entries in the
user-item matrix are not necessarily the explicit item ratings, but they correspond
to the number of times a user has played each track.11 Moreover, a distributed
computing architecture based on the map-reduce scheme was utilized by Spotify to
cope with the huge amount of required computations.
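As a rough sketch of this idea — factorizing a matrix of implicit play counts rather than explicit ratings — the following toy example runs SGD where any played track is treated as a positive preference and the play count only raises the confidence of that observation. This is an illustration of the general implicit-feedback MF technique, not Spotify's distributed implementation; all parameter values are arbitrary.

```python
import numpy as np

def factorize_play_counts(counts, n_factors=8, n_iters=300, lr=0.02, reg=0.05, seed=0):
    """Sketch of MF on implicit play counts via SGD.

    `counts` is a dense (n_users x n_tracks) array of play counts. Each
    played entry is a binary positive preference; the count only sets
    the confidence weight 1 + log1p(count) of that observation.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = counts.shape
    U = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
    V = rng.normal(0, 0.1, (n_items, n_factors))   # track latent factors
    pref = (counts > 0).astype(float)              # binary preference signal
    conf = 1.0 + np.log1p(counts)                  # confidence from play counts
    for _ in range(n_iters):
        for u in range(n_users):
            for i in range(n_items):
                err = pref[u, i] - U[u] @ V[i]
                g = conf[u, i] * err
                U[u] += lr * (g * V[i] - reg * U[u])
                V[i] += lr * (g * U[u] - reg * V[i])
    return U, V

counts = np.array([[5, 0, 2, 0],
                   [4, 1, 0, 0],
                   [0, 0, 3, 4]])
U, V = factorize_play_counts(counts)
scores = U @ V.T  # predicted preference for every (user, track) pair
```

A production system would use a scalable solver (e.g., alternating least squares over a distributed cluster) instead of this dense double loop.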
[Figure 2.1, schematic; the similarity measure shown is the binary cosine
similarity sim(h, sᵢ) = |h ∩ sᵢ| / (√|h| · √|sᵢ|) between the listening history h
and a past session sᵢ.]
Figure 2.1: The proposed kNN approach in [Bon+14]. The k nearest neighbors of the
recent listening history of the user are computed based on the cosine similarity
of the tracks in the listening history and the tracks in the past listening sessions
in the training data.
13 Co-occurrence-based methods are also known as frequent pattern approaches in the literature.
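The neighborhood scheme of Figure 2.1 can be illustrated with a minimal sketch: sessions are treated as track sets, neighbors are ranked by binary cosine similarity to the listening history, and candidate tracks are scored by the summed similarity of the neighbor sessions they appear in. Function names and the scoring rule are illustrative, not the exact formulation from [Bon+14].

```python
import math
from collections import defaultdict

def binary_cosine(h, s):
    """Cosine similarity between two track sets (binary vectors)."""
    if not h or not s:
        return 0.0
    return len(h & s) / math.sqrt(len(h) * len(s))

def knn_next_track_scores(history, past_sessions, k=2):
    """Score candidate next tracks from the k past sessions most similar
    to the user's recent listening history."""
    h = set(history)
    neighbors = sorted(past_sessions,
                       key=lambda s: binary_cosine(h, set(s)),
                       reverse=True)[:k]
    track_scores = defaultdict(float)
    for s in neighbors:
        sim = binary_cosine(h, set(s))
        for track in set(s) - h:  # only tracks the user has not just heard
            track_scores[track] += sim
    return dict(track_scores)

history = ["a", "b", "c"]
sessions = [["a", "b", "d"], ["x", "y", "z"], ["b", "c", "e"]]
knn_scores = knn_next_track_scores(history, sessions, k=2)
```

Here sessions 1 and 3 each share two tracks with the history, so their unseen tracks "d" and "e" receive equal scores, while the unrelated session contributes nothing.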
The algorithms presented in the previous section do not consider the sequence
of the tracks in a listening session or playlist for which the next track should
be recommended. To address this limitation, a number of sequence-aware techniques
have also been investigated in the literature.
The problem of predicting the next actions of users based solely on their sequence
of actions in the current session is referred to in the literature as “session-based
recommendation”. Many of the algorithms that have been reviewed in this chapter
such as frequent pattern mining algorithms or approaches that are based on sequence
modeling can be employed to address the session-based recommendation problem.
Algorithms. The RNN model applied in [Hid+15] uses gated recurrent units (GRUs) to
deal with the vanishing or exploding gradient problem. Figure 2.2 depicts the
general architecture of this model. The network takes the current item of the
session as input and outputs a score for every item, estimating how likely each
one is to appear next in the session.
[Figure 2.2: General architecture of the GRU-based RNN model (a stack of GRU
layers), adapted from [Hid+15].]
The session-based kNN method used in [Kam+17a] is similar to the kNN approach
from [Bon+14] described above (see Figure 2.1). It looks for the k most similar
past sessions (neighbors) in the training data based on the set of items in the current
session. However, to reduce the computational complexity, only a limited number
of the most recent sessions are considered in this process. The other kNN-based
approach in this work – in addition to the item co-occurrence patterns in sessions
– considers the sequence of the items in a session. More precisely, a track will be
considered recommendable, only if it appears in the neighbor listening session directly
after the last item of the current listening history.
Both of the rule-mining approaches deployed in [Kam+17a] define a rule for every
two items that appear together in the training sessions (rule size of two). For the
association rules method the weight of each rule is the number of co-occurrences
of the two items, while the sequential rules technique in addition considers the
sequence of the items in a session.
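A minimal sketch of the two rule-mining variants might look like this; it only illustrates the unordered vs. ordered counting of size-two rules, and the exact weighting in [Kam+17a] may differ.

```python
from collections import defaultdict

def mine_rules(sessions):
    """Size-two rules: association-rule weights count co-occurrence in a
    session regardless of order; sequential-rule weights additionally
    require the antecedent to precede the consequent."""
    assoc = defaultdict(int)  # (x, y): co-occurrence count, unordered
    seq = defaultdict(int)    # (x, y): count of x appearing before y
    for session in sessions:
        for i, x in enumerate(session):
            for y in session[i + 1:]:
                assoc[(x, y)] += 1
                assoc[(y, x)] += 1
                seq[(x, y)] += 1  # order preserved: x came before y
    return dict(assoc), dict(seq)

assoc, seq = mine_rules([["a", "b", "c"], ["b", "a"]])
```

On this toy input, "a" and "b" co-occur in both sessions (association weight 2), but each ordering "a before b" and "b before a" is observed only once (sequential weight 1 each).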
Datasets. For the e-commerce domain, we chose the ACM RecSys 2015 Challenge
dataset (of 8 million shopping sessions) and a public dataset published for the TMall
competition (of 4.6 million shopping logs from the Tmall.com website). For the music
datasets, we used two subsets of listening logs collections from the #nowplaying
dataset [Zan+12] (with 95,000 listening sessions) and the 30Music dataset [Tur+15]
14 In the version of this algorithm implemented by Hidasi et al. [Hid+15], which is publicly available at https://github.com/hidasib/GRU4Rec, only one GRU layer is used.
• For the listening logs datasets, the sequence-aware methods generally worked
  better than the sequence-agnostic approaches. In particular, the
  sequential-rules approach outperformed GRU4REC.
Biases of the algorithms. With respect to popularity bias and catalog coverage, the
results showed that
• the sequence-agnostic methods (e.g., kNN and association rules) are more
biased towards popular items and focus their recommendations on a smaller
set of items;
2.6 Challenges

In the introduction of this thesis, the particularities and challenges of music
recommendation were discussed. In general, the subjectiveness and
context-dependency of music make the recommendation task more challenging. In this
section, we present two challenging aspects of next-track recommendation that have
not been investigated to a large extent in the music recommendation literature,
even though they affect the user's quality perception of music recommendations.
Most of the algorithms reviewed in the previous sections focus solely on the
user's recent listening behavior or current context and do not consider the
long-term preferences of the user. In one of the few attempts in the literature
where next-track recommendation is personalized, Wu et al. [Wu+13] proposed
personalized Markov embedding (PME) for the next-track recommendation problem in
online karaokes. Technically, they first embed songs and users into a Euclidean
space in which the distances between songs and users reflect the strength of
their relationships.
[Figure 2.3, schematic: an overall scorer sums the baseline relevance score
(weight w_base) with additional personalization scores s₁, s₂, …, sₘ for the
target track.]
Figure 2.3: Illustration of the multi-faceted scoring scheme to combine a baseline algorithm
with personalization components in a weighted approach [Jan+17a].
Technically, the overall relevance score score_overall for a possible next track
t*, given the current listening history h, is computed as follows [Jan+17a]:

score_overall(t*, h) = w_base · score_base(t*, h) + Σ_{pers ∈ P} w_pers · score_pers(t*, h)

where P is a set of personalization strategies, each with a different weight
w_pers, and w_base is the weight of the baseline. The functions score_base and
score_pers compute the baseline score and the scores of the individual
personalization components, respectively.
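The weighted scoring scheme described above can be sketched as follows; the scorer functions and weights are illustrative stand-ins, not the actual components of [Jan+17a].

```python
def score_overall(track, history, score_base, personalizers, w_base):
    """Weighted combination of a baseline relevance score with a set of
    personalization scores. `personalizers` is a list of
    (scorer, weight) pairs, one per personalization strategy."""
    total = w_base * score_base(track, history)
    for score_pers, w_pers in personalizers:
        total += w_pers * score_pers(track, history)
    return total

# Hypothetical toy scorers: the baseline prefers track "a"; the single
# personalizer boosts tracks the user has played before.
base = lambda t, h: 1.0 if t == "a" else 0.2
fav = lambda t, h: 1.0 if t in h else 0.0
combined = score_overall("b", ["b", "c"], base, [(fav, 0.5)], w_base=1.0)
```

For track "b" this yields 1.0 · 0.2 + 0.5 · 1.0 = 0.7; tuning the weights shifts the balance between the baseline and the personalization signals.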
The personalization approaches proposed in this work consider the following
signals: favorite tracks, favorite artists, topic similarity, extended
neighborhoods, and online social friends.
Favorite tracks. Users usually listen to their favorite tracks over and over again.
This simple pattern in the listening behavior of users can be exploited by
recommender systems to generate personalized recommendations. We examined
different strategies to determine which tracks from the user's previously played
tracks to recommend. For instance, we selected generally popular tracks from her
listening history, tracks that had been played at the same time of the day in the
past, or tracks from the same artists as in the current session. Since users tend
to re-consume more recent items [And+14], we assigned higher weights to tracks
that the user had listened to more recently.
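One simple way to realize such recency weighting is an exponential decay over the age of each play; the decay scheme here is an assumption for illustration, not the exact weighting used in [Jan+17a].

```python
from collections import defaultdict

def favorite_track_scores(listening_log, decay=0.9):
    """Recency-weighted favorite-track scores: every play of a track adds
    a weight that decays with the play's age, so recently and frequently
    played tracks score highest."""
    fav_scores = defaultdict(float)
    for age, track in enumerate(reversed(listening_log)):
        fav_scores[track] += decay ** age  # age 0 = most recent play
    return dict(fav_scores)

log = ["a", "b", "a", "c", "a"]  # ordered oldest to newest
fav_scores = favorite_track_scores(log)
```

Track "a" dominates because it was played often and recently, while "c" outranks "b" purely on recency despite both being played once.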
Favorite artists. Just like favorite tracks, music enthusiasts also have favorite
artists. The idea here is to recommend not only tracks of the artists that the
user liked (played) in the past, but also to consider the popular tracks of
similar artists in the recommendation process. The relevance of a candidate track
then depends on its general popularity and the similarity between its artist and
the user's favorite artists. As a measure of similarity between two artists, one
can, e.g., simply count how often the two artists appear together in the users'
listening sessions or playlists, see Section 2.3.
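The co-occurrence counting mentioned here can be sketched in a few lines; the data and counting rule are purely illustrative.

```python
from collections import defaultdict
from itertools import combinations

def artist_cooccurrence(session_artists):
    """Count how often two artists appear together in the same listening
    session or playlist, as a simple artist-similarity signal."""
    co = defaultdict(int)
    for artists in session_artists:
        # distinct artists per session; sorted so (a, b) keys are canonical
        for a, b in combinations(sorted(set(artists)), 2):
            co[(a, b)] += 1
    return dict(co)

co = artist_cooccurrence([
    ["queen", "abba", "queen"],  # one artist per played track
    ["abba", "queen"],
    ["abba", "kiss"],
])
```

The resulting counts can be normalized (e.g., into a cosine or Jaccard measure) before being used as artist similarity in the relevance score.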
Topic similarity. The assumption of this personalization score is that some users
listen only to certain types of music, for example, mostly romantic songs or instru-
mental music. One way to determine the topic of a track is to use social tags that
are assigned to them on music platforms like Last.fm. In this context, a higher score
is assigned to tracks that are annotated with similar tags. The similarity can be
computed based on, for example, the cosine similarity between two TF-IDF encoded
track representations. In general, however, other musical features can also be used
to determine a content-based similarity of the tracks, see Section 2.1.2.
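A minimal sketch of this tag-based similarity, assuming the tags are available as plain lists per track; the TF-IDF weighting shown is one common variant, not necessarily the exact one used in the thesis.

```python
import math
from collections import Counter

def tfidf_vectors(track_tags):
    """Encode each track's social tags as a sparse TF-IDF vector (tag -> weight)."""
    n = len(track_tags)
    # Document frequency: in how many tracks each tag occurs.
    df = Counter(tag for tags in track_tags.values() for tag in set(tags))
    return {track: {t: tf * math.log(n / df[t]) for t, tf in Counter(tags).items()}
            for track, tags in track_tags.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

# Toy tag assignments, e.g., harvested from a platform like Last.fm.
vecs = tfidf_vectors({"t1": ["rock", "80s"],
                      "t2": ["rock", "pop"],
                      "t3": ["jazz"]})
```

Tracks sharing weighted tags (here `t1` and `t2`) obtain a positive similarity, while tracks with disjoint tags score zero.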
Online social friends. Finally, the last personalization approach that we considered
in [Jan+17a] takes the musical preferences of the user’s online social friends into
account, as their listening behavior can influence the user’s preferences. In our
experiments, we explored the value of recommending the favorite tracks of the user’s
Twitter friends and gave more weight to (popular) friends with more followers.
The performance of these personalization approaches was then evaluated using hand-
crafted playlist collections and listening-log datasets. To measure accuracy, we
computed the track hit rate of the approaches as follows. The data was split into
training and test sets, and the last track of each playlist or listening session in the test
set was hidden. The goal was to predict this last hidden track. A “hit” was counted
when the hidden track was in the top-n recommendation list of an algorithm. In
addition to accuracy, the diversity and coherence of the resulting recommendations
based on the artists – and where applicable based on the tags – were also assessed.
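The hit-rate protocol just described can be sketched as follows; `recommend` is a placeholder standing in for any next-track algorithm.

```python
def hit_rate(recommend, sessions, n=10):
    """Leave-one-out track hit rate: hide the last track of each playlist or
    session, ask the algorithm for a top-n list given the rest, and count a
    "hit" when the hidden track appears in that list."""
    hits = 0
    for session in sessions:
        history, hidden = session[:-1], session[-1]
        if hidden in recommend(history, n):
            hits += 1
    return hits / len(sessions)

# A trivial stand-in recommender that always suggests the same tracks.
recommend = lambda history, n: ["track-a", "track-b"][:n]
sessions = [["track-x", "track-a"], ["track-x", "track-c"]]
rate = hit_rate(recommend, sessions, n=2)
print(rate)  # the hidden track is found in 1 of 2 sessions -> 0.5
```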
Apart from numerous algorithmic contributions, a side effect of the Netflix Prize15
[Ben+07] was an enormous focus of researchers on accuracy measures like the mean
absolute error (MAE) or root mean squared error (RMSE) for evaluating the quality
of recommendation algorithms [Jan+12]. However, several recent works indicate
that optimizing accuracy alone can be insufficient in many real-world recommendation
scenarios and that there are other quality criteria that could affect the user's quality
perception of the recommendations [Cre+12; Jan+15c].
15 In 2006, Netflix released a dataset containing 100 million anonymous movie ratings of its customers
for a public challenge on improving the accuracy of its recommender system, Cinematch, by 10%.
2.6 Challenges
In the music domain, in particular, the approaches proposed in the research literature
are most often evaluated based on historical data and mainly aim to identify
tracks that users actually listened to, using performance measures like the mean
reciprocal rank (MRR), precision, or recall. Although the relevant quality criteria
for a playlist or listening session might vary based on the context or intent of the
listeners, some works have attempted to determine additional quality criteria that
are relevant to find the right next tracks. Examples of such quality factors are
artist diversity, homogeneity of musical features, or the transitions between the
tracks. Common approaches to determine such factors are to conduct user studies
[Kam+12a] or to analyze the characteristics of published user playlists [Sla+06;
Jan+14], see Section 3.1.
When quality factors other than prediction accuracy are considered in the recom-
mendation process, it can become necessary to find a trade-off as improving one
quality factor could impact another one negatively. Some works in the research
literature on recommender systems have also proposed approaches to deal with
such trade-off situations and to improve recommendations by considering additional
quality factors. For instance, the work presented in [Ado+12] tried to re-rank the
first n items of an accuracy-optimized list in a way that increases or balances diversity
across all users. Bradley et al. [Bra+01] and Ziegler et al. [Zie+05] also aimed to
optimize diversity, this time in terms of intra-list similarity, using techniques that
reorder the recommendations based on their dissimilarity to each other. Finally,
Zhang et al. [Zha+08] proposed a binary optimization approach to ensure a balance
between accuracy and diversity of the top-n recommendations.
The main shortcomings of the proposed approaches are that (1) they consider only
two quality factors, for example, accuracy versus diversity, and (2) they do not take
the tendencies of individual users into account. Some more recent works try to overcome
these limitations [Oh+11; Shi+12; Rib+14; Kap+15]. For instance, in [Kap+15] a
regression model is proposed to predict a user's need for novel items from her past
interactions in each recommendation session individually. These approaches also
have their limitations. For instance, the above-mentioned approach from [Kap+15]
is designed for only a specific quality factor, i.e., novelty. Furthermore, the proposed
balancing strategies are often integrated into specific algorithmic frameworks,
which makes such approaches difficult to reuse.
(Figure: the two-phase scheme — (1) a recommendation phase that generates a
recommendation list from the listening history, followed by (2) an optimization
phase that produces the optimized next-track recommendations.)
Recommendation phase. The goal of the first phase (the recommendation phase)
is to determine a relevance score for each possible next track t∗ given the current
listening history or the list of tracks added so far to a playlist (playlist beginning) h,
using different input signals. A weighted scoring scheme – similar to Equation 2.1 –
is then used to combine a baseline next-track recommending algorithm (e.g., kNN)
with additional suitability scores. In general, the combination of different scores
serves two purposes: (1) increasing the hit rate, as more relevant tracks receive a
higher aggregated score, and (2) making the next-track recommendations more
homogeneous.
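A minimal sketch of such a weighted scoring scheme; the score functions, names, and weights below are placeholders, and all scores are assumed to be normalized to [0, 1].

```python
def combined_score(history, track, baseline, suitability, weights):
    """Weighted combination of a baseline next-track relevance score (e.g.,
    from a kNN method) with additional suitability scores, in the spirit of
    the weighted scheme described in the text."""
    score = weights["baseline"] * baseline(history, track)
    for name, scorer in suitability.items():
        score += weights[name] * scorer(history, track)
    return score

baseline = lambda h, t: 0.5                 # stand-in kNN relevance score
suitability = {"tempo": lambda h, t: 0.8}   # stand-in tempo-homogeneity score
weights = {"baseline": 0.7, "tempo": 0.3}
score = combined_score([], "t1", baseline, suitability, weights)  # 0.7*0.5 + 0.3*0.8 ≈ 0.59
```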
Another suitability score in our experiments was based on musical features like
tempo or loudness, as well as the release year or popularity of the tracks. If we, for example,
detect that the homogeneity of the tempo of the tracks is most probably a guiding
quality criterion, we should give an extra relevance weight to tracks that are similar
in tempo to those in the history. We assumed that if a feature is relevant, the spread
of the values will be low. For instance, a low spread and variance of the tempo
values of the tracks in the listening history – e.g., in the range of 110 to 120 bpm –
indicates that the user generally prefers to listen to moderato music.
To be able to combine this signal with the baseline recommendation technique, the
Gaussian distribution of numerical features like tempo can be used as a suitability
score of a target track (t*). The mean (μ_h) and standard deviation (σ_h) are computed
from the distribution of the respective numerical feature in the history (h),
see Equation 2.3.

\[ \mathrm{score}_{\mathrm{feature}}(h, t^*) \;=\; \frac{1}{\sigma_h \sqrt{2\pi}}\, e^{-\frac{(f_{t^*} - \mu_h)^2}{2\sigma_h^2}} \qquad (2.3) \]
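Equation 2.3 can be implemented directly; the tempo values below are illustrative.

```python
import math
from statistics import mean, stdev

def feature_score(history_values, target_value):
    """Suitability score of Equation 2.3: the density of the target track's
    feature value (e.g., tempo in bpm) under a normal distribution whose
    mean and standard deviation are fitted to the values in the history."""
    mu, sigma = mean(history_values), stdev(history_values)
    return math.exp(-((target_value - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# A history whose tempo values cluster around 110-120 bpm, as in the example:
# a candidate near that range scores higher than a much faster track.
tempos = [110, 112, 115, 118, 120]
```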
Re-ranking scheme. Figure 2.5 illustrates the re-ranking scheme based on the artist
diversity problem. The elements in dotted lines in the listening history rectangle
represent the selected seed tracks and the different colors represent different artists.
The seed tracks can be taken from the set of the user’s last played tracks or be a
subset of the user’s favorite tracks. Based on the seed tracks, the user’s individual
tendency towards artist diversity can be computed. Again, different colors in the
generated next-track recommendation list represent different artists. Since the
top-10 tracks of the recommendation list have a lower artist diversity than the seed
tracks in the listening history, the algorithm then starts exchanging elements from
the top of the list with elements from the end of the list (“exchange list”), which
probably have a slightly lower predicted relevance but help to improve the diversity
of the top-10 list, i.e., to minimize the difference between the diversity level of the
top-10 list and the seed tracks. So if a user generally prefers lists with high diversity,
the re-ranking will lead to higher diversity. Vice versa, a user who usually listens to
various tracks from the same artist in a session will receive recommendations with a
lower artist diversity. As a result, the definition of a globally desired artist diversity
level can be avoided.
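A greedy sketch of this exchange-based re-ranking, assuming each track is a (track id, artist) pair; the acceptance rule and loop structure are illustrative, not the authors' exact procedure.

```python
def artist_diversity(tracks):
    """Share of distinct artists among the tracks (1.0 = all different)."""
    artists = [artist for _, artist in tracks]
    return len(set(artists)) / len(artists)

def rerank(recommendations, seed_tracks, k=10):
    """Swap tracks from the top-k list with tracks from the tail (the
    "exchange list") whenever a swap brings the top-k artist diversity
    closer to the diversity observed in the user's seed tracks."""
    target = artist_diversity(seed_tracks)
    ranked = list(recommendations)
    improved = True
    while improved:
        improved = False
        gap = abs(artist_diversity(ranked[:k]) - target)
        for i in range(k):
            for j in range(k, len(ranked)):
                ranked[i], ranked[j] = ranked[j], ranked[i]
                new_gap = abs(artist_diversity(ranked[:k]) - target)
                if new_gap < gap:
                    gap, improved = new_gap, True
                else:
                    ranked[i], ranked[j] = ranked[j], ranked[i]  # undo the swap
    return ranked

seed_tracks = [("s1", "A"), ("s2", "B")]                       # diverse seed
recommendations = [("r1", "X"), ("r2", "X"), ("r3", "Y"), ("r4", "Z")]
reranked = rerank(recommendations, seed_tracks, k=2)
```

A full implementation would additionally bound the relevance loss of each swap so that only tracks with a slightly lower predicted relevance are moved up.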
3 Evaluation of Next-Track Recommendations
Bonnin et al. [Bon+14] discuss two approaches to determine quality criteria for next-
track music recommendations (in the context of the playlist continuation problem):
(1) analyzing the characteristics of user playlists, and (2) conducting user studies.
An intuitive way to learn how to select a good track to be played next is to look at the
characteristics of the tracks that have been selected by real users in, e.g., previous
listening sessions or playlists. For instance, to determine the general principles for
designing a next-track recommender for the task of automatic playlist continuation,
it will be helpful to analyze playlists that are created and shared by users, assuming
that such hand-crafted playlists have been created carefully and are of good quality
[Bon+14].
In the research literature, Slaney et al. [Sla+06], for example, investigated whether
users prefer to create homogeneous or rather diverse playlists based on genre
information about the tracks. In the study, they analyzed 887 manually created
playlists. The results showed that users’ playlists usually contain several genres; the
authors therefore concluded that genre diversity is a relevant feature for users. In another
work, Sarroff et al. [Sar+12] focused on track transitions and examined the first 5
songs of about 8,500 commercial albums for latent structures. The results of two
feature selection experiments using a Gaussian mixture model and a data filtering
technique showed that fade durations and the mean timbre of song endings and
beginnings are the most discriminative features of consecutive songs in an album.
Similar to [Sla+06] and [Sar+12], in [Jan+14], which is one of the papers included
in this thesis by publication, we analyzed a relatively large set of manually created
playlists that were shared by users. Our primary goal in this work was to obtain
insights on the principles that a next-track recommendation algorithm should con-
sider to deliver better or more “natural” playlist continuations. We used samples of
hand-crafted playlists from three different sources including last.fm, artofthemix.org
and 8tracks.com. Overall, we analyzed 10,000 playlists containing about 108,000
distinct tracks of about 40,000 different artists. Using the public APIs of Last.fm,
The Echo Nest16 , and the MusicBrainz database17 , we first retrieved additional
information about audio features like the tempo, energy, and loudness of the tracks
as well as their play counts and social tags assigned to them by users.
The goal of the first analysis in [Jan+14] was to determine the users' tendency
towards popular tracks. As a measure of popularity, we considered the total number
of times a track was played on Last.fm. The results showed that users actually
include more popular tracks (in terms of play count) at the beginning of their playlists
in all datasets. Moreover, to quantify concentration biases in the user playlists,
we calculated the Gini index of the inequality among the catalog items.
The Gini index revealed that the tracks in the playlist beginnings are selected from
smaller sets of tracks and that the diversity slightly increases toward the end of the playlists.
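The Gini index used in this analysis can be computed from per-track counts as follows (a standard formulation, shown here for illustration).

```python
def gini(counts):
    """Gini index of a list of per-track counts: 0 means all tracks are chosen
    equally often, values near 1 indicate a strong concentration on few tracks."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    # Standard rank-based formula over the ascending-sorted values.
    return (2 * sum(i * x for i, x in enumerate(xs, 1)) - (n + 1) * total) / (n * total)
```

For example, a perfectly even selection yields 0, while a catalog where one track absorbs all plays yields a value close to 1.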
Next, we analyzed to what extent the users’ playlists contain recently released tracks.
We compared the creation year of each playlist with the average release year of
its tracks. The results showed that users include relatively fresh tracks (that were
released on average in the last 5 years) in their playlists. Furthermore, our results
revealed that the tracks of a user playlist are often homogeneous with respect to the
release date.
16 http://the.echonest.com/
17 https://musicbrainz.org/
• User studies are time-consuming and expensive. The participants have to listen
to a number of tracks during such experiments, which can take a long time.
• Academic user studies often have a limited size. User studies conducted in the
music domain in academia often involve only 10 to 20 participants in total,
see, e.g., [Swe+02; Cun+06; Bau+10; Lam+10; Stu+11]; or [Tin+17]. This
makes it difficult to generalize the findings of such studies.
Most of the user studies in the music domain focus on how users search for music and
on the social or contextual aspects of listening to music [Dow03; Lee+04; Cun+06;
Cun+07; Lam+10]. Few studies also analyze the factors that could influence the
selection of next-track recommendations by users. For instance, in the context of
playlist generation, Stumpf et al. [Stu+11] presented the results of a user study on
how users create playlists in different listening contexts. Analyzing the interactions
of 7 participants with a playlisting tool (iTunes) in think-aloud sessions indicated
that in more personal use cases like private travel, mood was selected as the most
relevant feature for users, whereas in more public situations like a large party or a small
gathering, the rhythmic quality of songs was selected as the most important feature.
Moreover, tempo and genre were identified as context-independent features that were
considered equally important in all the examined contexts.
In the context of this thesis, we conducted a between-subjects user study involving 123
subjects to, among other things, determine the relevant quality criteria for playlists. The
findings of this study could help us better understand which quality characteristics
should be considered when designing next-track recommending algorithms for
playlist construction support. In the following, we present this user study in detail.
Study design. We developed a web application for the purpose of this user study.
Using this application, the participants were asked to create a playlist with one of the
pre-defined themes including rock night, road trip, chill out, dance party, and hip hop
club. After choosing a topic, the participants were forwarded to the playlist creation
page. All participants could use the provided search functionality to look for their
favorite tracks or artists. However, to analyze the effect of automated next-track
recommendations on the playlist creation behavior of users, the participants of the
experimental group (Rec) received additional recommendations as shown at the
bottom of Figure 3.1. The control group (NoRec) was shown the same interface but
without the recommendations bar at the bottom.
Both the search and the recommendation functionality were implemented using
the public API of Spotify18 , which allowed us to rely on industry-strength search
and recommendation technology. When the playlist contained at least six tracks,
the participants could proceed to the post-task questionnaire, in which they should
accomplish the following tasks.
18 https://developer.spotify.com/web-api/
2. In the next step, participants of the experimental group (Rec) who were
provided with recommendations were asked if they had looked at the recom-
mendations during the task and if so, how they assessed their quality in terms
of relevance, novelty, accuracy, diversity (in terms of genre and artist), famil-
iarity, popularity, and freshness. Participants could express their agreement
with the provided statements, e.g., “The recommendations were novel”, on a
7-point Likert item or state that they could not tell, see Figure 3.2(b).
3. In the final step, all participants were asked (1) how often they create playlists,
(2) about their musical expertise, and (3) how difficult they found the playlist
creation task, again using 7-point Likert items. Free text form fields were
provided for users to specify which part of the process they considered the
most difficult one and for general comments and feedback, see Figure 3.2(c).
The user study ended with questions about the age group and email address of
the participants.
General statistics. In the end, 123 participants (mainly students, aged between
20 and 40) completed the study. Based on the self-reported values, the participants
considered themselves experienced or interested in music. However, they found
the playlist creation task comparably difficult. 57% of the participants were assigned
to the Rec group (with recommendations). Almost half of these participants (49%)
dragged and dropped at least one of the recommended tracks into their playlists. We
denote this group as RecUsed; the other half will be denoted as RecNotUsed.
Study outcomes. Considering the topic of this section, first the relevant quality
criteria that were determined by the subjects of the study are introduced. Afterwards,
the observations on the effect of next-track recommendations on the playlist creation
behavior of users and on the resulting playlists are briefly summarized.
Investigating quality criteria for playlists. To determine the overall ranking of quality
criteria based on the responses of the participants, we used a variant of the Borda
Count rank aggregation strategy designed for the aggregation of partial rankings,
called Modified Borda Count (MBC) [Eme13], as the criteria could also be marked
as irrelevant.
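A sketch of the Modified Borda Count for partial rankings, following the idea in [Eme13] that when a respondent ranks only m items, the top one receives m points, the next m−1, and so on, while unranked (irrelevant) items receive none; the exact variant used in the study may differ in detail.

```python
from collections import defaultdict

def modified_borda(partial_rankings):
    """Aggregate partial rankings (each an ordered list, best first) into
    per-item scores; items a respondent left unranked contribute nothing."""
    scores = defaultdict(int)
    for ranking in partial_rankings:
        m = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += m - position
    return dict(scores)

# One respondent ranks two criteria, another ranks only one.
scores = modified_borda([["homogeneity", "diversity"], ["homogeneity"]])
print(scores)  # homogeneity: 2 + 1 = 3, diversity: 1
```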
(b) Task 2 (only for the participants of the experimental group): Have you looked at the
recommendations? If yes, how do you assess their quality?
Figure 3.2: The questionnaire of the user study. Note that the screen captures in section (b)
and (c) illustrate only the beginning (the first question) of the respective tasks.
The results of the overall ranking are shown in Table 3.1. Some of the interesting
observations can be summarized as follows.
• Homogeneity of musical features like tempo or loudness along with the artist
diversity of tracks were considered as the most relevant quality criteria.
• The order of the tracks in a playlist and their freshness appeared to be less
relevant for the participants.
• The participants who used the recommendations considered transitions a less
relevant criterion than the participants who did not use any recommendations.
One explanation might be that using recommendations reduces the effort
needed to craft the transitions between the tracks, so users pay less
attention to them.
• For road-trip and hip-hop playlists, the lyrics aspect was more important.
• Popularity was considered a very important criterion only for dance playlists.
We further analyzed the collected logs and the musical features of the resulting
playlists to obtain a better understanding of the effects of next-track recommenda-
tions on the playlist creation behavior of users. Although the observations that are
presented in the following are not directly related to the topic of this section, they
contain interesting insights on the perception and adoption of next-track recommen-
dations, which can be relevant for evaluating music recommender systems.
19 In this study, we used the Mann-Whitney U test and the Student’s t-test – both with p < 0.05 – to
test for statistical significance for the ordinal data and the interval data, respectively.
20 For a detailed description of the audio features listed in Table 3.2, see
https://developer.spotify.com/web-api/get-audio-features/
Having discussed the determination of relevant quality criteria for next-track music
recommendations, this section will focus on assessing the performance of next-track
music recommendation algorithms and present the related work done in the context
of this thesis.
Human evaluation refers to user studies in which participants rate the quality of
playlists generated by one or more algorithms in different dimensions, e.g., the
perceived quality, diversity, or the transition between the tracks. Direct human
evaluations are in principle expensive to conduct and it is also difficult to reproduce
their results, see Section 3.1.2.
Evaluation approaches based on semantic cohesion determine the quality of
a generated playlist by measuring how similar the tracks in the playlist are. Different
similarity measures like the co-occurrence counts of track metadata, e.g., artists
[Log02; Log04], entropy of the distribution of genres within the playlist [Kne+06;
Dop+08], or the distance between latent topic models of playlists [Fie+10] have
been proposed in the literature. The similarity of the tracks, however, may not
always be a good (or at least the only) quality criterion in real-world scenarios
[Sla+06; Lee+11; Kam+12a].
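As an illustration, the genre-entropy measure mentioned above can be computed as follows.

```python
import math
from collections import Counter

def genre_entropy(genres):
    """Shannon entropy of the genre distribution within a playlist, one of the
    semantic-cohesion measures from the literature: 0 for a single-genre
    playlist, higher values for a more mixed one."""
    counts = Counter(genres)
    total = len(genres)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A playlist evenly split between two genres reaches an entropy of 1 bit, while a single-genre playlist scores 0.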
The third group of evaluation approaches that were presented in [McF+11] relies
on information retrieval (IR) measures. In these approaches, a playlist generation
algorithm is evaluated based on its predictions. A prediction is successful only if the
predicted next track matches an observed co-occurrence in the ground truth set, e.g.,
the available user playlists [Pla+01; Mai+09]. In this evaluation setting, a prediction
that might be interesting for the user but does not match the co-occurrences in the
ground truth set will be considered as a false positive. This group of evaluation
approaches will be discussed later in more detail in Section 3.2.2.
In a more recent work, Bonnin et al. [Bon+14] proposed to categorize the evaluation
approaches to the playlist continuation problem into the four more general groups of
(1) log analysis, (2) objective measures, (3) comparison with hand-crafted playlists,
and (4) user studies. The following sections discuss these four approaches.
Music platforms like Spotify analyze the listening logs that they collect, for example,
through A/B tests to better understand the listening behavior of their users. Among
others, these logs can be used to evaluate the acceptance of the recommendations.
Although conducting field tests with real users in academia is usually not possible,
there are different data sources from which information about the users’ listening
behavior can be obtained. For instance, listening logs of Last.fm users can be accessed
through the platform’s public API22 [Bon+14].
In addition, there are some public listening logs datasets like the #nowplaying dataset
[Pic+15], which contains information about the listening sessions collected from
music-related tweets on Twitter or the 30Music dataset [Tur+15], which contains
listening sessions retrieved from Internet radio stations through the Last.fm API. In
[Jan+17a], we used subsets of these two datasets to explore the value of repeated
recommendations of known tracks (see Section 2.6.1). Moreover, in [Kam+17a],
we used, among others, the music listening sessions to evaluate the quality of the
next-track recommendations of different session-based recommending algorithms
(see Section 2.5). One advantage of using listening logs for the evaluation task is
the reproducibility of the results [Bon+14].
Schedl et al. [Sch+17b] review the most frequently reported evaluation measures
in the academic literature. They differentiate between accuracy-related and beyond-
accuracy measures. For example, mean absolute error (MAE) and root mean square
error (RMSE), which indicate the prediction error of a recommendation algorithm,
or precision and recall, which measure the relevance of the recommendations, have
been applied in the field of recommender systems to evaluate accuracy. On the other
hand, novelty, which measures the ability of recommender systems to help users
discover new items, and serendipity, which measures how unexpected the novel
recommendations are, are examples of beyond-accuracy measures.
22 https://www.last.fm/api
Figure 3.3: The evaluation protocol proposed in [Jan+16]. Each user playlist is split
into a seed half and a test half; the recommender generates a playlist continuation
(next-track recommendations) from the seed half, which is then compared with the
test half in the evaluation.
1. To what extent can different algorithms recommend (1) the right tracks, (2)
the relevant artists, (3) the correct genres, and (4) the tracks with suitable tags?
2. To what extent can the algorithms produce continuations that are coherent
with the playlist beginnings in terms of different musical features?
Concerning the first question, which mainly deals with the accuracy performance of
the algorithms, an interesting observation was that the commercial recommender
led, in most cases, to the lowest precision and recall values. This can be an indication
that the playlists that are generated by the commercial service are not necessarily
(exclusively) optimized for precision or recall and that also other criteria govern
the track selection process. Another observation from the accuracy results was
that the comparably simple CAGH method led to competitive accuracy results,
especially in cases where the goal is to play music of similar artists or related genres,
or to find tracks that are similar in terms of their social tags.
To answer the second question regarding the coherence of the generated contin-
uations of the algorithms, we looked at the mean and distributions of the feature
values in the seed halves, test halves, and the generated continuations. In general,
our analyses showed that users prefer playlist continuations (test halves) that are
coherent with the first halves, however, for some features, like tempo or release
years, all recommenders were able to mimic the user’s behavior and for some other
features, like popularity or loudness, the algorithms showed strong biases. More
precisely, the explored academic recommenders focused on more popular tracks
and the variability of the generated continuations of these algorithms in terms of
popularity was higher than in the user-created test halves. In contrast, the commercial
service recommended less popular tracks and reduced the loudness and popularity
diversity in the generated continuations more than the users did.
In Section 3.1.2, we discussed that one reliable way to determine the relevant
quality criteria for next-track recommendations is to conduct user studies. User
studies can also be applied as an evaluation approach for the quality of next-track
recommendations. For instance, Barrington et al. [Bar+09] compared Apple’s
Genius collaborative filtering based system with a recommendation method based
on artist similarity in a user study. In their experimental setting, they hid the
artist and track information in one condition and displayed this information in
another. An interesting insight was that the recommendations of the Genius system
were perceived better when the information was hidden, whereas the artist-based
recommendations were selected as the better ones in the other case.
Lessons learned from offline experiments. Our main goal was to validate some
of the insights that were obtained from offline experiments by utilizing an online
experiment with real users. Specifically, the following offline observations from
[Har+12; Bon+14; Jan+15a; Jan+16]; and [Jan+17a] were selected to be tested
in our user study.
In the first step, the participants had to listen to four tracks of a selected playlist. To
minimize the familiarity bias, information about the artists and tracks were hidden,
see Figure 3.5(a). When the participants had listened to the tracks, they had to
answer five questions about the emotion, energy, topic, genre, and tempo of the tracks.
Using 7-point Likert items, the participants should state how similar the tracks of
the playlist were in each of these dimensions, see Figure 3.5(b). Next, the
participants were presented with four alternative continuations for the given playlist
from task 1. The recommended next tracks were also anonymized and displayed
in randomized order across participants to avoid any order bias. The participants
should state how well each track matches the playlist as its next track and indicate
whether they already knew the track.
In each trial, one playlist was randomly assigned to the participants. One main
question when designing the study was how to select the seed playlists for which
we wanted to evaluate the next-track recommendations. Since we aimed to analyze
whether hand-crafted playlists are suitable for evaluating the recommending
techniques, we chose five hand-crafted playlists. To assess if the choice of the most
suitable next track is influenced by certain characteristics of the playlists, we selected
these playlists in a way that each one was very homogeneous in one dimension.
1. Topic-playlist. This playlist was organized around the topic Christmas with pop
songs from the 70s and 80s (Table 3.3, section (1)).
2. Genre-playlist. This playlist contained tracks of the genre “soul” (Table 3.3,
section (2)).
3. Mood-playlist. This playlist included tracks with romantic lyrics (Table 3.3,
section (3)).
(b) Task 2: Determine the similarity of the tracks of the given playlist.
(c) Task 3: Evaluate the suitability of each alternative next track for the given playlist.
Figure 3.5: The tasks of the user study in [Kam+17b]. Note that sections (b) and (c) of this
figure show only the beginning of the respective tasks.
(1) Topic-Playlist
Title Artist Top Tags
Track #1 Do They Know It’s Christmas Band Aid Xmas, 80s, Pop, Rock, . . .
Track #2 Happy Xmas (War Is Over) John Lennon Xmas, Rock, Pop, 70s, . . .
Track #3 Thank God It’s Christmas Queen Xmas, Rock, 80s, . . .
Track #4 Driving Home For Christmas Chris Rea Xmas, Rock, Pop, 80s, . . .
Hidden Track White Christmas Bing Crosby Xmas, Oldies, Jazz, . . .
CAGH Bohemian Rhapsody Queen Rock, Epic, British, 70s, . . .
kNN Santa Baby Eartha Kitt Xmas, Jazz, 50s, . . .
kNN+X Step Into Christmas Elton John Xmas, Pop, Piano, 70s, . . .
(2) Genre-Playlist
Title Artist Artist Genres
Track #1 The Dark End Of The Street James Carr Soul, Motown, Soul Blues, . . .
Track #2 I Can’t Stand The Rain Ann Peebles Soul, Motown, Soul Blues, . . .
Track #3 Because Of You Jackie Wilson Soul, Motown, Soul Blues, . . .
Track #4 Mustang Sally Wilson Pickett Soul, Motown, Soul Blues, . . .
Hidden Track Cigarettes And Coffee Otis Redding Soul, Motown, Soul Blues, . . .
CAGH In The Midnight Hour Wilson Pickett Soul, Motown, Soul Blues, . . .
kNN Ever Fallen In Love Thea Gilmore Folk-Pop, New Wave Pop, . . .
kNN+X I Can’t Get Next To You Al Green Soul, Motown, Soul Blues, . . .
(3) Mood-Playlist
Title Artist Mood
Track #1 Memory Motel The Rolling Stones Romantic
Track #2 Harvest Moon Neil Young Romantic
Track #3 Full Of Grace Sarah McLachlan Romantic
Track #4 Shiver Coldplay Romantic
Hidden Track Beast Of Burden The Rolling Stones Romantic
CAGH Yellow Coldplay Romantic
kNN Here It Comes Doves Calm
kNN+X Twilight Elliott Smith Romantic
Considering the offline insights that were discussed earlier in this section, the four
alternative tracks to be played next in each trial were selected using the following
approaches.
1. Hidden Track. In each trial, we presented the first four tracks of the chosen
hand-crafted playlist to the participants. One alternative to continue this
playlist was the actual fifth track of the playlist that was originally chosen by
the playlist creator which is referred to as “hidden track” in the experiment.
2. Borda count. We applied the Borda count measure [Eme13] to aggregate the
rankings of all four alternatives. The responses provided by the participants
were used here as implicit ranking information.
Furthermore, to investigate to what extent familiarity aspects may affect the results,
in addition to considering the rankings of all trials, we also reported the results for
only those trials in which the participants explicitly indicated that they did not know
the track that they selected as the most suitable track, i.e., 70% of all trials. We refer
to the former configuration setting as “All Tracks” configuration and to the latter
setting as “Novel Tracks” configuration in our experiment. Table 3.4 summarizes the
overall ranking results.
Table 3.4: Overall ranking results of the next-track recommending techniques with respect
to the users’ quality perception based on winning frequency (WF) and Borda
count (BC) [Kam+17b].
Results. Several observations from offline studies were reproduced in this user
study. With respect to the insights obtained from the offline experiments mentioned
earlier in this section, we categorized our observations into the following four
groups.
2. Users prefer recommendations that are more coherent with their recently played
tracks. The recommendations of the hybrid method, which are more coherent
with the recently played tracks in terms of the dominating characteristic of the
seed playlist, were perceived as significantly more suitable than recommendations
that are based only on track co-occurrence patterns, in both configurations and
on both measures.
It should also be noted that the ranking of the algorithms can deviate from the overall
ranking results when only the trials with a particular playlist are considered. For instance,
while the kNN+X recommendations were generally ranked higher than those of the
kNN method, for the tempo-oriented playlist the kNN and CAGH methods were, on
average, ranked higher than the kNN+X method. This could be interpreted as tempo
being less relevant than other characteristics such as artist homogeneity, which was
also observed in [Jan+14], see Section 3.2.3.
4.1 Summary
The first chapter of this thesis presented a brief history of music along with a
general characterization of the music recommendation problem. The next-track
recommendation scenario was then introduced. Moreover, the research questions
that this thesis aimed to answer were categorized and briefly discussed.
The algorithmic approaches that have been proposed for next-track recommendation
in the research literature were reviewed in the second chapter of this thesis. Particularly,
content-based filtering approaches, collaborative filtering methods, frequent
pattern mining techniques, and sequence-aware algorithms were discussed, and a
number of published works on each topic were introduced. Afterwards, the results of
a comparison of several of these approaches along different dimensions, such as accuracy,
popularity bias, and computational complexity, which was conducted in the context of
this thesis, were presented.
The evaluation of next-track recommendations was the topic of the third chapter of
this thesis. A critical question in this regard is how to determine quality criteria for
next-track recommendations. One way to do this is to analyze the characteristics
of playlists that are created and shared by users, based on musical and metadata
features of the tracks. An experimental analysis of 10,000 hand-crafted playlists
in [Jan+14], for instance, revealed that features like freshness, popularity, and
homogeneity of the tracks are relevant for users. The insights from such analyses
should help researchers design algorithms that recommend more natural next tracks.
Another way to determine the relevant quality criteria is to conduct user studies. As
an example, a user study that was conducted recently in the context of this thesis was
presented in this chapter. The findings of this study, which involved 123 subjects,
indicated that the homogeneity of musical features, such as tempo and energy, along
with artist diversity, are important characteristics of playlists and should be considered
when recommending next tracks, e.g., for supporting playlist construction.
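Such criteria can be operationalized with simple descriptive measures. The following Python sketch is purely illustrative and not code from the thesis; the playlist and its BPM values are invented. It computes tempo homogeneity as the coefficient of variation of the track tempi (lower means more homogeneous) and artist diversity as the share of distinct artists in the playlist:

```python
import statistics

def tempo_homogeneity(tempos):
    """Coefficient of variation of the track tempi (BPM);
    lower values indicate a more tempo-homogeneous playlist."""
    return statistics.pstdev(tempos) / statistics.mean(tempos)

def artist_diversity(artists):
    """Share of distinct artists among the playlist's tracks (0..1)."""
    return len(set(artists)) / len(artists)

# Invented playlist: (title, artist, tempo in BPM).
playlist = [
    ("Memory Motel", "The Rolling Stones", 118.0),
    ("Harvest Moon", "Neil Young", 112.0),
    ("Full Of Grace", "Sarah McLachlan", 120.0),
    ("Shiver", "Coldplay", 122.0),
]
tempos = [t for _, _, t in playlist]
artists = [a for _, a, _ in playlist]
print(round(tempo_homogeneity(tempos), 3))  # 0.032 -> fairly homogeneous
print(artist_diversity(artists))            # 1.0 -> four distinct artists
```

Comparable aggregate features over a seed playlist could then serve as inputs when scoring candidate next tracks.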
4.2 Perspectives
The first aspect relates to psychological factors. Schedl et al. [Sch+17b] argue that,
despite the indicated effect of personality and emotion on music tastes [Fer+15; Sch+17a],
“psychologically-inspired” music recommender systems have not been investigated
to a large extent so far.
Another aspect that can affect the future of personalized music recommender systems
is the incorporation of situational signals into the recommendation process. Although
several academic works have explored the value of situational information, such as
location or time of day, in music recommender systems [Bal+11; Wan+12;
Kam+13; Che+14], such signals have not yet been integrated into large-scale
commercial systems.
The last research perspective for music recommender systems that was discussed
in Schedl et al. [Sch+17b] relates to cultural aspects like language, religion, or
history. The idea is to study the impact of cultural backgrounds and differences on
the listening behavior of users, as done for instance in Schedl [Sch17], and to build
cultural user models that can be integrated into recommender systems.
The publications included in this thesis utilized different musical features of the
tracks to infer the underlying theme of a playlist or listening session as a basis for
generating or evaluating next-track recommendations. The selection of the musical
features was, however, limited to publicly available data. A future work in this
regard would be to acquire and exploit additional information about the tracks and
artists that could help to reach a better understanding of the desired characteristics
of the seed tracks and to enhance the quality of next-track recommendations.
Bibliography
[Agr+95] Rakesh Agrawal and Ramakrishnan Srikant. “Mining Sequential Patterns”. In:
Proceedings of the Eleventh International Conference on Data Engineering. ICDE
’95. 1995, pp. 3–14 (cit. on p. 20).
[Aiz+12] Natalie Aizenberg, Yehuda Koren, and Oren Somekh. “Build Your Own Music
Recommender by Modeling Internet Radio Streams”. In: Proceedings of the 21st
International Conference on World Wide Web. 2012, pp. 1–10 (cit. on p. 16).
[And+14] Ashton Anderson, Ravi Kumar, Andrew Tomkins, and Sergei Vassilvitskii. “The
Dynamics of Repeat Consumption”. In: Proceedings of the 23rd International
Conference on World Wide Web. WWW ’14. 2014, pp. 419–430 (cit. on p. 26).
[Bal+11] Linas Baltrunas, Marius Kaminskas, Bernd Ludwig, Omar Moling, Francesco
Ricci, Aykan Aydin, et al. “InCarMusic: Context-Aware Music Recommendations
in a Car”. In: E-Commerce and Web Technologies (2011), pp. 89–100 (cit. on
p. 57).
[Ban+16] Trapit Bansal, David Belanger, and Andrew McCallum. “Ask the GRU: Multi-
task Learning for Deep Text Recommendations”. In: Proceedings of the 10th ACM
Conference on Recommender Systems. RecSys ’16. 2016, pp. 107–114 (cit. on
p. 20).
[Bar+09] Luke Barrington, Reid Oda, and Gert R. G. Lanckriet. “Smarter than Genius?
Human Evaluation of Music Recommender Systems”. In: Proceedings of the
10th International Society for Music Information Retrieval Conference. 2009,
pp. 357–362 (cit. on p. 47).
[Bau+10] Dominikus Baur, Sebastian Boring, and Andreas Butz. “Rush: Repeated Rec-
ommendations on Mobile Devices”. In: Proceedings of the 15th International
Conference on Intelligent User Interfaces. IUI ’10. 2010, pp. 91–100 (cit. on
p. 36).
[Ben+07] James Bennett, Stan Lanning, et al. “The Netflix Prize”. In: Proceedings of KDD
Cup and Workshop. 2007, p. 35 (cit. on p. 27).
[Ben09] Yoshua Bengio. “Learning Deep Architectures for AI”. In: Foundations and trends
in Machine Learning 2.1 (2009), pp. 1–127 (cit. on p. 16).
[Blu+99] T.L. Blum, D.F. Keislar, J.A. Wheaton, and E.H. Wold. Method and Article of
Manufacture for Content-Based Analysis, Storage, Retrieval, and Segmentation of
Audio Information. US Patent 5,918,223. 1999 (cit. on p. 16).
[Bog+10] Dmitry Bogdanov, M. Haro, Ferdinand Fuhrmann, Emilia Gómez, and Perfecto
Herrera. “Content-Based Music Recommendation Based on User Preference
Examples”. In: The 4th ACM Conference on Recommender Systems. Workshop on
Music Recommendation and Discovery. 2010 (cit. on pp. 15, 16).
[Bog+11] Dmitry Bogdanov and Perfecto Herrera. “How Much Metadata Do We Need
in Music Recommendation? A Subjective Evaluation Using Preference Sets.”
In: Conference of the International Society for Music Information Retrieval. 2011,
pp. 97–102 (cit. on p. 16).
[Bon+13] Geoffray Bonnin and Dietmar Jannach. “Evaluating the Quality of Generated
Playlists Based on Hand-Crafted Samples”. In: Proceedings of the 14th Interna-
tional Society for Music Information Retrieval Conference. 2013, pp. 263–268
(cit. on p. 45).
[Bra+01] Keith Bradley and Barry Smyth. “Improving Recommendation Diversity”. In:
Proceedings of the Twelfth Irish Conference on Artificial Intelligence and Cognitive
Science. 2001, pp. 85–94 (cit. on p. 28).
[Bud+12] Karan Kumar Budhraja, Ashutosh Singh, Gautav Dubey, and Arun Khosla.
“Probability Based Playlist Generation Based on Music Similarity and User
Customization”. In: National Conference on Computing and Communication
Systems. 2012, pp. 1–5 (cit. on p. 17).
[Bur02] Robin Burke. “Hybrid Recommender Systems: Survey and Experiments”. In:
User Modeling and User-Adapted Interaction 12.4 (Nov. 2002), pp. 331–370
(cit. on p. 18).
[Can+04] Pedro Cano and Markus Koppenberger. “The Emergence of Complex Network
Patterns in Music Artist Networks”. In: Proceedings of the 5th International
Symposium on Music Information Retrieval. 2004, pp. 466–469 (cit. on p. 18).
[Cas+08] Michael A. Casey, Remco Veltkamp, Masataka Goto, Marc Leman, Christophe
Rhodes, and Malcom Slaney. “Content-Based Music Information Retrieval:
Current Directions and Future Challenges”. In: Proceedings of the IEEE 96.4
(2008), pp. 668–696 (cit. on pp. 5, 16).
[Cel08] Òscar Celma. “Music Recommendation and Discovery in the Long Tail”. PhD
thesis. Barcelona: Universitat Pompeu Fabra, 2008 (cit. on pp. 17, 18).
[Cel10] Òscar Celma. Music Recommendation and Discovery - The Long Tail, Long Fail,
and Long Play in the Digital Music Space. Springer, 2010 (cit. on p. 8).
[Che+12] Shuo Chen, Josh L. Moore, Douglas Turnbull, and Thorsten Joachims. “Playlist
Prediction via Metric Embedding”. In: Proceedings of the 18th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. KDD ’12.
2012, pp. 714–722 (cit. on pp. 15, 25, 45).
[Che+16] Chih-Ming Chen, Ming-Feng Tsai, Yu-Ching Lin, and Yi-Hsuan Yang. “Query-
based Music Recommendations via Preference Embedding”. In: Proceedings of
the 10th ACM Conference on Recommender Systems. RecSys ’16. 2016, pp. 79–82
(cit. on p. 5).
[Cli06] Dave Cliff. “hpDJ: An Automated DJ with Floorshow Feedback”. In: Consuming
Music Together: Social and Collaborative Aspects of Music Consumption Technolo-
gies. Ed. by Kenton O’Hara and Barry Brown. Dordrecht: Springer Netherlands,
2006, pp. 241–264 (cit. on p. 6).
[Coe+13] Filipe Coelho, José Devezas, and Cristina Ribeiro. “Large-scale Crossmedia
Retrieval for Playlist Generation and Song Discovery”. In: Proceedings of the
10th Conference on Open Research Areas in Information Retrieval. OAIR ’13.
2013, pp. 61–64 (cit. on p. 16).
[Cov+16] Paul Covington, Jay Adams, and Emre Sargin. “Deep Neural Networks for
YouTube Recommendations”. In: Proceedings of the 10th ACM Conference on
Recommender Systems. RecSys ’16. 2016, pp. 191–198 (cit. on p. 20).
[Cre+11] Paolo Cremonesi, Franca Garzotto, Sara Negro, Alessandro Vittorio Papadopou-
los, and Roberto Turrin. “Looking for “Good” Recommendations: A Compara-
tive Evaluation of Recommender Systems”. In: Human-Computer Interaction
– INTERACT 2011: 13th IFIP TC 13 International Conference, Lisbon, Portugal,
September 5-9, 2011, Proceedings, Part III. Ed. by Pedro Campos, Nicholas Gra-
ham, Joaquim Jorge, Nuno Nunes, Philippe Palanque, and Marco Winckler.
Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 152–168 (cit. on
p. 44).
[Cre+12] Paolo Cremonesi, Franca Garzotto, and Roberto Turrin. “Investigating the
Persuasion Potential of Recommender Systems from a Quality Perspective: An
Empirical Study”. In: ACM Transactions on Interactive Intelligent Systems 2.2
(June 2012), 11:1–11:41 (cit. on p. 27).
[Cun+06] Sally Jo Cunningham, David Bainbridge, and Annette Falconer. “More of an Art
than a Science: Supporting the Creation of Playlists and Mixes”. In: Proceedings
of 7th International Conference on Music Information Retrieval. 2006, pp. 240–
245 (cit. on pp. 35, 36).
[Cun+07] Sally Jo Cunningham, David Bainbridge, and Dana McKay. “Finding New Music:
A Diary Study of Everyday Encounters with Novel Songs”. In: Proceedings of the
8th International Conference on Music Information Retrieval. 2007, pp. 83–88
(cit. on p. 36).
[Die+14] Sande Dieleman and Benjamin Schrauwen. “End-to-End Learning for Music
Audio”. In: 2014 IEEE International Conference on Acoustics, Speech and Signal
Processing. 2014, pp. 6964–6968 (cit. on p. 16).
[Dop+08] Markus Dopler, Markus Schedl, Tim Pohle, and Peter Knees. “Accessing Music
Collections Via Representative Cluster Prototypes in a Hierarchical Organization
Scheme”. In: Conference of the International Society for Music Information
Retrieval. 2008, pp. 179–184 (cit. on p. 43).
[Dow03] J. Stephen Downie. “Music Information Retrieval”. In: Annual Review of Infor-
mation Science and Technology 37.1 (2003), pp. 295–340 (cit. on p. 36).
[Eke+00] Robert B. Ekelund Jr, George S. Ford, and Thomas Koutsky. “Market Power in
Radio Markets: An Empirical Analysis of Local and National Concentration”.
In: The Journal of Law and Economics 43.1 (2000), pp. 157–184 (cit. on p. 6).
[Elk+15] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. “A Multi-View Deep Learn-
ing Approach for Cross Domain User Modeling in Recommendation Systems”.
In: Proceedings of the 24th International Conference on World Wide Web. WWW
’15. 2015, pp. 278–288 (cit. on p. 20).
[Eme13] Peter Emerson. “The Original Borda Count and Partial Voting”. In: Social Choice
and Welfare 40.2 (2013), pp. 353–358 (cit. on pp. 38, 53).
[Fer+15] Bruce Ferwerda, Markus Schedl, and Marko Tkalcic. “Personality & Emotional
States: Understanding Users’ Music Listening Needs”. In: Posters, Demos, Late-
breaking Results and Workshop Proceedings of the 23rd Conference on User
Modeling, Adaptation, and Personalization. 2015 (cit. on p. 57).
[Fie+10] Ben Fields, Christophe Rhodes, Mark d’Inverno, et al. “Using Song Social Tags
and Topic Models to Describe and Compare Playlists”. In: 1st Workshop On
Music Recommendation And Discovery. 2010 (cit. on p. 43).
[Gra+14] Alex Graves, Greg Wayne, and Ivo Danihelka. “Neural Turing Machines”. In:
CoRR abs/1410.5401 (2014) (cit. on p. 20).
[Har+12] Negar Hariri, Bamshad Mobasher, and Robin Burke. “Context-aware Music
Recommendation Based on Latent Topic Sequential Patterns”. In: Proceedings of
the Sixth ACM Conference on Recommender Systems. RecSys ’12. 2012, pp. 131–
138 (cit. on pp. 8, 19, 20, 45, 48).
[Hid+15] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk.
“Session-Based Recommendations with Recurrent Neural Networks”. In: CoRR
abs/1511.06939 (2015) (cit. on pp. 20–23).
[Hid+17] Balázs Hidasi and Alexandros Karatzoglou. “Recurrent Neural Networks with
Top-k Gains for Session-Based Recommendations”. In: CoRR abs/1706.03847
(2017) (cit. on p. 20).
[Hin+06] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. “A Fast Learning
Algorithm for Deep Belief Nets”. In: Neural Computation 18.7 (2006), pp. 1527–
1554 (cit. on p. 16).
[Hum+12] Eric J. Humphrey, Juan Pablo Bello, and Yann LeCun. “Moving Beyond Feature
Design: Deep Architectures and Automatic Feature Learning in Music Informat-
ics.” In: Proceedings of the 13th International Conference on Music Information
Retrieval. 2012, pp. 403–408 (cit. on p. 16).
[Jam+10] Tamas Jambor and Jun Wang. “Optimizing Multiple Objectives in Collabora-
tive Filtering”. In: Proceedings of the Fourth ACM Conference on Recommender
Systems. RecSys ’10. 2010, pp. 55–62 (cit. on p. 30).
[Jan+12] Dietmar Jannach, Markus Zanker, Mouzhi Ge, and Marian Gröning. “Recom-
mender Systems in Computer Science and Information Systems - A Landscape
of Research”. In: 13th International Conference on Electronic Commerce and Web
Technologies. 2012, pp. 76–87 (cit. on p. 27).
[Jan+14] Dietmar Jannach, Iman Kamehkhosh, and Geoffray Bonnin. “Analyzing the
Characteristics of Shared Playlists for Music Recommendation”. In: Proceedings
of the 6th Workshop on Recommender Systems and the Social Web at ACM RecSys.
2014 (cit. on pp. 8, 12, 28, 34, 54, 56, 75).
[Jan+15a] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. “Beyond "Hitting the
Hits": Generating Coherent Music Playlist Continuations with the Right Tracks”.
In: Proceedings of the 9th ACM Conference on Recommender Systems. RecSys ’15.
2015, pp. 187–194 (cit. on pp. 11, 13, 17, 21, 28–30, 44, 48, 75).
[Jan+15b] Dietmar Jannach, Lukas Lerche, and Michael Jugovac. “Item Familiarity as a
Possible Confounding Factor in User-Centric Recommender Systems Evalua-
tion”. In: i-com Journal of Interactive Media 14.1 (2015), pp. 29–39 (cit. on
p. 36).
[Jan+15c] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac.
“What Recommenders Recommend: An Analysis of Recommendation Biases
and Possible Countermeasures”. In: User Modeling and User-Adapted Interaction
25.5 (2015), pp. 427–491 (cit. on pp. 27, 47, 76).
[Jan+16] Dietmar Jannach, Iman Kamehkhosh, and Geoffray Bonnin. “Biases in Auto-
mated Music Playlist Generation: A Comparison of Next-Track Recommending
Techniques”. In: Proceedings of the 24th Conference on User Modeling, Adaptation
and Personalization. UMAP ’16. 2016, pp. 281–285 (cit. on pp. 12, 13, 18, 33,
42, 45, 46, 48, 54, 75).
[Jan+17a] Dietmar Jannach, Iman Kamehkhosh, and Lukas Lerche. “Leveraging Multi-
dimensional User Models for Personalized Next-track Music Recommendation”.
In: Proceedings of the 32nd ACM SIGAPP Symposium on Applied Computing. SAC
’17. 2017, pp. 1635–1642 (cit. on pp. 4, 6, 7, 11, 13, 25, 27, 33, 44, 48, 75).
[Jan+17b] Dietmar Jannach, Malte Ludewig, and Lukas Lerche. “Session-based Item
Recommendation in E-Commerce: On Short-Term Intents, Reminders, Trends,
and Discounts”. In: User-Modeling and User-Adapted Interaction 27.3–5 (2017),
pp. 351–392 (cit. on pp. 4, 41).
[Jan+17c] Dietmar Jannach and Malte Ludewig. “When Recurrent Neural Networks Meet
the Neighborhood for Session-Based Recommendation”. In: Proceedings of the
Eleventh ACM Conference on Recommender Systems. RecSys ’17. 2017, pp. 306–
310 (cit. on p. 23).
[Jug+17] Michael Jugovac, Dietmar Jannach, and Lukas Lerche. “Efficient Optimization
of Multiple Recommendation Quality Factors According to Individual User
Tendencies”. In: Expert Systems With Applications 81 (2017), pp. 321–331 (cit.
on p. 31).
[Jyl+12] Antti Jylhä, Stefania Serafin, and Cumhur Erkut. “Rhythmic Walking Interac-
tions with Auditory Feedback: An Exploratory Study”. In: Proceedings of the 7th
Audio Mostly Conference: A Conference on Interaction with Sound. AM ’12. 2012,
pp. 68–75 (cit. on p. 6).
[Kam+12a] Mohsen Kamalzadeh, Dominikus Baur, and Torsten Möller. “A Survey on Music
Listening and Management Behaviours”. In: Conference of the International
Society for Music Information Retrieval. 2012, pp. 373–378 (cit. on pp. 28, 36,
43).
[Kam+12b] Marius Kaminskas and Francesco Ricci. “Contextual Music Information Re-
trieval and Recommendation: State of the Art and Challenges”. In: Computer
Science Review 6.2-3 (2012), pp. 89–119 (cit. on p. 8).
[Kam+13] Marius Kaminskas, Francesco Ricci, and Markus Schedl. “Location-aware Music
Recommendation Using Auto-tagging and Hybrid Matching”. In: Proceedings of
the 7th ACM Conference on Recommender Systems. RecSys ’13. 2013, pp. 17–24
(cit. on p. 57).
[Kam+16] Iman Kamehkhosh, Dietmar Jannach, and Lukas Lerche. “Personalized Next-
Track Music Recommendation with Multi-dimensional Long-Term Preference
Signals”. In: Proceedings of the Workshop on Multi-dimensional Information
Fusion for User Modeling and Personalization at ACM UMAP. 2016 (cit. on
pp. 13, 76).
[Kam+17b] Iman Kamehkhosh and Dietmar Jannach. “User Perception of Next-Track Music
Recommendations”. In: Proceedings of the 25th Conference on User Modeling,
Adaptation and Personalization. UMAP ’17. 2017, pp. 113–121 (cit. on pp. 12,
14, 36, 45, 48, 50, 51, 53, 75).
[Kap+15] Komal Kapoor, Vikas Kumar, Loren Terveen, Joseph A. Konstan, and Paul
Schrater. “"I Like to Explore Sometimes": Adapting to Dynamic User Nov-
elty Preferences”. In: Proceedings of the 9th ACM Conference on Recommender
Systems. RecSys ’15. 2015, pp. 19–26 (cit. on pp. 26, 28).
[Kne+06] Peter Knees, Tim Pohle, Markus Schedl, and Gerhard Widmer. “Combining
Audio-based Similarity with Web-based Data to Accelerate Automatic Music
Playlist Generation”. In: Proceedings of the 8th ACM International Workshop on
Multimedia Information Retrieval. MIR ’06. 2006, pp. 147–154 (cit. on p. 43).
[Kne+08] Peter Knees, Markus Schedl, and Tim Pohle. “A Deeper Look into Web-Based
Classification of Music Artists”. In: Proceedings of the 2nd Workshop on Learning
the Semantics of Audio Signals. 2008, pp. 31–44 (cit. on p. 17).
[Kne+13] Peter Knees and Markus Schedl. “A Survey of Music Similarity and Recom-
mendation from Music Context Data”. In: ACM Transactions on Multimedia
Computing Communications, and Applications 10.1 (Dec. 2013), 2:1–2:21 (cit.
on pp. 5, 16, 18).
[Kni+12] Bart P. Knijnenburg, Martijn C. Willemsen, Zeno Gantner, Hakan Soncu, and
Chris Newell. “Explaining the User Experience of Recommender Systems”. In:
User Modeling and User-Adapted Interaction 22.4-5 (Oct. 2012), pp. 441–504
(cit. on p. 47).
[Kor+09] Yehuda Koren, Robert Bell, and Chris Volinsky. “Matrix Factorization Techniques
for Recommender Systems”. In: Computer 42.8 (Aug. 2009), pp. 30–37 (cit. on
p. 18).
[Köc+16] Sören Köcher, Dietmar Jannach, Michael Jugovac, and Hartmut H. Holzmüller.
“Investigating Mere-Presence Effects of Recommendations on the Consumer
Choice Process”. In: Proceedings of the Joint Workshop on Interfaces and Human
Decision Making for Recommender Systems at RecSys. 2016 (cit. on p. 42).
[Lam+10] Alexandra Lamont and Rebecca Webb. “Short- and Long-Term Musical Prefer-
ences: What Makes a Favourite Piece of Music?” In: Psychology of Music 38.2
(2010), pp. 222–241 (cit. on p. 36).
[LeC+98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-Based
Learning Applied to Document Recognition”. In: Proceedings of the IEEE 86.11
(1998), pp. 2278–2324 (cit. on p. 16).
[Lee+04] Jin Ha Lee and J. Stephen Downie. “Survey Of Music Information Needs, Uses,
And Seeking Behaviours: Preliminary Findings”. In: 5th International Conference
on Music Information Retrieval. 2004 (cit. on p. 36).
[Lee+11] Jin Ha Lee, Bobby Bare, and Gary Meek. “How Similar Is Too Similar?: Explor-
ing Users’ Perceptions of Similarity in Playlist Evaluation”. In: Conference of the
International Society for Music Information Retrieval. 2011, pp. 109–114 (cit. on
p. 43).
[Lee+16] Jin Ha Lee and Rachel Price. “User Experience with Commercial Music Services:
An Empirical Exploration”. In: Journal of the Association for Information Science
and Technology 67.4 (2016), pp. 800–811 (cit. on p. 47).
[Lev+07] Mark Levy and Mark Sandler. “A Semantic Space for Music Derived from Social
Tags”. In: 8th International Conference on Music Information Retrieval. 2007
(cit. on p. 16).
[Lin+03] Greg Linden, Brent Smith, and Jeremy York. “Amazon.Com Recommendations:
Item-to-Item Collaborative Filtering”. In: IEEE Internet Computing 7.1 (Jan.
2003), pp. 76–80 (cit. on p. 4).
[Lip15] Zachary Chase Lipton. “A Critical Review of Recurrent Neural Networks for
Sequence Learning”. In: CoRR abs/1506.00019 (2015). arXiv: 1506.00019
(cit. on p. 20).
[Log+04] Beth Logan, Andrew Kositsky, and Pedro Moreno. “Semantic Analysis of Song
Lyrics”. In: IEEE International Conference on Multimedia and Expo. Vol. 2. 2004,
pp. 827–830 (cit. on p. 16).
[Log04] Beth Logan. “Music Recommendation from Song Sets”. In: Conference of the
International Society for Music Information Retrieval. 2004, pp. 425–428 (cit. on
pp. 8, 15, 43).
[Lon+16] Babak Loni, Roberto Pagano, Martha Larson, and Alan Hanjalic. “Bayesian
Personalized Ranking with Multi-Channel User Feedback”. In: Proceedings of the
10th ACM Conference on Recommender Systems. RecSys ’16. 2016, pp. 361–364
(cit. on p. 5).
[Mai+09] François Maillet, Douglas Eck, Guillaume Desjardins, and Paul Lamere. “Steer-
able Playlist Generation by Learning Song Similarity from Radio Station
Playlists”. In: International Society for Music Information Retrieval Conference.
2009, pp. 345–350 (cit. on p. 43).
[McF+11] Brian McFee and Gert RG Lanckriet. “The Natural Language of Playlists”. In:
Conference of the International Society for Music Information Retrieval. Vol. 11.
2011, pp. 537–542 (cit. on pp. 20, 43–45).
[McF+12a] Brian McFee and Gert R. G. Lanckriet. “Hypergraph Models of Playlist Dialects”.
In: Proceedings of the 13th International Society for Music Information Retrieval
Conference. 2012, pp. 343–348 (cit. on p. 23).
[McF+12b] Brian McFee, Thierry Bertin-Mahieux, Daniel P.W. Ellis, and Gert R.G. Lanckriet.
“The Million Song Dataset Challenge”. In: Proceedings of the 21st International
Conference on World Wide Web. WWW ’12 Companion. 2012, pp. 909–916
(cit. on p. 16).
[Moe+10] Bart Moens, Leon van Noorden, and Marc Leman. “D-Jogger: Syncing Music
with Walking”. eng. In: Proceedings of Sound and Music Computing. Vol. online.
2010, pp. 451–456 (cit. on p. 6).
[Mol+12] Omar Moling, Linas Baltrunas, and Francesco Ricci. “Optimal Radio Channel
Recommendations with Explicit and Implicit Feedback”. In: Proceedings of the
Sixth ACM Conference on Recommender Systems. RecSys ’12. 2012, pp. 75–82
(cit. on p. 5).
[Moo+12] Jashua L. Moore, Shuo Chen, Thorsten Joachims, and Douglas Turnbull. “Learn-
ing to Embed Songs and Tags for Playlist Prediction”. In: Conference of the In-
ternational Society for Music Information Retrieval. Vol. 12. 2012, pp. 349–354
(cit. on pp. 8, 15, 45).
[Oh+11] Jinoh Oh, Sun Park, Hwanjo Yu, Min Song, and Seung-Taek Park. “Novel
Recommendation Based on Personal Popularity Tendency”. In: Proceedings of
the 2011 IEEE 11th International Conference on Data Mining. ICDM ’11. 2011,
pp. 507–516 (cit. on p. 28).
[Oor+13] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. “Deep
Content-Based Music Recommendation”. In: Advances in Neural Information
Processing Systems. Curran Associates, Inc., 2013, pp. 2643–2651 (cit. on p. 16).
[Pac+01] François Pachet, Gert Westermann, and Damien Laigre. “Musical Data Mining
for Electronic Music Distribution”. In: Proceedings of the First International
Conference on WEB Delivering of Music. WEDELMUSIC ’01. 2001, p. 101 (cit. on
p. 19).
[Pan+08] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N. Liu, Rajan Lukose, Martin Scholz,
et al. “One-Class Collaborative Filtering”. In: Proceedings of the 2008 Eighth
IEEE International Conference on Data Mining. ICDM ’08. 2008, pp. 502–511
(cit. on p. 18).
[Par+11] Sung Eun Park, Sangkeun Lee, and Sang-goo Lee. “Session-Based Collabo-
rative Filtering for Predicting the Next Song”. In: Proceedings of the 2011
First ACIS/JNU International Conference on Computers, Networks, Systems and
Industrial Engineering. CNSI ’11. 2011, pp. 353–358 (cit. on p. 20).
[Poh+05] Tim Pohle, Elias Pampalk, and Gerhard Widmer. “Generating Similarity-Based
Playlists using Traveling Salesman Algorithms”. In: Proceedings of the 8th
International Conference on Digital Audio Effects. 2005, pp. 220–225 (cit. on
p. 15).
[Poh+07] Tim Pohle, Peter Knees, Markus Schedl, and Gerhard Widmer. “Building an In-
teractive Next-Generation Artist Recommender Based on Automatically Derived
High-Level Concepts”. In: International Workshop on Content-Based Multimedia
Indexing. 2007, pp. 336–343 (cit. on p. 16).
[Pu+11] Pearl Pu, Li Chen, and Rong Hu. “A User-centric Evaluation Framework for
Recommender Systems”. In: Proceedings of the Fifth ACM Conference on Recom-
mender Systems. RecSys ’11. 2011, pp. 157–164 (cit. on p. 47).
[Pál+14] Róbert Pálovics, András A. Benczúr, Levente Kocsis, Tamás Kiss, and Erzsébet
Frigó. “Exploiting Temporal Influence in Online Recommendation”. In: Pro-
ceedings of the 8th ACM Conference on Recommender Systems. RecSys ’14. 2014,
pp. 273–280 (cit. on p. 5).
[Rib+14] Marco Tulio Ribeiro, Nivio Ziviani, Edleno Silva De Moura, Itamar Hata, Anisio
Lacerda, and Adriano Veloso. “Multiobjective Pareto-Efficient Approaches for
Recommender Systems”. In: ACM Transactions on Intelligent Systems Technology
5.4 (Dec. 2014), 53:1–53:20 (cit. on p. 28).
[Sar+12] Andy M. Sarroff and Michael Casey. “Modeling and Predicting Song Adjacencies
in Commercial Albums”. In: Proceedings of Sound and Music Computing. 2012
(cit. on p. 34).
[Sch+11] Jan Schlüter and Christian Osendorfer. “Music Similarity Estimation with
the Mean-Covariance Restricted Boltzmann Machine”. In: 10th International
Conference on Machine Learning and Applications and Workshops. Vol. 2. 2011,
pp. 118–123 (cit. on p. 16).
[Sch+17a] Thomas Schäfer and Claudia Mehlhorn. “Can Personality Traits Predict Musical
Style Preferences? A Meta-Analysis”. In: Personality and Individual Differences
116.Supplement C (2017), pp. 265–273 (cit. on p. 57).
[Sch+17b] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi
Elahi. “Current Challenges and Visions in Music Recommender Systems Re-
search”. In: CoRR abs/1710.03208 (2017). arXiv: 1710.03208 (cit. on pp. 7,
44, 45, 57).
[Sch+17c] Markus Schedl, Peter Knees, and Fabien Gouyon. “New Paths in Music Recom-
mender Systems Research”. In: Proceedings of the Eleventh ACM Conference on
Recommender Systems. RecSys ’17. 2017, pp. 392–393 (cit. on pp. 3, 6, 7).
[Sha+09] Yuval Shavitt and Udi Weinsberg. “Song Clustering Using Peer-to-Peer Co-
occurrences”. In: 11th IEEE International Symposium on Multimedia. 2009,
pp. 471–476 (cit. on p. 18).
[Sha+95] Upendra Shardanand and Pattie Maes. “Social Information Filtering: Algorithms
for Automating Word of Mouth”. In: Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems. CHI ’95. 1995, pp. 210–217 (cit. on p. 2).
[Shi+12] Yue Shi, Xiaoxue Zhao, Jun Wang, Martha Larson, and Alan Hanjalic. “Adaptive
Diversification of Recommendation Results via Latent Factor Portfolio”. In:
Proceedings of the 35th International ACM SIGIR Conference on Research and
Development in Information Retrieval. SIGIR ’12. 2012, pp. 175–184 (cit. on
p. 28).
[Sla+06] Malcolm Slaney and William White. “Measuring Playlist Diversity for Recom-
mendation Systems”. In: Proceedings of the 1st ACM Workshop on Audio and
Music Computing Multimedia. AMCMM ’06. 2006, pp. 77–82 (cit. on pp. 28, 34,
43).
[Sla11] Malcolm Slaney. “Web-Scale Multimedia Analysis: Does Content Matter?” In:
IEEE MultiMedia 18.2 (2011), pp. 12–15 (cit. on p. 15).
[Stu+11] Simone Stumpf and Sam Muscroft. “When Users Generate Music Playlists:
When Words Leave Off, Music Begins?” In: 2011 IEEE International Conference
on Multimedia and Expo. 2011, pp. 1–6 (cit. on p. 36).
[Swe+02] Kirsten Swearingen and Rashmi Sinha. “Interaction Design for Recommender
Systems”. In: Designing Interactive Systems 6.12 (2002), pp. 312–334 (cit. on
p. 36).
[Tan+16] Yong Kiam Tan, Xinxing Xu, and Yong Liu. “Improved Recurrent Neural Net-
works for Session-based Recommendations”. In: CoRR abs/1606.08117 (2016)
(cit. on p. 20).
[Tin+17] Nava Tintarev, Christoph Lofi, and Cynthia C.S. Liem. “Sequences of Diverse
Song Recommendations: An Exploratory Study in a Commercial System”. In:
Proceedings of the 25th Conference on User Modeling, Adaptation and Personal-
ization. UMAP ’17. 2017, pp. 391–392 (cit. on pp. 7, 36).
[Tur+15] Roberto Turrin, Massimo Quadrana, Andrea Condorelli, Roberto Pagano, and
Paolo Cremonesi. “30Music Listening and Playlists Dataset”. In: Poster
Proceedings RecSys ’15. 2015 (cit. on pp. 22, 44).
[Tza+02] George Tzanetakis and Perry Cook. “Musical Genre Classification of Audio
Signals”. In: IEEE Transactions on Speech and Audio Processing 10.5 (2002),
pp. 293–302 (cit. on p. 16).
[Tza02] George Tzanetakis. “Manipulation, Analysis and Retrieval Systems for Audio
Signals”. PhD thesis. Princeton University, 2002 (cit. on p. 16).
[Vas+16] Flavian Vasile, Elena Smirnova, and Alexis Conneau. “Meta-Prod2Vec: Product
Embeddings Using Side-Information for Recommendation”. In: Proceedings of
the 10th ACM Conference on Recommender Systems. RecSys ’16. 2016, pp. 225–
232 (cit. on pp. 5, 8, 18).
[VG+05] Rob Van Gulik and Fabio Vignoli. “Visual Playlist Generation on the Artist
Map.” In: Conference of the International Society for Music Information Retrieval
(ISMIR). Vol. 5. 2005, pp. 520–523 (cit. on p. 16).
[Vig+05] Fabio Vignoli and Steffen Pauws. “A Music Retrieval System Based on User
Driven Similarity and Its Evaluation.” In: Conference of the International Society
for Music Information Retrieval. 2005, pp. 272–279 (cit. on p. 15).
[Voz+03] Emmanouil Vozalis and Konstantinos G Margaritis. “Analysis of Recommender
Systems Algorithms”. In: The 6th Hellenic European Conference on Computer
Mathematics & its Applications. 2003, pp. 732–745 (cit. on p. 3).
[Wan+12] Xinxi Wang, David Rosenblum, and Ye Wang. “Context-aware Mobile Music Rec-
ommendation for Daily Activities”. In: Proceedings of the 20th ACM International
Conference on Multimedia. MM ’12. 2012, pp. 99–108 (cit. on p. 57).
[Wan+14] Xinxi Wang and Ye Wang. “Improving Content-based and Hybrid Music Recom-
mendation Using Deep Learning”. In: Proceedings of the 22nd ACM International
Conference on Multimedia. MM ’14. 2014, pp. 627–636 (cit. on p. 16).
[Wei+16] Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Guenter Rudolph, eds. Music
Data Analysis: Foundations and Applications. CRC Press, 2016 (cit. on p. 16).
[Whi+02] Brian Whitman and Steve Lawrence. “Inferring Descriptions and Similarity for
Music from Community Metadata.” In: Proceedings of the 2002 International
Computer Music Conference. 2002 (cit. on p. 17).
[Wu+13] Xiang Wu, Qi Liu, Enhong Chen, Liang He, Jingsong Lv, Can Cao, et al. “Person-
alized Next-song Recommendation in Online Karaokes”. In: Proceedings of the
7th ACM Conference on Recommender Systems. RecSys ’13. 2013, pp. 137–140
(cit. on pp. 8, 24).
[Zan+12] Eva Zangerle, Wolfgang Gassler, and Günther Specht. “Exploiting Twitter’s
Collective Knowledge for Music Recommendations.” In: Proceedings of the 21st
International World Wide Web Conference: Making Sense of Microposts. 2012,
pp. 14–17 (cit. on pp. 18, 22).
[Zha+08] Mi Zhang and Neil Hurley. “Avoiding Monotony: Improving the Diversity
of Recommendation Lists”. In: Proceedings of the 2008 ACM Conference on
Recommender Systems. RecSys ’08. 2008, pp. 123–130 (cit. on p. 28).
[Zhe+10] Elena Zheleva, John Guiver, Eduarda Mendes Rodrigues, and Nataša Milić-
Frayling. “Statistical Models of Music-Listening Sessions in Social Media”. In:
Proceedings of the 19th International Conference on World Wide Web. 2010,
pp. 1019–1028 (cit. on p. 17).
[Zie+05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen.
“Improving Recommendation Lists Through Topic Diversification”. In: Proceed-
ings of the 14th International Conference on World Wide Web. WWW ’05. 2005,
pp. 22–32 (cit. on pp. 28, 44).
[Özg+14] Özlem Özgöbek, Jon Atle Gulla, and Riza Cenk Erdur. “A Survey on Chal-
lenges and Methods in News Recommendation”. In: Proceedings of the 10th
International Conference on Web Information Systems and Technologies. 2014,
pp. 278–285 (cit. on p. 21).
Web pages
[Ber14] Erik Bernhardsson. Recurrent Neural Networks for Collaborative Filtering. 2014.
URL: https://erikbern.com/2014/06/28/recurrent-neural-networks-for-collaborative-filtering.html (cit. on p. 20).
[Fri16] Joshua P. Friedlander. News and Notes on 2017 Mid-Year RIAA Revenue Statistics.
2016. URL: https://www.riaa.com/wp-content/uploads/2017/09/RIAA-Mid-Year-2017-News-and-Notes2.pdf (cit. on p. 1).
[Goo13] Howard Goodall. BBC Howard Goodall’s Story of Music – Part 1. 2013. URL:
https://www.youtube.com/watch?v=I0Y6NPahlDE (cit. on p. 1).
[Hog15] Marc Hogan. Up Next: How Playlists Are Curating the Future of Music. 2015.
URL: https://pitchfork.com/features/article/9686-up-next-how-playlists-are-curating-the-future-of-music/ (cit. on p. 2).
[Joh+15] Chris Johnson and Edward Newett. From Idea to Execution: Spotify’s Discover
Weekly. 2015. URL: https://de.slideshare.net/MrChrisJohnson/from-
idea-to-execution-spotifys-discover-weekly/ (cit. on pp. 5, 17, 18).
[Joh14] Chris Johnson. Algorithmic Music Discovery at Spotify. 2014. URL: https://de.
slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-
at-spotify/ (cit. on pp. 17, 18).
Publications
In this thesis by publication the following six works of the author are included. These
publications are related to next-track music recommendation. The full texts of these
works can be found after this list.
• Iman Kamehkhosh, Dietmar Jannach, and Malte Ludewig. “A Comparison of
Frequent Pattern Techniques and a Deep Learning Method for Session-Based
Recommendation”. In: Proceedings of the Workshop on Temporal Reasoning in
Recommender Systems at ACM RecSys. 2017, pp. 50–56
In addition to these six main publications, the author of this thesis worked on the
following other publications related to recommender systems that are not part of
this thesis.
Analyzing the Characteristics of Shared Playlists for Music Recommendation

ABSTRACT

The automated generation of music playlists – as supported by modern music services like last.fm or Spotify – represents a special form of music recommendation. When designing a “playlisting” algorithm, the question arises which kind of quality criteria the generated playlists should fulfill and if there are certain characteristics like homogeneity, diversity or freshness that make the playlists generally more enjoyable for the listeners. In our work, we aim to obtain a better understanding of such desired playlist characteristics in order to be able to design better algorithms in the future. The research approach chosen in this work is to analyze several thousand playlists that were created and shared by users on music platforms based on musical and meta-data features. Our first results for example reveal that factors like popularity, freshness and diversity play a certain role for users when they create playlists manually. Comparing such user-generated playlists with automatically created ones moreover shows that today’s online playlisting services sometimes generate playlists which are quite different from user-created ones. Finally, we compare the user-created playlists with playlists generated with a nearest-neighbor technique from the research literature and observe even stronger differences. This last observation can be seen as another indication that the accuracy-based quality measures from the literature are probably not sufficient to assess the effectiveness of playlisting algorithms.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.5 [Information Interfaces and Presentation]: Sound and Music Computing

General Terms

Playlist generation, Music recommendation

Keywords

Music, playlist, analysis, algorithm, evaluation

Proceedings of the 6th Workshop on Recommender Systems and the Social Web (RSWeb 2014), collocated with ACM RecSys 2014, 10/06/2014, Foster City, CA, USA. Copyright held by the authors.

1. INTRODUCTION

The automated creation of playlists or personalized radio stations is a typical feature of today’s online music platforms and music streaming services. In principle, standard recommendation algorithms based on collaborative filtering or content-based techniques can be applied to generate a ranked list of musical tracks given some user preferences or past listening history. For several reasons, the generation of playlists however represents a very specific music recommendation problem. Personal playlists are, for example, often created with a certain goal or usage context (e.g., sports, relaxation, driving) in mind. Furthermore, in contrast to relevance-ranked recommendation lists used in other domains, playlists typically obey some homogeneity and coherence criteria, i.e., there are quality characteristics that are related to the transitions between the tracks or to the playlist as a whole.

In the research literature, a number of approaches for the automation of the playlist generation process have been proposed, see, e.g., [2, 6, 8, 10, 11] or the recent survey in [3]. Some of them for example take a seed song or artist as an input and look for similar tracks; others try to find track co-occurrence patterns in existing playlists. In some approaches, playlist generation is considered as an optimization problem. Independent of the chosen technique, a common problem when designing new playlisting algorithms is to assess whether or not the generated playlists will be positively perceived by the listeners. User studies and online experiments are unfortunately particularly costly in the music domain. Researchers therefore often use offline experimental designs and for example use existing playlists shared by users on music platforms as a basis for their evaluations. The assumption is that these “hand-crafted” playlists are of good quality; typical measures used in the literature include the Recall [8] or the Average Log-Likelihood (ALL) [11]. Unfortunately, both measures have their limitations, see also [2]. The Recall measure for example tells us how good an algorithm is at predicting the tracks selected by the users, but does not explicitly capture specific aspects such as the homogeneity or the smoothness of track transitions.

To design better and more comprehensive quality measures, we however first have to answer the question of what users consider to be desirable characteristics of playlists or what the driving principles are when users create playlists. In the literature, a few works have studied this aspect using different approaches, e.g., user studies [1, 7] or analyzing forum posts [5]. The work presented in this paper continues these lines of research. Our research approach is however
different from previous works as we aim to identify patterns in a larger set of manually created playlists that were shared by users of three different online music platforms. To be able to take a variety of potential driving factors into account in our analysis, we have furthermore collected various types of meta-data and musical features of the playlist tracks from public music databases.

Overall, with our analyses we hope to obtain insights on the principles which an automated playlist generation system should observe to end up with better-received or more “natural” playlists. To test if current music services and a nearest-neighbor algorithm from the literature generate playlists that observe the identified patterns and make similar choices as real users, we conducted an experiment in which we analyzed commonalities and differences between automatically generated and user-provided playlists.

Before reporting the details of our first analyses, we will first discuss previous works in the next section.

2. PREVIOUS WORKS

In [14], Slaney and White addressed the question if users have a tendency to create very homogeneous or rather diverse playlists. As a basis for determining the diversity they relied on an objective measure based on genre information about the tracks. Each track was considered as a point in the genre space and the diversity was then determined by calculating the volume of an ellipsoid enclosing the tracks of the playlist. An analysis of 887 user-created playlists indicated that diversity can be considered to be a driving factor as users typically create playlists covering several genres.

Sarroff and Casey more recently [13] focused on track transitions in album playlists and made an analysis to determine if there are certain musical characteristics that are particularly important. One of the results of their investigation was that fade durations and the mean timbre of the beginnings and endings of consecutive tracks seem to have a strong influence on the ordering of the tracks.

Generally, our work is similar to [14] and [13] in that we rely on user-created (“hand-crafted”) playlists and look at meta-data and musical features of the tracks to identify potentially important patterns. The aspects we cover in this paper were however not covered in their work and our analysis is based on larger datasets.

Cunningham et al. [5], in contrast, relied on another form of track-related information and looked at the user posts in the forum of the Art of the Mix web site. According to their analysis, the typical principles for setting up the playlists mentioned by the creators were related to the artist, genre, style, event or activity but also the intended purpose, context or mood. Some users also talked about the smoothness of track transitions and how many tracks of one single artist should be included in playlists. Placing the most “important” track at the end of a playlist was another strategy mentioned by some of the playlist creators.

A different form of identifying playlist creation principles is to conduct laboratory studies with users. The study reported in [7] for example involved 52 subjects and indicated that the first and the last tracks can play an important role for the quality of a playlist. In another study, Andric and Haus [1] concluded that the ordering of tracks is not important when the playlist mainly contains tracks which the users like in general. Reynolds et al. [12] made an online survey that revealed that the context and environment, like the location, activity or the weather, can have an influence both on the listeners’ mood and on the track selection behavior of playlist creators. Finally, the study presented in [9] again confirmed the importance of artists, genres and mood in the playlist creation process.

In this discussion, we have focused on previous attempts to understand how users create playlists and what their characteristics are. Playlist generation algorithms however do not necessarily have to rely on such knowledge. Instead, one can follow a statistical approach and only look at co-occurrences and transitions of tracks in existing playlists and use these patterns when creating new playlists, see e.g., [2] or [4]. This way, the quality factors respected by human playlist creators are implicitly taken into account. Such approaches, however, cannot be directly applied for many types of playlist generation settings, e.g., for creating “thematic” playlists (e.g., Christmas Songs) or for creating playlists that only contain tracks that have certain musical features. Pure statistical methods are not aware of these characteristics and the danger exists that tracks are included that do not match the purpose of the list and thus lead to a limited overall quality.

3. CHARACTERISTICS OF PLAYLISTS

The ultimate goal of our research is to analyze the structure and characteristics of playlists in order to better understand the principles used by the users to create them. This section is a first step toward this goal.

3.1 Data sources

As a basis for the first analyses that we report in this paper, we used two types of playlist data.

3.1.1 Hand-crafted playlists

We used samples of hand-crafted playlists from three different sources. One set of playlists was retrieved via the public API of last.fm¹, one was taken from the Art of the Mix (AotM) website², and a third one was provided to us by 8tracks³. To enhance the data quality, we corrected artist misspellings using the API of last.fm.

Overall, we analyzed over 10,000 playlists containing about 108,000 different tracks of about 40,000 different artists. As a first attempt toward our goal, we retrieved the features listed in Table 1 using the public APIs of last.fm and The Echo Nest (tEN), and the MusicBrainz database.

Some dataset characteristics are shown in Table 2. The “usage count” statistics express how often tracks and artists appeared overall in the playlists. When selecting the playlists, we made sure that they do not simply contain album listings. The datasets are partially quite different, e.g., with respect to the average playlist lengths. The 8tracks dataset furthermore has the particularity that users are not allowed to include more than two tracks of one artist, in case they want to share their playlist with others.

Figure 1 shows the distributions of playlist lengths. As can be seen, the distributions are quite different across the datasets. On 8tracks, a playlist generally has to comprise

¹ http://www.last.fm
² http://www.artofthemix.org
³ http://8tracks.com
at least 8 tracks. The lengths of the last.fm playlists seem to follow a normal distribution with a maximum frequency value at around 20 tracks. Finally, the sizes of the AotM playlists are much more equally distributed.

Source    Information    Description
last.fm   Tags           Top tags assigned by users to the track.
last.fm   Playcounts     Total number of times the users played the track.
tEN       Genres         Genres of the artist of the track. Multiple genres can be assigned to a single artist.
tEN       Danceability   Suitability of the track for dancing, based on various information including the beat strength and the stability of the tempo.
tEN       Energy         Intensity released throughout the track, based on various information including the loudness and segment durations.
tEN       Loudness       Overall loudness of the track in decibels (dB).
tEN       Tempo          Speed of the track estimated in beats per minute (BPM).
tEN       Hotttnesss     Current reputation of the track based on its activity on some web sites crawled by the developers.
MB        Release year   Year of release of the corresponding album.

Table 1: Additional retrieved information.

Figure 1: Distribution of playlist sizes. [Plot omitted; number of playlists over playlist sizes from 2 to 50 tracks, per dataset.]

3.1.2 Generated playlists

To assess if the playlists generated by today’s online services are similar to those created by users, we used the public API of The Echo Nest. We chose this service because it uses a very large database and allows the generation of playlists from several seed tracks, as opposed to, for instance, iTunes Genius or last.fm radios. We split the existing hand-crafted playlists in half, provided the first half of the list as seed tracks to the music service and then analyzed the characteristics of the playlist returned by The Echo Nest and compared them to the patterns that we found in hand-crafted playlists. Instead of observing whether a playlister generates playlists that are generally similar to playlists created by hand, our goal here is to break down their different characteristics and observe on what specific dimensions they differ. Notice that using the second half as seed would not be appropriate as the order of the tracks may be important.

We also draw our attention to the ability of the algorithms of the literature to reproduce the characteristics of hand-crafted playlists. According to some recent research, one of the most competitive approaches in terms of recall is the simple k-nearest-neighbors (kNN) method [2, 8]. More precisely, given some seed tracks, the algorithm extracts the k most similar playlists based on the number of shared items and recommends the tracks of these playlists. This algorithm does not require a training step and scans the entire set of available playlists for each recommendation.

(tEN) and kNN playlisters⁵. We provided the first half of the hand-crafted playlists as seed tracks and the playlisters had to select the same number of tracks as the number of remaining tracks.

The results show that users actually tend to place more popular items in the first part of the list in all datasets, when play counts are considered. The Echo Nest playlister does not seem to take that form of popularity into account

⁴ We organized the average play counts in 100 bins.
⁵ We determined 10 as the best neighborhood size for our data sets based on the recall value, see Section 4.
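The kNN playlister described in the text (extract the k playlists sharing the most tracks with the seed, then recommend the tracks of those neighbors) is compact enough to sketch directly. The following is our own illustrative reconstruction under simplifying assumptions (plain overlap counts, one vote per neighbor playlist); it is not the authors' implementation, and all names are ours:

```python
from collections import Counter

def knn_playlist(seed, playlists, k=10, n=10):
    """Recommend n continuation tracks for a seed playlist.

    Neighborhood: the k playlists sharing the most tracks with the
    seed. Ranking: tracks are scored by the number of neighbor
    playlists containing them; seed tracks are never recommended."""
    seed_set = set(seed)
    # Similarity = number of shared tracks; no training step needed.
    neighbors = sorted(playlists,
                       key=lambda p: len(seed_set & set(p)),
                       reverse=True)[:k]
    votes = Counter(t for p in neighbors for t in set(p) - seed_set)
    return [track for track, _ in votes.most_common(n)]
```

The paper reports k = 10 as the best neighborhood size for these datasets (footnote 5); as noted in the text, the method scans the entire playlist collection for every recommendation.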
and recommends on average less popular tracks. These differences are statistically significant according to a Student’s t-test (p < 10⁻⁵ for The Echo Nest playlister and p < 10⁻⁷ for the kNN playlister). This behavior also indicates that The Echo Nest is successfully replicating the fact that the second halves of playlists are supposed to be less popular than the first half.

Play counts   1st half   2nd half   tEN
last.fm       1,007k     893k       629k
AotM          671k       638k       606k
8tracks       953k       897k       659k

Gini index    1st half   2nd half   tEN
last.fm       0.06       0.04       0.04
AotM          0.20       0.18       0.22
8tracks       0.09       0.09       0.08

Play counts   1st half   2nd half   kNN
last.fm       1,110k     943k       1,499k
AotM          645k       617k       867k
8tracks       1,008k     984k       1,140k

Gini index    1st half   2nd half   kNN
last.fm       0.12       0.09       0.33
AotM          0.26       0.23       0.43
8tracks       0.15       0.12       0.28

Table 3: Popularity of tracks in playlists (last.fm play counts) and concentration bias (Gini coefficient).

The Gini index reveals that there is a slightly stronger concentration on some tracks in the first half for two of three datasets and the diversity slightly increases in the second part. The absolute numbers cannot be directly compared across datasets, but for the AotM dataset the concentration is generally much higher, which is also indicated by the higher “track reuse” in Table 2. Interestingly, The Echo Nest playlister quite nicely reproduces the behavior of real users with respect to the diversity of popularity.

In the lower part of Table 3, we show the results for the kNN method. Note that these statistics are based on a different sample of the playlists than the previous measurement. The reason is that both The Echo Nest and the kNN playlisters cannot produce playlists for all of the first halves provided as seed tracks. We therefore considered only playlists for which the corresponding algorithm could produce a playlist.

Unlike the playlister of The Echo Nest, the kNN method has a strong trend to recommend mostly very popular items. This can be caused by the fact that the kNN method by design recommends tracks that are often found in similar playlists. Moreover, based on the lower half of Table 3, the popularity correlates strongly with the seed track popularity. As a result, the kNN shows a potentially undesirable trend to reinforce already popular items to everyone. At the same time, it concentrates the track selection on a comparably small number of tracks, as indicated by the very high value for the Gini coefficient.

3.2.2 The role of freshness

Next, we analyzed if there is a tendency of users to create playlists that mainly contain recently released tracks. As a measure, we compared the creation year of each playlist with the average release year of its tracks. We limit our analysis to the last.fm and 8tracks datasets because we could only acquire creation dates for these two.

Figure 2: Distribution of average freshness of playlists (comparing playlist creation date and track release date). [Plot omitted; relative frequency over the average freshness in years, for last.fm and 8tracks.]

Figure 2 shows the statistics for both datasets. We organized the data points in bins (x-axis), where each bin represents an average-freshness level, and then counted how many playlists fall into these levels. The relative frequencies are shown on the y-axis. The results are very similar for both datasets, with a slight tendency to include older tracks for last.fm. On both datasets, more than half of the playlists contain tracks that were released on average in the last 5 years, the most frequent average age being between 4 and 5 years for last.fm and between 3 and 4 years for 8tracks. Similarly, on both datasets, more than 75% of the playlists contain tracks that were released on average in the last 8 years.

We also analyzed the standard deviation of the resulting freshness values and observed that more than half of the playlists have a standard deviation of less than 4 years, while more than 75% have a standard deviation of less than 7 years on both datasets. Overall, this suggests that playlists made by users are often homogeneous with regard to the release date.

Computing the freshness for the generated playlists would require configuring the playlisters in such a way that they select only tracks that were not released after the playlists’ creation years. Unfortunately, The Echo Nest does not allow such a configuration. Moreover, for the kNN approach, the playlists that are more recent would have to be ignored, which would lead to a sample size too small for reliable results.

3.2.3 Homogeneity and diversity

Homogeneity and diversity can be determined in a variety of ways. In the following, we will use simple measures based on artist and genre counts. The genres correspond to the genres of the artists of the tracks retrieved from The Echo Nest. Basic figures for artist and genre diversity are already given in Table 2. On AotM, for example, having several tracks of an artist in a playlist is not very common⁶. On last.fm, we in contrast very often see two or more tracks of

⁶ On 8tracks, artist repetitions are limited due to license constraints.
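The concentration bias reported in Table 3 is a Gini coefficient over how often individual tracks occur in the respective playlist halves. A minimal sketch of the standard Gini computation (our own formulation of the textbook formula, not the paper's code):

```python
def gini(counts):
    """Gini coefficient of non-negative occurrence counts:
    0.0 = all items occur equally often; values close to 1.0 =
    occurrences concentrated on very few items."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    # Standard formula over the ascending-sorted counts.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

A uniform distribution such as [1, 1, 1, 1] yields 0.0, while [0, 0, 0, 4] yields 0.75, the maximum for four items; higher values thus indicate the kind of concentration on few tracks that the text attributes to the kNN playlister.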
one artist in a playlist. A similar, very rough estimate can be made for the genre diversity. If we ordered the tracks of a playlist by genre, we would encounter a different genre on last.fm only after having listened to about 10 tracks. On AotM and 8tracks, in contrast, playlists on average cover more genres.

Table 4 shows the diversities of the first and second halves of the hand-crafted playlists, and for the automatic selections using the first halves as seeds. As a measure of diversity, we simply counted the number of artists and genres and divided by the corresponding number of tracks. The values in Table 4 correspond to the averages of these diversity measures.

                    1st half   2nd half   tEN
last.fm   artists   0.74       0.76       0.93
          genres    2.26       2.30       2.12
AotM      artists   0.93       0.93       0.94
          genres    3.26       3.22       2.41
8tracks   artists   0.97       0.98       0.99
          genres    3.74       3.85       2.89

                    1st half   2nd half   kNN
last.fm   artists   0.74       0.76       0.87
          genres    2.32       2.26       3.11
AotM      artists   0.94       0.94       0.91
          genres    3.27       3.21       3.70
8tracks   artists   0.97       0.98       0.93
          genres    3.94       3.92       4.06

Table 4: Diversity of playlists (number of artists and genres divided by the corresponding number of tracks).

Regarding the diversity of the hand-crafted playlists, the tables show that users tend to keep the same level of artist and genre diversity throughout the playlists. We can also notice that the playlists of last.fm are much more homogeneous.

The diversity values of the automatic selections reveal several things. First, The Echo Nest playlister tends to always maximize the artist diversity independently of the diversity of the seeds; on the contrary, the kNN playlister lowered the initial artist diversities, except on the last.fm dataset, where it increased them, though less than The Echo Nest playlister. Regarding the genre diversity, we can observe an opposite tendency for both playlisters: The Echo Nest playlister tends to reduce the genre diversity while the kNN playlister tends to increase it. Again, these differences are statistically significant (p < 0.03 for The Echo Nest playlister and p < 0.006 for the kNN playlister). Overall, the resulting diversities of both approaches tend to be rather dissimilar to those of the hand-crafted playlists.

3.2.4 Musical features (The Echo Nest)

Figure 3 shows the overall relative frequency distribution of the numerical features from The Echo Nest listed in Table 1 for the set of tracks appearing in our playlists on a normalized scale. For the loudness feature, for example, we see that most tracks have values between 40 and 50 on the normalized scale. This would translate into an actual loudness value of -20 to 0 returned by The Echo Nest, given that the range is -100 to 100.

Figure 3: Distribution of The Echo Nest track musical features independently of playlists. [Plot omitted; relative frequency of Energy [0,1], Hotttnesss [0,1], Loudness [-100,100], Danceability [0,1] and Tempo [0,500] on a normalized 0–100 scale.]

To understand if people tend to place tracks with specific feature values into their playlists, we then computed the distribution of the average feature values of each playlist. Figure 4 shows the results of this measurement for the energy and “hotttnesss” features. For all the other features (danceability, loudness and tempo), the distributions were similar to those of Figure 3, which could mean that they are generally not particularly important for the users.

Figure 4: Distribution of mean energy and “hotttnesss” levels in playlists. [Plot omitted; curves per dataset (8tracks, AotM, last.fm) and feature.]

When looking at the energy feature, we see that users tend to include tracks from a comparably narrow energy spectrum with a low average energy level, even though there exist more high-energy tracks in general as shown in Figure 3. A similar phenomenon of concentration on a certain range of values can be observed for the “hotttnesss” feature. As a side aspect, we can observe that the tracks shared on AotM are on average slightly less “hottt” than those of both other platforms⁷.

We finally draw our attention to the feature distributions of the generated playlists. Figure 5 as an example shows the distributions of the energy and “hotttnesss” factors for

⁷ The results for the “hotttnesss” we report here correspond to the values at the time when we retrieved the data using the API of The Echo Nest, and not to those at the time when the playlists were created. This is not important as we do not look at the distributions independently, but compare them to the distributions in Figure 3.
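The diversity measure behind Table 4 (the number of distinct artists or genres divided by the number of tracks) can be stated in a few lines. This is a sketch under the assumption that each track maps to one artist and possibly several genre labels; the function and data layout are ours, not the paper's:

```python
def diversity(tracks, values_of):
    """Table 4-style diversity: distinct attribute values divided by
    the number of tracks. values_of maps a track to its attribute
    values (a single artist, or several genre labels)."""
    distinct = set()
    for track in tracks:
        distinct.update(values_of(track))
    return len(distinct) / len(tracks)
```

An artist diversity near 1.0 means almost every track is by a different artist; genre diversities above 1.0 are possible because a single artist can carry several genre labels, which matches the values between roughly 2 and 4 reported in Table 4.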
Figure 5: Comparison of the distribution of energy and “hotttnesss” levels for hand-crafted and generated playlists. [Plots omitted; relative frequency curves for the 1st half, 2nd half, tEN and kNN10 selections.]

                      1st half   2nd half   tEN
last.fm   artists     0.19       0.18       0
          genres      0.43       0.40       0.56
          energy      0.76       0.71       0.77
          hotttnesss  0.81       0.76       0.83
AotM      artists     0.05       0.05       0
          genres      0.24       0.22       0.50
          energy      0.75       0.74       0.75
          hotttnesss  0.83       0.82       0.85
8tracks   artists     0.02       0.01       0
          genres      0.22       0.22       0.52
          energy      0.73       0.71       0.76
          hotttnesss  0.81       0.79       0.85

                      1st half   2nd half   kNN
last.fm   artists     0.22       0.21       0.02
          genres      0.44       0.42       0.14
          energy      0.76       0.76       0.75
          hotttnesss  0.83       0.82       0.83
AotM      artists     0.05       0.05       0.03
          genres      0.22       0.21       0.13
          energy      0.75       0.74       0.73
          hotttnesss  0.83       0.82       0.84
8tracks   artists     0.02       0.01       0.03
          genres      0.22       0.22       0.17
          energy      0.74       0.73       0.74
          hotttnesss  0.82       0.80       0.84
ABSTRACT

Making session-based recommendations, i.e., recommending items solely based on the users' last interactions without having access to their long-term preference profiles, is a challenging problem in various application fields of recommender systems. Using a coarse classification scheme, the algorithmic approaches proposed for this problem in the research literature can be categorized into frequent pattern mining algorithms and approaches based on sequence modeling. Within the latter class, recent works suggest the application of recurrent neural networks (RNN). However, the lack of established algorithmic baselines for session-based recommendation problems makes the assessment of such novel approaches difficult.

In this work, we therefore compare a state-of-the-art RNN-based approach with a number of (heuristics-based) frequent pattern mining methods, both with respect to the accuracy of their recommendations and with respect to their computational complexity. The results obtained for a variety of different datasets show that in every single case a comparably simple frequent pattern method can be found that outperforms the recent RNN-based method. At the same time, these much simpler methods are also computationally less expensive and can be applied within the narrow time constraints of online recommendation.

CCS CONCEPTS

• General and reference → Evaluation; • Information systems → Recommender systems; • Computing methodologies → Neural networks; Rule learning;

KEYWORDS

Session-Based Recommendations; Deep Learning; Frequent Pattern Mining; Benchmarking

Workshop on Temporal Reasoning in Recommender Systems, collocated with ACM RecSys'17, Como, Italy. Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes.

1 INTRODUCTION

Making recommendations solely based on a user's current session and most recent interactions is a nontrivial problem for recommender systems. On an e-commerce website, for instance, when a visitor is new (or not logged in), there are no long-term user models that can be applied to determine suitable recommendations for this user. Furthermore, recent work shows that considering the user's short-term intent often has more effect on the accuracy of the recommendations than the choice of the method used to build the long-term user profiles [20]. In general, such types of problems are common on e-commerce sites, e.g., when returning users do not log in every time they use the site. The same challenges can, however, be observed for other application domains as well, in particular for news and media (music and video) recommendation [21, 33].

The problem of predicting the next actions of users based solely on their sequence of actions in the current session is referred to in the literature as session-based recommendation. A number of algorithmic approaches have been proposed over the years to deal with this problem. Early academic approaches, for example, rely on the detection of sequential patterns in the session data of a larger user community. In principle, even simpler methods can be applied. Amazon's "Customers who bought . . . also bought" feature represents an example that relies on simple co-occurrence patterns to generate recommendations, in that case in the context of the very last user interaction (an item view event). A number of later works then explored the use of Markov models [30, 35, 39], and most recently, researchers explored the use of recurrent neural networks (RNN) for the session-based next-item recommendation problem [16, 17, 38, 42].

Today, RNNs can be considered one of the state-of-the-art methods for sequence learning tasks. They have been successfully explored for various sequence-based prediction problems in the past [5, 9, 11, 18], and in a recent work, Hidasi et al. [16] investigated an RNN variant based on gated recurrent units (GRU) for the session-based recommendation problem. In their work, they benchmarked their RNN-based method gru4rec against different baseline methods on two datasets. Their results showed that gru4rec is able to outperform the baseline approaches in terms of accuracy for top-20 recommendation lists.

While these results indicate that RNNs can be successfully applied to the given recommendation task, we argue that the experimental evaluation in [16] does not fully inform us about different aspects of the effectiveness and the practicability of the proposed method. First, regarding effectiveness, it is unclear if the methods to which gru4rec was compared are competitive. Second, as the evaluation was based on a single training-test split and used only accuracy measures, further investigations are necessary to assess, for example, whether some algorithms exhibit certain biases, e.g., to recommend mostly popular items. Third, even if the RNN method is effective, questions regarding the scalability of the method should be discussed, in particular as hyper-parameter optimization for the complex networks can become very challenging in practice.

The goal of this work is to shed light on these questions. In the remainder of this paper, we report the detailed results of comparing a state-of-the-art RNN-based method with a number of computationally more efficient pattern mining approaches along different dimensions.
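The co-occurrence idea behind the "Customers who bought . . . also bought" feature mentioned above can be illustrated with a minimal sketch. The data and function names here are purely illustrative and not taken from the paper:

```python
from collections import defaultdict

def build_cooccurrence(sessions):
    """Count how often each pair of items appears in the same session."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        unique = set(session)
        for a in unique:
            for b in unique:
                if a != b:
                    counts[a][b] += 1
    return counts

def recommend(counts, last_item, n=3):
    """Rank items by how often they co-occurred with the last viewed item."""
    ranked = sorted(counts[last_item].items(), key=lambda kv: -kv[1])
    return [item for item, _ in ranked[:n]]

# Illustrative sessions (lists of item ids)
sessions = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c", "d"]]
counts = build_cooccurrence(sessions)
print(recommend(counts, "a"))  # items most often co-occurring with "a"
```

Despite ignoring all order information and long-term profiles, such counting schemes are the baseline against which the more complex session-based methods discussed in this paper must be justified.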
2 PREVIOUS WORKS

In session-based recommendation problems, we are given a sequence of the most recent actions of a user, and the goal is to find items that are relevant in the context of the user's specific short-term intent. One traditional way to determine recommendations given a set of recent items of interest is to apply frequent pattern mining techniques, e.g., based on association rules (AR) [1]. AR are often applied for market basket analysis with the goal of finding sets of items that are bought together with some probability [14]. The order of the items or actions in a session is irrelevant for AR-based approaches. Sequential pattern mining (SP) [2] techniques, in contrast, consider the order of the elements in sessions when identifying frequent patterns. In one of the earlier works, Mobasher et al. [32] used frequent pattern mining methods to predict a user's next navigation action. In another work, Yap et al. [47] propose a sequential-pattern-mining-based next-item recommendation framework, which weights the patterns according to their estimated relevance for the individual user. In the domain of music recommendation, Hariri et al. [15] more recently propose to mine sequential patterns of latent topics based on the tags attached to the tracks to predict the context of the next song.

A different way of finding item-to-item correlations is to look for sessions that are similar to the current one (neighbors) and to determine frequent item co-occurrence patterns that can be used in the prediction phase. Such neighborhood-based approaches were, for example, applied in the domains of e-commerce and music in [4] or [26]. In some cases and application domains, simple co-occurrence patterns are, despite their simplicity, quite effective; see, e.g., [20, 40] or [44].

Differently from such pattern- and co-occurrence-based techniques, a number of recent approaches are based on sequence modeling using, e.g., Markov models. The main assumption of Markov-model-based approaches in the context of session-based recommendation is that the selection of the next item in a session depends on a limited number of previous actions. Shani et al. [35] were among the first to apply first-order Markov chains (MC) to session-based recommendation and showed the superiority of sequential models over non-sequential ones. In the music domain, McFee and Lanckriet [30] proposed a music playlist generation algorithm based on MCs that – given a seed song – selects the next track from uniform and weighted distributions as well as from k-nearest-neighbor graphs. Generally, a main issue when applying Markov chains to session-based recommendation is that the state space quickly becomes unmanageable when all possible sequences of user selections should be considered [12, 16].

More recent approaches to sequence modeling for session-based recommendation utilize recurrent neural networks (RNN). RNNs process sequential data one element at a time and are able to selectively pass information across sequence steps [28]. Zhang et al. [49], for example, successfully applied RNNs to predict advertisement clicks based on the users' browsing behavior in a sponsored search scenario. For session-based recommendations, Hidasi et al. [16] investigated a customized RNN variant based on gated recurrent units (GRU) [5] to model the users' transactions within sessions. They also tested several ranking loss functions in their solution. Later on, RNN-based approaches which leverage additional item features to achieve higher accuracy were proposed in [17] and [42]. For the problem of news recommendation, Song et al. [36] proposed a temporal deep semantic structured model for the combination of long-term static and short-term temporal user preferences. They considered different levels of granularity in their model to process both fast and slow temporal changes in the users' preferences. In general, neural networks have been used for a number of recommendation-related tasks in recent years. Often, such networks are used to learn embeddings of content features in compact fixed-size latent vectors, e.g., for music, images, video data, or documents, or to represent the user [3, 6–8, 13, 25, 29, 46]. These representations are then integrated, e.g., into content-based approaches or variations of latent factor models, or are part of new methods for computing recommendations [7, 8, 13, 27, 37, 43, 45].

In the work presented in this paper, we will compare different existing and novel pattern-mining-based approaches with a state-of-the-art RNN-based algorithm.

3 EXPERIMENT CONFIGURATIONS

3.1 Algorithms

3.1.1 RNN Baseline. gru4rec is an RNN-based algorithm that uses gated recurrent units to deal with the vanishing or exploding gradient problem [16]. In our experiments, we used the Python implementation that is shared by the authors online.¹

3.1.2 Session-based kNN – knn. The knn method searches for the k most similar past sessions ("neighbors") in the training data based on the set of items in the current session. Since the process of determining the neighbor sessions becomes very time-consuming as the number of sessions increases, we use a special in-memory index data structure (cache) in our implementation. Technically, in the training phase, we create one data structure that maps the training sessions to their sets of items and one structure that maps the items to the sessions in which they appear. To make recommendations for the current session s, we first create the union of the sessions in which the items of s appear. This union is the set of possible neighbors of the current session. This is a fast operation, as it only involves a cache lookup and set operations. To further reduce the computational complexity of the prediction process, we select a subsample of these possible neighbors using a heuristic. In this work, we took the m most recent sessions, as focusing on recent trends has been shown to be effective for recommendations in e-commerce [23]. We then compute the similarity of these m most recent possible neighbors to the current session and select the k most similar sessions as the neighbor sessions of the current session. Again through lookup and set union operations, we create the set of recommendable items R that contains the items appearing in one of the k sessions. For each recommendable item i in R, we then compute the knn score as the sum of the similarity values of s and those of its neighbor sessions n ∈ N_s which contain i (Equation 1). The indicator function 1_n(i) returns 1 if n contains i and 0 otherwise; see also [4].

score_knn(i, s) = Σ_{n ∈ N_s} sim(s, n) × 1_n(i)    (1)

In our experiments, we tested different distance measures to determine the similarity of sessions. The best results were achieved when the sessions were encoded as binary vectors over the item space and cosine similarity was used. In our implementation, the set operations, similarity computations, and the final predictions can be done very efficiently, as will be discussed later in Section 4.2.2. Our algorithm has only two parameters: the number of neighbors k and the number of sampled sessions m. For the large e-commerce dataset used in [16], the best results were, for example, achieved with k = 500 and m = 1000. Note that the kNN method used in [16] is based on item-to-item similarities, while our kNN method aims to identify similar sessions.

¹ https://github.com/hidasib/GRU4Rec

Table 1: Dataset characteristics.

            RSC    TMall  #nowplaying  30Music  AotM   8tracks
Sessions    8M     4.6M   95K          170K     83K    500K
Events      32M    46M    1M           2.9M     1.2M   5.8M
Items       38K    620K   115K         450K     140K   600K
Avg. E/S    3.97   9.77   10.37        17.03    14.12  11.40
Avg. I/S    3.17   6.92   9.62         14.20    14.11  11.38
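The session-based kNN procedure described above (index lookup, recency-based subsampling, cosine similarity on binary vectors, and the scoring of Equation 1) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; in particular, using larger session ids as a proxy for recency and the toy data are assumptions:

```python
import math
from collections import defaultdict

class SessionKNN:
    """Sketch of session-based kNN: two lookup structures are built at
    training time, and scoring follows Equation 1."""

    def __init__(self, k=3, m=100):
        self.k, self.m = k, m
        self.session_items = {}                 # session id -> set of items
        self.item_sessions = defaultdict(set)   # item -> ids of sessions containing it

    def fit(self, sessions):
        for sid, items in sessions.items():
            self.session_items[sid] = set(items)
            for item in items:
                self.item_sessions[item].add(sid)

    def recommend(self, current_items):
        s = set(current_items)
        # Union of all sessions sharing at least one item with s (cache lookup).
        candidates = set()
        for item in s:
            candidates |= self.item_sessions.get(item, set())
        # Heuristic subsample: keep the m most recent candidates
        # (assumption: larger session ids are more recent).
        candidates = sorted(candidates, reverse=True)[: self.m]
        # Cosine similarity of binary item vectors reduces to this formula.
        sims = {n: len(s & self.session_items[n])
                   / math.sqrt(len(s) * len(self.session_items[n]))
                for n in candidates}
        neighbors = sorted(sims, key=sims.get, reverse=True)[: self.k]
        # Equation 1: score(i, s) = sum of sim(s, n) over neighbors n containing i.
        scores = defaultdict(float)
        for n in neighbors:
            for i in self.session_items[n] - s:
                scores[i] += sims[n]
        return sorted(scores, key=scores.get, reverse=True)

knn = SessionKNN(k=2, m=10)
knn.fit({1: ["a", "b"], 2: ["b", "c"], 3: ["a", "c", "d"]})
print(knn.recommend(["a"]))  # "b" ranks first: its session is most similar to {a}
```

Note how session 2 is never considered a candidate, since it shares no item with the current session; this is exactly the pruning effect of the item-to-sessions index described in the text.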
list lengths of 1, 2, 3, 5, 7, 10, 15, and 20. While the experiments in [16] are done without cross-validation, we additionally apply a fivefold sliding-window validation protocol as in [24] to minimize the risk that the obtained results are specific to the single train-test split. We, therefore, created five train-test splits for each dataset. For the listening logs, we used 3 months of training data and the next 5 days as the test data and randomized splits for the playlists

[Bar charts of MRR@1, MRR@2, MRR@3, MRR@5, MRR@7, MRR@10, MRR@15, and MRR@20 for the TMall, 30Music, and 8tracks datasets, comparing SR, AR, KNN, TKNN, and GRU4REC with the TOP1 and BPR loss functions (1000 hidden units for TMall, 100 for 30Music and 8tracks).]

Figure 2: MRR results for the listening log datasets.
Figure 3: MRR results for the playlist datasets.
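The MRR@k values reported in these figures follow the standard mean reciprocal rank definition: the reciprocal of the rank at which the hidden next item appears in the top-k recommendation list, averaged over all test cases. A short sketch with illustrative data (not the authors' evaluation code):

```python
def mrr_at_k(recommended, target, k):
    """Reciprocal rank of the target item within the top-k recommendations;
    0 if the target is not among them."""
    top_k = recommended[:k]
    return 1.0 / (top_k.index(target) + 1) if target in top_k else 0.0

def mean_mrr_at_k(cases, k):
    """Average MRR@k over (recommendation list, hidden next item) test cases."""
    return sum(mrr_at_k(rec, tgt, k) for rec, tgt in cases) / len(cases)

# Illustrative test cases: the target is ranked 2nd in the first list
# and missing from the second, so MRR@3 = (1/2 + 0) / 2 = 0.25.
cases = [(["x", "y", "z"], "y"), (["x", "y", "z"], "q")]
print(mean_mrr_at_k(cases, 3))  # 0.25
```

Because only the rank of the single hidden item matters, MRR@k rewards algorithms that place the correct next item near the top of the list, which is why it is the headline measure in Figures 2 and 3.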
gru4rec(100,bpr), which use 100 hidden units and the TOP1 and
logs datasets. tknn also outperforms both gru4rec configurations
significant. Similar results were also achieved for the other datasets,