MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition
Philippe Pasquier1, Jeff Ens1, Nathan Fradet1, Paul Triana1, Davide Rizzotti1, Jean-Baptiste Rolland2, Maryam Safi2
1 Metacreation Lab - Simon Fraser University, Vancouver, Canada
2 Steinberg Media Technologies GmbH, Hamburg, Germany
[email protected]
et al. 2020; Collins and Barthet 2023; Wu and Yang 2020; Liu et al. 2022; Huang and Yang 2020; Hsiao et al. 2021), as we aim for a style-agnostic system that can accommodate as many users as possible.

As shown in Table 1, most systems either support a single track or require a fixed schema of instruments. One exception is MuseNet (Payne 2019), which supports up to 10 tracks and any subset of the 10 available instruments. However, there are significant differences between MuseNet and MIDI-GPT. MuseNet uses separate NOTE ON and NOTE OFF tokens for each pitch on each track, placing inherent limitations on the number of tracks that can be represented, as the token vocabulary size cannot grow unbounded (Fradet et al. 2023a). Considering that MuseNet is currently the largest model in terms of the number of weights, the number of tracks is unlikely to be increased without altering the representation. Instead, we decouple track information from NOTE ON, NOTE DUR, and NOTE POS tokens, allowing the use of the same tokens for each track. Although this is a relatively small change, it enables us to accommodate all 128 General MIDI instruments. Furthermore, there is no inherent limit on the number of tracks, as long as the entire n-bar multi-track sequence can be encoded using fewer than 2048 tokens. Practically, this means more than 10 tracks can be generated at once, depending on their content. Note that the upper limit of 2048 tokens is not a limitation of the representation itself, but rather of the size of the model, and it could be raised with larger and more memory-intensive models. Neither MuseNet nor MIDI-GPT requires a fixed instrument schema; however, MuseNet treats instrument selections as a suggestion, while MIDI-GPT guarantees that a particular instrument will be used.

Table 1: A summary of the I/O specifications and generation tasks of recently published generative music systems (columns: System; Number of Tracks; Number of Instruments; Fixed Schema; Drums; Polyphony; Track-Level Infilling; Attribute Control).

2.2 Generation Tasks

We consider four different generation tasks: unconditional generation, continuation, infilling, and attribute control. Unconditional generation produces music from scratch. Besides changing the data that the model is trained on, the user has limited control over the output of the model. Continuation involves conditioning the model with musical material temporally preceding the music that is to be generated. Since both unconditional generation and continuation come for free with any auto-regressive model trained on a temporally ordered sequence of musical events, all systems are capable of generating musical material in this manner. Infilling conditions generation on a subset of musical material, asking the model to fill in the blanks, so to speak. Although the terms infilling and inpainting are often used interchangeably, some important distinctions must be made in our context. In contrast to inpainting a section of an image, where the exact location and number of pixels to be inpainted are defined before generation, when infilling a section of music the number of tokens to be generated is unknown. Furthermore, in the context of multi-track music that is represented using a single time-ordered sequence where tracks are interleaved, the location of the tokens to be added is unknown. This makes bar-level and track-level infilling quite complex, directly motivating the representation we describe in Section 3. With tracks ordered sequentially, the location of the tokens to be infilled is then known. Infilling can occur at different levels (i.e. note-level, bar-level, track-level). Track-level infilling is the most coarse and allows a set of n tracks to be generated that are conditioned on a set of k existing tracks, resulting in a composition with k + n tracks. Bar-level infilling allows n bars selected across one or more tracks to be re-generated, conditioned on the remaining content - past, current, and future - both on the selected track(s) and on all other tracks.

Figure 1: The Multi-Track (A) and Bar-Fill (B) tokenizations. The grey <BAR>, <TRACK> and <CONTROL> placeholders correspond to token subsequences of complete bars, complete tracks, and attribute controls, respectively.

3 Proposed Music Tokenization

In this section, we introduce two tokenizations to interpret musical compositions: the Multi-Track representation and the Bar-Fill representation. In contrast to other systems (Oore et al. 2020; Huang et al. 2019), which use NOTE ON, NOTE OFF and TIME DELTA tokens, we represent musical material using an approach previously employed for the Pop Music Transformer (Huang and Yang 2020). In our Multi-Track representation, each bar of music is represented by a sequence of tokens, which include:

• 128 NOTE ON tokens: These represent the pitch of each note in the bar.
• 96 TIME POSITION tokens: These represent the absolute start time (the time elapsed since the beginning of the bar, as opposed to the time elapsed since the last event) of each note within the bar.
• 96 DURATION tokens: These represent the duration of each note. Both the DURATION and TIME POSITION tokens range from a sixteenth-note triplet to a double whole note in sixteenth-note triplet increments.
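As a rough sketch, the note-level vocabulary described by this list can be enumerated as follows; the token strings are placeholders, not necessarily the exact identifiers used by MIDI-GPT:

```python
# Illustrative sketch of the note-level vocabulary described above.
# Token strings are placeholders, not the exact identifiers used by MIDI-GPT.

def build_note_vocabulary():
    vocab = []
    vocab += [f"NOTE_ON={p}" for p in range(128)]        # one token per MIDI pitch
    vocab += [f"TIME_POSITION={t}" for t in range(96)]   # position within the bar
    vocab += [f"DURATION={d}" for d in range(1, 97)]     # 1..96 sixteenth-note-triplet steps
    return vocab

print(len(build_note_vocabulary()))  # 128 + 96 + 96 = 320 note-level tokens
```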
We delimit a bar with BAR START and BAR END tokens. A sequence of bars makes up a track, which is delimited by TRACK START and TRACK END tokens. At the beginning of each track, one of 128 INSTRUMENT tokens specifies its MIDI program. Tokens that condition the generation of each track on various musical attributes follow the INSTRUMENT token, and will be discussed in Section 4. The tracks are then nested within a multi-track piece, which begins with a START token. Note that all tracks are played simultaneously, not sequentially. This process of nesting bars within a track and tracks within a piece is illustrated in Figure 1A. Notably, we do not use an END token, as we can simply sample until we reach the nth TRACK END token if we wish to generate n tracks. This tokenization is implemented in MidiTok (Fradet et al. 2021) for ease of use.
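A minimal sketch of this nesting, reusing the placeholder token names from the previous sketch (attribute-control tokens are covered in Section 4):

```python
# Sketch of the Multi-Track nesting: notes inside bars, bars inside tracks,
# tracks inside a piece. Token names are illustrative placeholders.

def tokenize_bar(notes):
    """notes: list of (pitch, position, duration) triples, already quantized."""
    tokens = ["BAR_START"]
    for pitch, position, duration in notes:
        tokens += [f"TIME_POSITION={position}", f"NOTE_ON={pitch}", f"DURATION={duration}"]
    return tokens + ["BAR_END"]

def tokenize_track(instrument, control_tokens, bars):
    tokens = ["TRACK_START", f"INSTRUMENT={instrument}"] + list(control_tokens)
    for bar in bars:
        tokens += tokenize_bar(bar)
    return tokens + ["TRACK_END"]

def tokenize_piece(tracks):
    tokens = ["START"]  # no END token: sampling stops after the n-th TRACK_END
    for instrument, controls, bars in tracks:
        tokens += tokenize_track(instrument, controls, bars)
    return tokens

# Two bars of piano (program 0) followed by two bars of bass (program 32):
piece = tokenize_piece([
    (0,  [], [[(60, 0, 24), (64, 24, 24)], [(67, 0, 48)]]),
    (32, [], [[(36, 0, 48)], [(43, 0, 48)]]),
])
```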
The Multi-Track representation allows the model to condition the generation of each track on the tracks that precede it, which allows for a subset of the musical material to be fixed while generating additional tracks. However, this representation doesn't provide control at the bar level, except in cases where the model is asked to complete the remaining bars of a track. In other words, the model cannot fill in bars that are in the middle of a track. To generate a specific bar in a track conditioned on the other bars, we introduce the Bar-Fill representation. In this representation, bars to be predicted are replaced by a FILL IN token. These bars are then placed/generated at the end of the piece after the last track token, and each bar is delimited by FILL START and FILL END tokens (instead of BAR START and BAR END tokens). Note that during training, the bars with FILL IN tokens appear in the same order as they appeared in the original Multi-Track representation, as shown in Figure 1B. By ordering the bars consistently, the model learns to always output tokens in the same order as the bars that are marked for generation. The Bar-Fill representation begins with a START FILL token instead of a START token. The Multi-Track representation is simply a special case of the Bar-Fill representation, where no bars are selected for infilling.
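The conversion from Multi-Track to Bar-Fill can be sketched as follows, building on the previous sketch and assuming bars are indexed in the order their BAR START tokens appear:

```python
# Sketch of converting a Multi-Track token sequence into the Bar-Fill
# representation. Bars are indexed in the order their BAR_START tokens appear.

def to_bar_fill(tokens, bars_to_fill):
    """tokens: Multi-Track sequence; bars_to_fill: set of bar indices to infill."""
    out, appended, bar_idx, i = ["START_FILL"], [], -1, 1  # replace the leading START
    while i < len(tokens):
        if tokens[i] == "BAR_START":
            bar_idx += 1
            if bar_idx in bars_to_fill:
                end = tokens.index("BAR_END", i)
                # Keep the bar content for the end of the sequence, in original order.
                appended += ["FILL_START"] + tokens[i + 1:end] + ["FILL_END"]
                out.append("FILL_IN")
                i = end + 1
                continue
        out.append(tokens[i])
        i += 1
    return out + appended  # masked bars follow the last TRACK_END token

bar_fill = to_bar_fill(piece, bars_to_fill={1})  # infill the piano's second bar
```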
3.1 Adding Interpretation Expressiveness

Multiple attempts at generating expressive symbolic music have been made, either as an independent process (Gillick et al. 2019; Cancino-Chacón and Grachten 2016; Malik and Ek 2017; Maezawa, Yamamoto, and Fujishima 2019) or as a process simultaneous with the generation of the musical content (Oore et al. 2020; Huang et al. 2019; Hawthorne et al. 2018; Huang and Yang 2020; Wu et al. 2022). None, however, allows for expressive multitrack generation. Here, we focus on velocity, as a proxy for dynamics, and microtiming, as the two main aspects of expressive music interpretation. We implement two extensions to our current tokenization allowing the simultaneous generation of expressive MIDI. This allows us to leverage the 31% of MIDI files in GigaMIDI that have been marked as expressive (varying velocity and non-quantized micro-timing).

Firstly, we include 128 VELOCITY tokens that encode every possible velocity level of a MIDI note, as velocity is a proxy for dynamics and an important aspect of expressiveness in musical performances.

Secondly, we include new tokens to represent microtiming. Our current tokenization allows for 96 different TIME POSITION tokens within a bar, and this level of quantization does not capture microtiming. Intuitively, a solution to this problem would be to increase the vocabulary and time resolution of the TIME POSITION tokens. However, to maintain the possibility of using the current downsampled and non-expressive tokenization while allowing expressiveness to be added, we introduce a new DELTA token, which encodes the time difference of the original MIDI note onset ts from the quantized token onset tk, as illustrated in Figure 2a. The DELTA tokens encode the offset in increments of 1/160th of a sixteenth-note triplet. We consider 80 additional tokens because the maximal absolute time difference is half of a sixteenth-note triplet, and we use a DELTA -1 token when this time difference is negative. This resolution requires only a small addition to the vocabulary, yet is enough to encode 99% of the expressive tracks of GigaMIDI. The use of expressive tokens is illustrated in Figure 2b.

Figure 2: Adding expressivity to the token sequence. (a) Time difference ∆t = ts − tk between the note onset ts and the quantized TIME POS value tk. (b) Token subsequence corresponding to a note (e.g. pitch 55, velocity 98), with microtiming (DELTA=-1, DELTA=|∆t|) and velocity tokens alongside the TIME POS, NOTE ON, and DURATION tokens.
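A minimal sketch of this microtiming encoding, assuming onsets are expressed in fractional sixteenth-note-triplet steps (token names are again placeholders):

```python
# Sketch of the DELTA microtiming encoding described above. One TIME_POSITION
# step is a sixteenth-note triplet; DELTA counts 1/160ths of that step.

def encode_onset(onset_in_steps):
    """onset_in_steps: expressive note onset, in fractional sixteenth-note-triplet steps."""
    k = round(onset_in_steps)                   # nearest quantized TIME_POSITION slot
    delta = round((onset_in_steps - k) * 160)   # signed offset, |delta| <= 80
    tokens = [f"TIME_POSITION={k}"]
    if delta < 0:
        tokens.append("DELTA=-1")               # flag token for negative offsets
    if delta != 0:
        tokens.append(f"DELTA={abs(delta)}")    # one of the 80 magnitude tokens
    return tokens

print(encode_onset(23.7))  # ['TIME_POSITION=24', 'DELTA=-1', 'DELTA=48']
```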
4 Controlling Music Generation

The premise behind attribute controls is that, given a musical excerpt x and a measurable musical attribute a for which we can compute a categorical or ordinal value from x (i.e. a(x)), the model will learn the conditional relationship between the tokens representing a(x) and the musical material on a track, provided these tokens precede the musical material. Practically, this is accomplished by inserting one or more CONTROL tokens, which specify the level of a particular musical attribute a(x), immediately after the INSTRUMENT token (see Figure 1) and before the tokens which specify the musical material. As a result, our approach is not limited to the specific musical attributes we discuss below, and can be applied to control any musical feature that can be measured. We employ three approaches to control musical attributes of the generated material: categorical controls, which condition generation on one of n different categories; value controls, which condition generation on one of n different ordinal values; and range controls, which condition the system to generate music wherein a particular musical attribute has values that fall within a specified range.
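As an illustration of where such controls sit in the sequence, a track prefix might combine the three control types as follows; the token names are placeholders, and the specific density and polyphony controls are described next:

```python
# Sketch of a track prefix with attribute controls inserted immediately after
# the INSTRUMENT token. Token names are illustrative placeholders.

def track_prefix(instrument, density_level, poly_low, poly_high):
    return [
        "TRACK_START",
        f"INSTRUMENT={instrument}",         # categorical control: one of 128 programs
        f"DENSITY_LEVEL={density_level}",   # value control: ordinal density level
        f"POLYPHONY_MIN={poly_low}",        # range control: lower bound
        f"POLYPHONY_MAX={poly_high}",       # range control: upper bound
        # ... BAR_START / note tokens follow ...
    ]

print(track_prefix(instrument=0, density_level=4, poly_low=1, poly_high=3))
```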
Instrument control is an example of a categorical control, as one of 128 different instrument types can be selected. We use a value control for note density; however, the density categories are determined relative to the instrument type, as average note density varies significantly between instruments. For each of the 128 General MIDI instruments, we calculate the number of note onsets for each bar in the dataset. We divide the distribution for each instrument σ into 10 regions with the range [P_{10i}(σ), P_{10(i+1)}(σ)) for 0 ≤ i < 10, where P_n(σ) denotes the n-th percentile of the distribution σ. Each region corresponds to a different note density level for a particular instrument.
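A sketch of this instrument-relative binning, assuming NumPy and using interior percentile edges (P10 through P90) to assign levels 0-9:

```python
# Sketch of the instrument-relative note-density levels: onsets-per-bar counts
# for one instrument are split into 10 percentile regions.
import numpy as np

def density_level(onsets_per_bar, bar_onset_count):
    """Map a bar's onset count to a level in 0..9, relative to its instrument."""
    edges = np.percentile(onsets_per_bar, np.arange(10, 100, 10))  # P10 .. P90
    return int(np.searchsorted(edges, bar_onset_count, side="right"))

counts = [2, 3, 3, 4, 5, 6, 6, 7, 8, 12]   # toy onsets-per-bar distribution
print(density_level(counts, 6))            # density level for a bar with 6 onsets
```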
We choose to apply range controls to note duration and polyphony. Each note duration d is quantized as ⌊log2(d)⌋. The quantization process groups note durations into 5 different bins, [1/32, 1/16), [1/16, 1/8), [1/8, 1/4), [1/4, 1/2), and [1/2, 1], which we will refer to as note duration levels. Then the 15th and 85th percentiles of a distribution containing all note duration levels within a track are used to condition generation. Polyphony levels follow a similar approach. The number of notes simultaneously sounding (i.e. the polyphony level) at each timestep is calculated (a timestep is one sixteenth-note triplet). Then we use the 15th and 85th percentiles of a distribution containing all polyphony levels within a track. For both of these controls, we use two tokens, one to specify the lower bound and another the upper bound. Admittedly, this is fuzzy range control, as strict range control would typically use the smallest and largest values in the distribution (the 0th and 100th percentiles, respectively). We elected to use the 15th and 85th percentiles in order to mitigate the effect of outliers within the distribution, decreasing the probability of exposing the model to ranges in which values are heavily skewed to one side.
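A sketch of the polyphony variant of this computation, assuming NumPy and one timestep per sixteenth-note triplet:

```python
# Sketch of the fuzzy polyphony range control: per-timestep polyphony levels are
# reduced to their 15th and 85th percentiles, which become the bound tokens.
import numpy as np

def polyphony_range_tokens(note_intervals, n_steps):
    """note_intervals: (onset_step, offset_step) pairs; one step = a sixteenth-note triplet."""
    levels = np.zeros(n_steps, dtype=int)
    for onset, offset in note_intervals:
        levels[onset:offset] += 1                        # notes sounding at each timestep
    low, high = np.percentile(levels, [15, 85])
    return [f"POLYPHONY_MIN={int(low)}", f"POLYPHONY_MAX={int(high)}"]

print(polyphony_range_tokens([(0, 8), (0, 4), (2, 6), (8, 16)], n_steps=16))
```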
5 Training MIDI-GPT

We use the new GigaMIDI dataset (Lee et al. 2024), which builds on the MetaMIDI dataset (Ens and Pasquier 2021), to train with a split of p_train = 80%, p_valid = 10%, and p_test = 10%. Our model is built on the GPT2 architecture (Radford et al. 2019), implemented using the HuggingFace Transformers library (Wolf et al. 2020). The configuration of this model includes 8 attention heads and 6 layers, utilizing an embedding size of 512 and an attention window encompassing 2048 tokens. This results in approximately 20 million parameters.
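For reference, a comparable configuration can be expressed with the Transformers library roughly as follows; the vocabulary size shown is a placeholder rather than the actual MIDI-GPT vocabulary size:

```python
# Sketch of a comparable model configuration in HuggingFace Transformers.
# vocab_size is a placeholder; set it to the tokenizer's actual vocabulary size.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=512,      # placeholder value
    n_positions=2048,    # attention window of 2048 tokens
    n_embd=512,          # embedding size
    n_layer=6,           # 6 layers
    n_head=8,            # 8 attention heads
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # roughly 20M parameters
```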
For each batch, we pick 32 random MIDI files (the batch size) from the respective split of the dataset (train, test, valid) and pick a random 4- or 8-bar multi-track segment from each MIDI file. For a segment with n tracks, we pick k tracks, randomly selecting a value for k in the range [2, min(n, 12)]. With 75% probability, we do bar infilling on a segment, where we mask up to 75% of the bars. The number of bars is selected uniformly from values in the range [0, ⌊n_tracks × n_bars × 0.75⌋]. Then, we randomly transpose the musical pitches (except for the drum track, of course) by a value in the range [−6, 5]. Each time we select an n-bar segment during training, we randomly order the tracks so that the model learns each possible conditional ordering between different types of tracks. The model is trained to predict bar, track, and instrument tokens. As a result, when generating a new track, the model can select a sensible instrument to accompany the pre-existing tracks, thus learning instrumentation.
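A sketch of this per-segment sampling and augmentation; the data structures and names are illustrative, not the authors' code:

```python
# Sketch of the per-segment sampling/augmentation described above. A track is a
# (is_drum, pitches) pair here; real segments would carry full note data.
import random

def prepare_segment(tracks, n_bars):
    k = random.randint(2, min(len(tracks), 12))          # how many tracks to keep
    chosen = random.sample(tracks, k)
    random.shuffle(chosen)                               # random conditional track ordering
    bars_to_fill = []
    if random.random() < 0.75:                           # bar infilling 75% of the time
        n_fill = random.randint(0, int(k * n_bars * 0.75))
        bars_to_fill = random.sample(range(k * n_bars), n_fill)
    shift = random.randint(-6, 5)                        # transposition in semitones
    chosen = [(is_drum, pitches if is_drum else [p + shift for p in pitches])
              for is_drum, pitches in chosen]
    return chosen, bars_to_fill

tracks = [(False, [60, 64, 67]), (True, [36, 38]), (False, [40, 43])]
print(prepare_segment(tracks, n_bars=4))
```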
We train with the Adam optimizer and a learning rate of 10^-4, without dropout. Training to convergence typically takes 2-3 days using 4 V100 GPUs. We pick the model with the best validation loss.

6 Sampling with MIDI-GPT

To achieve syntactically valid outputs from the system (with respect to the tokenization used), we incorporate specific masking constraints. More precisely, we mask select tokens during various stages of the model's inference process to preserve the sequence necessary for encoding, decoding,

ducted to evaluate the integration of a previous version of MIDI-GPT into a popular digital audio workstation. The study measured usability, user experience, and technology acceptance for two groups of experienced composers: hobbyists and professionals, with convincing results. Since we have already conducted a comprehensive user study, we do not repeat a listening study here. Instead, our experiments are designed to address other aspects that impact the usability of the system in real-world settings.

In the following, our evaluation of MIDI-GPT gauges the performance of the system by addressing the following research questions:

1. Originality: Does MIDI-GPT generate original variations or simply duplicate material from the dataset?
2. Stylistic Similarity: Does MIDI-GPT generate musical material that is stylistically similar to the dataset (i.e., well-formed music)?
3. Attribute Controls: How effective are the density level, polyphony range, and note duration range controls?

Figure 3: The percentage of generated excerpts (gi) with a Hamming distance (resp. Jaccard index J(oi, gi)) between any excerpt oi from the training dataset and gi in the range [a, b). A Hamming distance of 0 (resp. a Jaccard index of 1) indicates that two excerpts are identical, while a Hamming distance of 1 (resp. a Jaccard index of 0) indicates they are very different. (a) Hamming distance. (b) Jaccard index.