
MIDI-GPT: A Controllable Generative Model for Computer-Assisted Multitrack Music Composition

Philippe Pasquier1, Jeff Ens1, Nathan Fradet1, Paul Triana1, Davide Rizzotti1, Jean-Baptiste Rolland2, Maryam Safi2
1 Metacreation Lab, Simon Fraser University, Vancouver, Canada
2 Steinberg Media Technologies GmbH, Hamburg, Germany
[email protected]

arXiv:2501.17011v2 [cs.SD] 4 Feb 2025

Abstract

We present and release MIDI-GPT, a generative system based on the Transformer architecture that is designed for computer-assisted music composition workflows. MIDI-GPT supports the infilling of musical material at the track and bar level, and can condition generation on attributes including: instrument type, musical style, note density, polyphony level, and note duration. In order to integrate these features, we employ an alternative representation for musical material, creating a time-ordered sequence of musical events for each track and concatenating several tracks into a single sequence, rather than using a single time-ordered sequence where the musical events corresponding to different tracks are interleaved. We also propose a variation of our representation allowing for expressiveness. We present experimental results that demonstrate that MIDI-GPT is able to consistently avoid duplicating the musical material it was trained on, generate music that is stylistically similar to the training dataset, and that attribute controls allow enforcing various constraints on the generated material. We also outline several real-world applications of MIDI-GPT, including collaborations with industry partners that explore the integration and evaluation of MIDI-GPT in commercial products, as well as several artistic works produced using it.

1 Introduction

Recent research on generative music systems (Huang et al. 2019; Huang and Yang 2020; Briot, Hadjeres, and Pachet 2019; Fradet et al. 2023a,b) has mainly focused on modeling musical material as an end goal, rather than on the affordances of such systems in practical scenarios (Sturm et al. 2019). Although these works pave the way for efficient generative methods for music, their usability in real-world co-creative settings remains limited. As a result, while we have seen wide adoption of generative models for language and vision tasks, this has not occurred to the same extent for symbolic music composition. Meanwhile, artists have recently expressed concerns about the misuse of artificial intelligence in the music field (Artist Rights Alliance 2024). If we want musicians to adopt generative systems, we must work on making models controllable, able to generate content that users will appropriate as theirs, and integrated into their existing workflows.

Motivated by these considerations, we introduce MIDI-GPT, a style-agnostic generative model that builds on an alternative representation for multi-track musical material (Ens and Pasquier 2020), resulting in an expressive and steerable generative system. We outline the ongoing real-world usage of MIDI-GPT and provide quantitative evidence to support our claims that MIDI-GPT rarely duplicates the training data as the length of the generated material increases, that it generates musical material that retains the stylistic characteristics of the training data, and that attribute control methods are an effective way to steer generation.

2 Background

Considering our interest in developing a system that is well suited to practical and interactive computer-assisted composition applications, we must identify the factors that enhance the real-world usability of generative music systems. Our first design decision is to use the General MIDI format as input and output, given that it is the most widely supported symbolic music encoding standard. We consider two main categories of factors: I/O specifications, which place restrictions on the musical material that can be processed and generated by the system; and generation methods, favoring the use of existing musical content as a prompt over unconditioned generation.

2.1 Input/Output Specification

We first define a track ($t_{inst}$) as a distinct set of musical material (i.e. notes) played by an instrument ($inst$). In some cases, a track may be distinguished by its musical purpose (e.g. $t_{melody}$) rather than its instrument. For example, MusicVAE (Roberts et al. 2018) aggregates all melodies into a single track type, rather than distinguishing between melodies based on their instrumentation (piano, synth, saxophone, etc.). In what follows, $t^{m}_{inst}$, $t^{p}_{inst}$ and $t_{drum}$ denote a monophonic track, a polyphonic track, and a drum track, respectively. For example, $t^{m}_{bass}$ denotes a monophonic bass track.

A generated excerpt can be described by a list of track types (e.g. $[t^{m}_{bass}, t^{p}_{piano}]$), and thus we can define the output specification ($O^{\star}$) of an arbitrary generative music system as a set of track lists (e.g. $O^{\star} = \{[t^{m}_{bass}, t^{p}_{piano}], [t^{p}_{piano}, t^{p}_{synth}]\}$). We consider a system to have a fixed schema when $O^{\star}$ contains a single track list. For example, CoCoNet (Huang et al. 2018) has a fixed schema, as $O^{\star} = \{[t^{m}_{soprano}, t^{m}_{alto}, t^{m}_{tenor}, t^{m}_{bass}]\}$, meaning that the system is only capable of generating 4-track music containing soprano, alto, tenor and bass tracks.
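To make the $O^{\star}$ formalism concrete, the following is a minimal Python sketch of an output specification represented as a set of track lists. The TrackType class and is_fixed_schema helper are illustrative assumptions made here for exposition, not part of the released system.

```python
from dataclasses import dataclass

# Hypothetical helpers illustrating the O* formalism above; the names
# (TrackType, is_fixed_schema) are ours, not part of MIDI-GPT's codebase.

@dataclass(frozen=True)
class TrackType:
    instrument: str          # e.g. "bass", "piano", or a purpose such as "melody"
    polyphony: str = "poly"  # "mono", "poly", or "drum"

def is_fixed_schema(output_spec: set[tuple[TrackType, ...]]) -> bool:
    """A system has a fixed schema when its output specification
    contains exactly one track list."""
    return len(output_spec) == 1

# O* for a hypothetical system that can produce either of two ensembles:
flexible = {
    (TrackType("bass", "mono"), TrackType("piano", "poly")),
    (TrackType("piano", "poly"), TrackType("synth", "poly")),
}
# O* for a CoCoNet-style system: a single SATB track list.
satb = {tuple(TrackType(v, "mono") for v in ("soprano", "alto", "tenor", "bass"))}

print(is_fixed_schema(flexible))  # False
print(is_fixed_schema(satb))      # True
```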
In Table 1, we describe the following features of symbolic music generation systems: the number of tracks, the number of instruments, whether a fixed schema of instruments is assumed, support for drum tracks, and support for polyphony at the track level. Clearly, reducing the restrictions on a system's output increases its usability, as it can accommodate a greater number of practices and user workflows. We leave aside style-specific generative systems (Ren et al. 2020; Collins and Barthet 2023; Wu and Yang 2020; Liu et al. 2022; Huang and Yang 2020; Hsiao et al. 2021), as we aim for a style-agnostic system that can accommodate as many users as possible.

| System | Number of Tracks | Number of Instruments | Fixed Schema | Drums | Track-Level Polyphony | Infilling | Attribute Control |
|---|---|---|---|---|---|---|---|
| MIDI-GPT | Any | 128 | no | yes | yes | yes | yes |
| FIGARO (von Rütte et al. 2023) | 128 | 128 | no | yes | yes | no | no |
| MMT (Dong et al. 2023) | 64 | 64 | yes | yes | yes | no | no |
| MuseNet (Payne 2019) | 10 | 10 | no | yes | yes | no | yes |
| MuseGAN (Hong et al. 2019) | 4 | 4 | yes | yes | yes | no | no |
| LakhNES (Donahue et al. 2019) | 4 | 4 | yes | yes | no | no | no |
| CoCoNet (Huang et al. 2018) | 4 | 4 | yes | no | no | yes | no |
| MusicVAE (Roberts et al. 2018) | 3 | 3 | yes | yes | no | no | no |
| MusIAC (Guo et al. 2022) | 3 | 3 | yes | no | yes | yes | yes |
| SketchNet (Chen et al. 2020) | 1 | no | yes | no | no | yes | yes |
| (Pati, Lerch, and Hadjeres 2019), (Mittal et al. 2021) | 1 | no | yes | no | no | yes | no |
| (Chang, Lee, and Yang 2021), (Chi et al. 2020) | 1 | no | yes | no | yes | yes | no |
| (Tan and Herremans 2020), (Wang and Xia 2021), (Wang et al. 2020) | 1 | no | yes | no | yes | no | yes |
| (Haki et al. 2022), (Nuttall, Haki, and Jorda 2021) | 1 | no | yes | yes | yes | no | no |
| (Huang et al. 2019) | 1 | no | yes | no | yes | no | no |

Table 1: A summary of the I/O specifications and generation tasks of recently published generative music systems.

As shown in Table 1, most systems either support a single track or require a fixed schema of instruments. One exception is MuseNet (Payne 2019), which supports up to 10 tracks and any subset of the 10 available instruments. However, there are significant differences between MuseNet and MIDI-GPT. MuseNet uses separate NOTE ON and NOTE OFF tokens for each pitch on each track, placing inherent limitations on the number of tracks that can be represented, as the token vocabulary size cannot grow unbounded (Fradet et al. 2023a). Considering that MuseNet is currently the largest model in terms of the number of weights, the number of tracks is unlikely to be increased without altering the representation. Instead, we decouple track information from the NOTE ON, NOTE DUR, and NOTE POS tokens, allowing the same tokens to be used for every track. Although this is a relatively small change, it enables us to accommodate all 128 General MIDI instruments. Furthermore, there is no inherent limit on the number of tracks, as long as the entire n-bar multi-track sequence can be encoded using fewer than 2048 tokens. Practically, this means more than 10 tracks can be generated at once, depending on their content. Note that the upper limit of 2048 tokens is not a limitation of the representation itself but rather of the size of the model, and this limitation could be addressed with larger and more memory-intensive models. Neither MuseNet nor MIDI-GPT requires a fixed instrument schema; however, MuseNet treats instrument selections as a suggestion, while MIDI-GPT guarantees that a particular instrument will be used.

2.2 Generation Tasks

We consider four different generation tasks: unconditional generation, continuation, infilling, and attribute control. Unconditional generation produces music from scratch. Besides changing the data that the model is trained on, the user has limited control over the output of the model. Continuation involves conditioning the model with musical material temporally preceding the music that is to be generated. Since both unconditional generation and continuation come for free with any auto-regressive model trained on a temporally ordered sequence of musical events, all systems are capable of generating musical material in this manner. Infilling conditions generation on a subset of musical material, asking the model to fill in the blanks, so to speak. Although the terms infilling and inpainting are often used interchangeably, some important distinctions must be made in our context. In contrast to inpainting a section of an image, where the exact location and number of pixels to be inpainted are defined before generation, when infilling a section of music the number of tokens to be generated is unknown. Furthermore, in the context of multi-track music that is represented using a single time-ordered sequence where tracks are interleaved, the location of the tokens to be added is unknown. This makes bar-level and track-level infilling quite complex, directly motivating the representation we describe in Section 3. With tracks ordered sequentially, the location of the tokens to be infilled is then known. Infilling can occur at different levels (i.e. note-level, bar-level, track-level). Track-level infilling is the most coarse and allows a set of n tracks to be generated conditioned on a set of k existing tracks, resulting in a composition with k + n tracks. Bar-level infilling allows n bars selected across one or more tracks to be re-generated, conditioned on the remaining content (past, current, and future) both on the selected track(s) and on all other tracks.

3 Proposed Music Tokenization

In this section, we introduce two tokenizations to interpret musical compositions: the Multi-Track representation and the Bar-Fill representation. In contrast to other systems (Oore et al. 2020; Huang et al. 2019), which use NOTE ON, NOTE OFF and TIME DELTA tokens, we represent musical material using an approach that was previously employed for the Pop Music Transformer (Huang and Yang 2020). In our Multi-Track representation, each bar of music is represented by a sequence of tokens, which include:

- 128 NOTE ON tokens: these represent the pitch of each note in the bar.
- 96 TIME POSITION tokens: these represent the absolute start time (the time elapsed since the beginning of the bar, as opposed to the time elapsed since the last event) of each note within the bar.

- 96 DURATION tokens: these represent the duration of each note. Both the DURATION and TIME POSITION tokens range from a sixteenth-note triplet to a double whole note, in sixteenth-note triplet increments.

We delimit a bar with BAR START and BAR END tokens. A sequence of bars makes up a track, which is delimited by TRACK START and TRACK END tokens. At the beginning of each track, one of 128 INSTRUMENT tokens specifies its MIDI program. Tokens that condition the generation of each track on various musical attributes follow the INSTRUMENT token, and will be discussed in Section 4. The tracks are then nested within a multi-track piece, which begins with a START token. Note that all tracks are played simultaneously, not sequentially. This process of nesting bars within a track and tracks within a piece is illustrated in Figure 1A. Notably, we do not use an END token, as we can simply sample until we reach the nth TRACK END token if we wish to generate n tracks. This tokenization is implemented in MidiTok (Fradet et al. 2021) for ease of use.

[Figure 1: The Multitrack (A) and Bar-Fill (B) tokenizations. The grey <BAR>, <TRACK> and <CONTROL> placeholders correspond to token subsequences of complete bars, complete tracks, and attribute controls, respectively.]
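As a rough illustration of this nesting, the following Python sketch serializes one track of one bar in a Multi-Track style. The token spellings and the serialize_track helper are our own simplification of the scheme described above, not the exact vocabulary or code of the released model.

```python
# Illustrative sketch of a Multi-Track-style serialization for one track.
# Token spellings and this helper are a simplification of the scheme described
# above, not the exact vocabulary used by the released MIDI-GPT model.

def serialize_track(program: int, bars: list[list[tuple[int, int, int]]]) -> list[str]:
    """Each bar is a list of (pitch, onset, duration) triples, with onset and
    duration expressed in sixteenth-note-triplet steps (96 per 4/4 bar)."""
    tokens = ["TRACK_START", f"PROGRAM={program}"]
    for bar in bars:
        tokens.append("BAR_START")
        for pitch, onset, duration in sorted(bar, key=lambda n: n[1]):
            tokens += [f"TIME_POS={onset}", f"NOTE_ON={pitch}", f"DURATION={duration}"]
        tokens.append("BAR_END")
    tokens.append("TRACK_END")
    return tokens

# A piece is then "START" followed by the concatenated tracks:
piece = ["START"] + serialize_track(30, [[(55, 0, 4), (58, 2, 4), (60, 4, 2)]])
print(piece)
```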
The Multi-Track representation allows the model to condition the generation of each track on the tracks that precede it, so that a subset of the musical material can be kept fixed while generating additional tracks. However, this representation doesn't provide control at the bar level, except in cases where the model is asked to complete the remaining bars of a track. In other words, the model cannot fill in bars that are in the middle of a track. To generate a specific bar in a track conditioned on the other bars, we introduce the Bar-Fill representation. In this representation, bars to be predicted are replaced by a FILL IN token. These bars are then placed/generated at the end of the piece, after the last track token, and each bar is delimited by FILL START and FILL END tokens (instead of BAR START and BAR END tokens).

Note that during training, the bars with FILL IN tokens appear in the same order as they appeared in the original Multi-Track representation, shown in Figure 1B. By ordering the bars consistently, the model learns to always output tokens in the same order as the bars that are marked for generation. The Bar-Fill representation begins with a START FILL token instead of a START token. The Multi-Track representation is simply a special case of the Bar-Fill representation, where no bars are selected for infilling.
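The sketch below shows one possible way to rewrite a Multi-Track token list into the Bar-Fill form described above, reusing the simplified piece list from the previous example. The bar-selection interface (indices over a flat list of bars) is an assumption made for illustration only.

```python
# Rough sketch of the Multi-Track -> Bar-Fill rewrite described above, operating on
# the simplified token lists from the previous example. The bar-selection interface
# (indices over a flat list of bars) is an assumption made for illustration.

def to_bar_fill(piece: list[str], bars_to_fill: set[int]) -> list[str]:
    """Replace the selected bars with FILL_IN and append their contents at the end,
    delimited by FILL_START / FILL_END, preserving the original bar order."""
    out, extracted, current_bar, bar_idx, inside = ["START_FILL"], [], [], -1, False
    for tok in piece[1:]:                      # skip the original START token
        if tok == "BAR_START":
            inside, bar_idx, current_bar = True, bar_idx + 1, [tok]
        elif tok == "BAR_END" and inside:
            current_bar.append(tok)
            if bar_idx in bars_to_fill:
                out.append("FILL_IN")
                extracted.append(current_bar)
            else:
                out.extend(current_bar)
            inside = False
        elif inside:
            current_bar.append(tok)
        else:
            out.append(tok)
    for bar in extracted:                      # masked bars, in their original order
        out += ["FILL_START"] + bar[1:-1] + ["FILL_END"]
    return out

print(to_bar_fill(piece, {0}))  # `piece` from the previous sketch
```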
3.1 Adding Interpretation Expressiveness

Multiple attempts at generating expressive symbolic music have been made, either as an independent process (Gillick et al. 2019; Cancino-Chacón and Grachten 2016; Malik and Ek 2017; Maezawa, Yamamoto, and Fujishima 2019) or as a process simultaneous with the generation of the musical content (Oore et al. 2020; Huang et al. 2019; Hawthorne et al. 2018; Huang and Yang 2020; Wu et al. 2022). None, however, allow for expressive multitrack generation. Here, we focus on velocity, as a proxy for dynamics, and on microtiming, as the two main aspects of expressive music interpretation. We implement two extensions to our tokenization that allow the simultaneous generation of expressive MIDI. This allows us to leverage the 31% of MIDI files in GigaMIDI that have been marked as expressive (varying velocity and non-quantized microtiming).

Firstly, we include 128 VELOCITY tokens that encode every possible velocity level of a MIDI note, velocity being a proxy for dynamics and an important aspect of expressiveness in musical performances.
Secondly, we include new tokens to represent microtiming. Our current tokenization allows for 96 different TIME POSITION tokens within a bar; this level of quantization therefore does not capture microtiming. Intuitively, a solution to this problem would be to increase the vocabulary and time resolution of the TIME POSITION tokens. However, to maintain the possibility of using the current downsampled, non-expressive tokenization while still allowing expressiveness to be added, we introduce a new DELTA token which encodes the time difference of the original MIDI note onset $t_s$ from the quantized token onset $t_k$, as illustrated in Figure 2a. The DELTA tokens encode the offset in increments of 1/160th of a sixteenth-note triplet. We consider 80 additional tokens, because the maximal absolute time difference is half of a sixteenth-note triplet, and we use a DELTA -1 token when this time difference is negative. This keeps the addition to the vocabulary small, yet is enough to encode 99% of the expressive tracks of GigaMIDI. The use of expressive tokens is illustrated in Figure 2b.

[Figure 2: Adding expressivity to the token sequence. (a) Time difference $\Delta t = t_s - t_k$ between the note onset and the quantized TIME POS value. (b) Token subsequence corresponding to a note, with microtiming and velocity.]
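The following sketch shows how the microtiming encoding described above could be computed from a raw onset, under the stated assumptions of 96 TIME POSITION steps per 4/4 bar and a DELTA resolution of 1/160th of a step. The token spellings (including the sign token) are illustrative, not the released vocabulary.

```python
# Minimal sketch of the microtiming encoding described above, under the stated
# assumptions: 96 TIME POSITION steps per 4/4 bar (one sixteenth-note triplet each)
# and DELTA expressed in 1/160ths of that step. Token spellings are illustrative.

def microtiming_tokens(onset_fraction_of_bar: float) -> list[str]:
    """onset_fraction_of_bar: note onset within the bar, in [0, 1)."""
    steps_per_bar, delta_res = 96, 160
    exact = onset_fraction_of_bar * steps_per_bar        # onset in (fractional) steps
    position = round(exact)                              # nearest quantized step t_k
    delta = round((exact - position) * delta_res)        # signed offset from t_k
    tokens = [f"TIME_POS={position}"]
    if delta < 0:
        tokens.append("DELTA_SIGN=-1")                   # negative offsets get a sign token
    if delta != 0:
        tokens.append(f"DELTA={abs(delta)}")             # |delta| <= 80 by construction
    return tokens

print(microtiming_tokens(0.2605))  # e.g. slightly late relative to step 25
```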
4 Controlling Music Generation

The premise behind attribute controls is that, given a musical excerpt x and a measurable musical attribute a for which we can compute a categorical or ordinal value from x (i.e. a(x)), the model will learn the conditional relationship between tokens representing a(x) and the musical material on a track, provided these tokens precede the musical material. Practically, this is accomplished by inserting one or more CONTROL tokens, which specify the level of a particular musical attribute a(x), immediately after the INSTRUMENT token (see Figure 1) and before the tokens which specify the musical material. As a result, our approach is not limited to the specific musical attributes we discuss below, and can be applied to control any musical feature that can be measured. We employ three approaches to control musical attributes of the generated material: categorical controls, which condition generation on one of n different categories; value controls, which condition generation on one of n different ordinal values; and range controls, which condition the system to generate music wherein a particular musical attribute has values that fall within a specified range.

Instrument control is an example of a categorical control, as one of 128 different instrument types can be selected. We use a value control for note density; however, the density categories are determined relative to the instrument type, as average note density varies significantly between instruments. For each of the 128 General MIDI instruments, we calculate the number of note onsets for each bar in the dataset. We divide the distribution for each instrument $\sigma$ into 10 regions with the ranges $[P_{10i}(\sigma), P_{10(i+1)}(\sigma))$ for $0 \le i < 10$, where $P_{n}(\sigma)$ denotes the nth percentile of the distribution $\sigma$. Each region corresponds to a different note density level for a particular instrument.

We choose to apply range controls to note duration and polyphony. Each note duration d is quantized as $\lfloor \log_2(d) \rfloor$. The quantization process groups note durations into 5 different bins, $[\frac{1}{32}, \frac{1}{16})$, $[\frac{1}{16}, \frac{1}{8})$, $[\frac{1}{8}, \frac{1}{4})$, $[\frac{1}{4}, \frac{1}{2})$, and $[\frac{1}{2}, \frac{1}{1})$, which we will refer to as note duration levels. Then the 15th and 85th percentiles of a distribution containing all note duration levels within a track are used to condition generation. Polyphony levels follow a similar approach. The number of notes simultaneously sounding (i.e. the polyphony level) at each timestep is calculated (a timestep is one sixteenth-note triplet). Then we use the 15th and 85th percentiles of a distribution containing all polyphony levels within a track. For both of these controls, we use two tokens, one to specify the lower bound and another for the upper bound. Admittedly, this is fuzzy range control, as strict range control would typically use the smallest and largest values in the distribution (the 0th and 100th percentiles, respectively). We elected to use the 15th and 85th percentiles in order to mitigate the effect of outliers within the distribution, decreasing the probability of exposing the model to ranges in which values are heavily skewed to one side of the range.
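A minimal sketch of how such control values could be derived from a track is shown below, assuming a grid of sixteenth-note-triplet timesteps and using NumPy percentiles; it mirrors the description in the text rather than the authors' exact implementation, and the helper names are our own.

```python
import numpy as np

# Sketch of how the range-control bounds and density levels described above could be
# derived from a track; this follows the description in the text, not the exact code.

def polyphony_range(notes: list[tuple[int, int, int]], n_steps: int) -> tuple[int, int]:
    """notes: (pitch, onset_step, duration_steps). Returns the 15th/85th percentiles
    of the per-timestep count of simultaneously sounding notes."""
    sounding = np.zeros(n_steps, dtype=int)
    for _, onset, dur in notes:
        sounding[onset:onset + dur] += 1
    lo, hi = np.percentile(sounding, [15, 85])
    return int(lo), int(hi)

def note_density_level(onsets_per_bar: float, decile_edges: np.ndarray) -> int:
    """Map a track's onsets-per-bar to one of 10 instrument-specific levels,
    given precomputed decile edges P_10, ..., P_90 for that instrument."""
    return int(np.searchsorted(decile_edges, onsets_per_bar, side="right"))

# Toy track spanning 12 timesteps, purely for illustration.
track = [(60, 0, 8), (64, 0, 8), (67, 4, 4), (72, 8, 4)]
print(polyphony_range(track, n_steps=12))
```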
5 Training MIDI-GPT

We use the new GigaMIDI dataset (Lee et al. 2024), which builds on the MetaMIDI dataset (Ens and Pasquier 2021), to train with a split of p_train = 80%, p_valid = 10%, and p_test = 10%. Our model is built on the GPT-2 architecture (Radford et al. 2019), implemented using the HuggingFace Transformers library (Wolf et al. 2020). The configuration of this model includes 8 attention heads and 6 layers, with an embedding size of 512 and an attention window encompassing 2048 tokens. This results in approximately 20 million parameters.

For each batch, we pick 32 random MIDI files (the batch size) from the respective split of the dataset (train, valid, test) and pick a random 4- or 8-bar multi-track segment from each MIDI file. For a segment with n tracks, we pick k tracks, randomly selecting a value for k in the range [2, min(n, 12)]. With 75% probability, we do bar infilling on a segment, where we mask up to 75% of the bars; the number of bars is selected uniformly from the range $[0, \lfloor n_{tracks} \times n_{bars} \times 0.75 \rfloor]$. Then, we randomly transpose the musical pitches (except for the drum track, of course) by a value in the range [-6, 5]. Each time we select an n-bar segment during training, we randomly order the tracks so that the model learns each possible conditional ordering between different types of tracks. The model is trained to predict bar, track, and instrument tokens. As a result, when generating a new track, the model can select a sensible instrument to accompany the pre-existing tracks, thus learning instrumentation.

We train with the Adam optimizer and a learning rate of $10^{-4}$, without dropout. Training to convergence typically takes 2-3 days using 4 V100 GPUs. We pick the model with the best validation loss.
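The per-segment sampling and augmentation steps described above can be sketched as follows. The segment structure (a list of tracks, each holding bars of note triples) and the helper names are assumptions made for illustration; they do not mirror the released training code.

```python
import random

# Sketch of the per-segment sampling/augmentation procedure described above.
# The segment structure and helper names are assumptions for illustration only.

def sample_training_segment(tracks: list[dict], rng: random.Random) -> list[dict]:
    """tracks: [{"is_drum": bool, "bars": [[(pitch, onset, dur), ...], ...]}, ...]"""
    k = rng.randint(2, min(len(tracks), 12))              # keep k of the n tracks
    segment = rng.sample(tracks, k)
    rng.shuffle(segment)                                   # random conditional ordering

    semitones = rng.randint(-6, 5)                         # random transposition
    for track in segment:
        if not track["is_drum"]:                           # never transpose drums
            track["bars"] = [[(p + semitones, o, d) for p, o, d in bar]
                             for bar in track["bars"]]

    if rng.random() < 0.75:                                # 75%: train on bar infilling
        n_bars = sum(len(t["bars"]) for t in segment)
        n_masked = rng.randint(0, int(n_bars * 0.75))      # mask up to 75% of the bars
        # ... the masked bars would then be moved to the Bar-Fill suffix (Section 3)
    return segment

toy = [{"is_drum": False, "bars": [[(60, 0, 4)]]}, {"is_drum": True, "bars": [[(36, 0, 4)]]}]
print(sample_training_segment(toy, random.Random(0)))
```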
6 Sampling with MIDI-GPT

To achieve syntactically valid outputs from the system (with respect to the tokenization used), we incorporate specific masking constraints. More precisely, we mask select tokens during various stages of the model's inference process to preserve the sequence structure necessary for encoding, decoding, and prediction tasks. For example, a scenario in which a BAR END token appears immediately after a NOTE ON token is not feasible, given that the DURATION and NOTE POSITION have yet to be determined. In such instances, we mask the BAR END token to prevent its sampling. Subsequently, we sample among the remaining unmasked tokens. This rule-based sampling approach ensures that the model maintains a logical structure throughout its operations.
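A compact sketch of this kind of rule-based masked sampling is given below. The single validity rule encodes only the example given in the text (no BAR END directly after NOTE ON); the real system applies a fuller rule set, and the sample_next helper is illustrative.

```python
import numpy as np

# Sketch of rule-based masked sampling: invalid next tokens get probability zero
# before sampling. The validity rule below encodes only the single example from the
# text (no BAR_END directly after NOTE_ON); the real system applies a fuller set.

def sample_next(logits: np.ndarray, vocab: list[str], prev_token: str,
                rng: np.random.Generator, temperature: float = 1.0) -> str:
    mask = np.zeros_like(logits, dtype=bool)
    if prev_token.startswith("NOTE_ON"):                   # duration/position still pending
        mask |= np.array([t == "BAR_END" for t in vocab])
    scores = logits / temperature
    scores[mask] = -np.inf                                  # masked tokens cannot be sampled
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

vocab = ["NOTE_ON=60", "DURATION=4", "TIME_POS=0", "BAR_END"]
rng = np.random.default_rng(0)
print(sample_next(np.array([0.5, 1.0, 0.2, 2.0]), vocab, "NOTE_ON=60", rng))
```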
7 Release, Evaluation, and Applications

MIDI-GPT has been released [1] and is seeing real-world usage in several contexts, which directly supports our assertion that MIDI-GPT is a practical model for computer-assisted composition. There are ongoing collaborations for the integration of MIDI-GPT into synthesizers, such as the OP-Z by Teenage Engineering [2], and into game music composition software by Elias. MIDI-GPT has been integrated into the Cubase [3] digital audio workstation and the Calliope [4] web application, and an Ableton [5] plugin has been developed. MIDI-GPT has been used to compose music [6], including two entries to the 2022 and 2023 AI Song Contests and four albums (two by Philip Tremble, one by Monobor, and one by a selection of American and Canadian composers). It has been used to compose adaptive music for games (Plut et al. 2022), and a yearly artistic residency [7] with French composers is ongoing.

[1] https://www.metacreation.net/projects/mmm (links to models and various examples of generations)
[2] https://www.metacreation.net/projects/opz-mmm
[3] www.metacreation.net/projects/mmm-cubase
[4] https://www.metacreation.net/projects/calliope
[5] https://www.metacreation.net/projects/mmm4live
[6] https://www.metacreation.net/projects/mmm-music
[7] https://vancouver.consulfrance.org/Artificial-Muse-residency-Vancouver

A user study (Bougueng Tchemeube et al. 2023) was conducted to evaluate the integration of a previous version of MIDI-GPT into a popular digital audio workstation. The study measured usability, user experience, and technology acceptance for two groups of experienced composers, hobbyists and professionals, with convincing results. Since we have already conducted a comprehensive user study, we do not repeat a listening study here. Instead, our experiments are designed to address other aspects that impact the usability of the system in real-world settings.

In the following, our evaluation of MIDI-GPT gauges the performance of the system by addressing the following research questions:
1. Originality: Does MIDI-GPT generate original variations or simply duplicate material from the dataset?
2. Stylistic Similarity: Does MIDI-GPT generate musical material that is stylistically similar to the dataset (i.e., well-formed music)?
3. Attribute Controls: How effective are the density level, polyphony range, and note duration range controls?
7.1 Evaluating the Originality of Generated Material

Intra-Dataset Originality. It is increasingly important to quantify the frequency with which a generative system produces musical material that is nearly identical to the training dataset, given the potential legal issues that may arise when these systems are deployed in the real world, and the difficulty of guaranteeing that a generative system does not engage in this type of behavior (Papadopoulos, Roy, and Pachet 2014). To accomplish this, we represent the musical material in each track as a piano roll and use the Hamming distance to calculate the distance between two piano rolls. A piano roll is a T x 128 boolean matrix specifying when particular pitches are sounding, where T is the number of time-steps. Note that when we calculate the Hamming distance between two piano rolls, we normalize the distance by the size of the piano roll. Therefore, the distance between maximally different piano rolls would be 1.

Since the dataset contains hundreds of thousands of unique MIDI files, we are faced with a time complexity issue, and must employ some heuristics to speed this process up. First, rather than searching for nearly identical n-bar piano rolls, we search using single-bar piano rolls and aggregate the results of n search processes. Note that this means that if n - 1 of the bars have a match in the dataset, but one of the bars does not, the n-bar excerpt will not be considered to have a match in the dataset. However, since we are interested in identifying nearly identical matches, this is unlikely to cause much of an issue. To filter out highly dissimilar candidate matches efficiently, we compute the Hamming distance between compressed piano rolls first. Given a 48 x 128 piano roll x that represents a single 4/4 bar of musical material, we discard notes outside the range [21, 109) and take the maximum value over each consecutive set of 6 time-steps (equivalent to one 1/8 note) on the first axis, producing an 8 x 88 compressed matrix. We calculate the Hamming distance between the compressed piano rolls, discarding any candidate matches that have a distance greater than 0.25, and then compute the Hamming distance on the full-sized piano rolls for the remaining candidate matches. Even with these optimizations, the search is executed in parallel on a 32-core machine and takes an average of 83 seconds to complete a search for a single 4-bar excerpt. In the worst case, it can take up to an hour for a single query. Although we would have preferred to use the Jaccard index rather than the Hamming distance, as we do in the next sub-section, the nature of the heuristics employed to increase computation speed prohibited this option.
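The piano-roll comparisons described above (normalized Hamming distance, the 8 x 88 compression used for fast pre-filtering, and the Jaccard index used later for the infilling experiment) can be sketched as follows; this follows the text, not the authors' exact implementation.

```python
import numpy as np

# Sketch of the piano-roll comparisons described above: normalized Hamming distance,
# the 8x88 compression used for pre-filtering, and the Jaccard index used in the
# infilling experiment. Follows the text, not the authors' exact implementation.

def hamming(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized Hamming distance between two boolean piano rolls of equal shape."""
    return float(np.mean(a != b))

def compress(roll: np.ndarray) -> np.ndarray:
    """48x128 single-bar roll -> 8x88: keep pitches [21, 109), max-pool every 6 steps."""
    clipped = roll[:, 21:109]
    return clipped.reshape(8, 6, 88).max(axis=1)

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard index of the 'on' cells of two boolean piano rolls (1.0 = identical)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

bar_a = np.zeros((48, 128), dtype=bool); bar_a[0:6, 60] = True
bar_b = np.zeros((48, 128), dtype=bool); bar_b[0:6, 62] = True
print(hamming(bar_a, bar_b), hamming(compress(bar_a), compress(bar_b)), jaccard(bar_a, bar_b))
```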
We compute 100 trials where we randomly select a 4-track, 8-bar musical segment from the test split of the dataset, blank out n consecutive bars on a single track, and generate (i.e. infill) a new set of bars (g_i). Given a Hamming distance threshold, we determine whether g_i is nearly identical to any n-bar excerpt in the training split of the dataset, using the method described above. In Figure 3a we present the percentage of trials for which the Hamming distance between any excerpt in the training dataset and g_i is in the specified range. Unsurprisingly, as the number of bars increases, the percentage of instances where MIDI-GPT duplicates the training data decreases significantly. This correlation was expected, as shorter generations are more constrained by the surrounding musical content: there are not that many one-track, one-bar musical excerpts that are not already in the dataset.

[Figure 3: The percentage of generated excerpts (g_i) with a Hamming distance (resp. Jaccard index J(o_i, g_i)) to any excerpt o_i from the training dataset in the range [a, b): (a) Hamming distance; (b) Jaccard index. A Hamming distance of 0 (resp. a Jaccard index of 1) indicates that two excerpts are identical, while 1 (resp. 0) indicates that they are very different.]

Infilling Originality. We also measure data reproduction for the infilling task, which may occur when the model predicts the exact segment that a user wants to infill, resulting in no change and inevitable frustration for the user. To measure the frequency with which this occurs, we randomly select a 4-track, 8-bar musical segment from the test split of the dataset, blank out n consecutive bars on a single track (o_i), and generate a new set of n bars (g_i) to replace o_i. Then, we measure the Jaccard index between piano-roll representations of o_i and g_i. We repeat this process 250 times for each number of bars (n = 1, 2, 4, 8) and report the results in Figure 3b. On the whole, as the number of bars increases, the frequency with which the original material is duplicated decreases. Taken collectively, the results in this section indicate that MIDI-GPT can reliably produce original variations when generating 4 or more bars. In practice, when deployed in products, it always does, as we test that the generated material is different from what is being replaced, and regenerate otherwise. We also re-generate tracks or bar infillings that result in silence, which the user never intends.

7.2 Quantifying Stylistic Similarity

It is also important that the variations generated by the system are stylistically similar to the dataset. To be clear, we define musical style as the stylistic characteristics delineated by a set of musical data. As a result, when we claim to measure stylistic similarity to the training data, we are measuring similarity to the style that is delineated by this set of data. Consequently, we avoid having to make subjective decisions about what constitutes a particular musical style, while maintaining an evaluation framework that is generic enough to handle any arbitrary set of data.

We use StyleRank (Ens and Pasquier 2019) to measure the stylistic similarity of generated material. StyleRank is designed to measure the similarity of two or more groups of musical excerpts ($G_1, \dots, G_k$) relative to a style delineated by a collection of ground-truth musical excerpts ($C$). Each musical excerpt is represented using a set of features, described in detail in the original paper, and a Random Forest classifier is trained to discriminate between $G_1, \dots, G_k$ and $C$. Using an embedding space constructed from the trained Random Forest classifier, the average similarity between $G_i$ and $C$ can be computed for each i. In what follows, let $S^{C}_{G_1,\dots,G_k}(a, b)$ denote the median similarity between a and b, calculated using a StyleRank instance trained on $G_1, \dots, G_k$ and $C$.

For this experiment, we use the same musical excerpts as in Section 7.1 ($O = \{o_1, \dots, o_{250}\}$, $G = \{g_1, \dots, g_{250}\}$); however, we remove each pair $(o_i, g_i)$ where $J(o_i, g_i) \ge 0.75$, producing Ô and Ĝ. This ensures that we do not bias our measurements by including generated material that is nearly identical to the original preexisting material (o_i) from the dataset. We also assemble a set of 1000 n-bar segments ($C$) from the dataset. For each trial, we compute whether $S^{C^{\star}_{50}}_{\hat{O}^{\star}_{25},\hat{G}^{\star}_{25}}(\hat{O}^{\star}_{25}, C^{\star}_{50}) \le S^{C^{\star}_{50}}_{\hat{O}^{\star}_{25},\hat{G}^{\star}_{25}}(\hat{G}^{\star}_{25}, C^{\star}_{50})$, where $X^{\star}_{n}$ denotes a subset of X containing n elements, selected randomly for each trial. In other words, for each trial, we determine whether the median similarity between two subsets of the corpus is less than or equal to the median similarity between a subset of the corpus and a set of generated excerpts. We collect the results for 100 trials and compute a binomial test. If there is no significant difference between the count of trials for which the condition is true and the count of trials for which the condition is false, we can conclude that there is not a significant difference between the generated material and the corpus with respect to the similarity metric we are using. We report the results of this test using different numbers of bars and temperatures in Figure 4.

[Figure 4: The percentage of trials where $S^{C^{\star}_{50}}_{\hat{O}^{\star}_{25},\hat{G}^{\star}_{25}}(\hat{O}^{\star}_{25}, C^{\star}_{50}) \le S^{C^{\star}_{50}}_{\hat{O}^{\star}_{25},\hat{G}^{\star}_{25}}(\hat{G}^{\star}_{25}, C^{\star}_{50})$. Hatching indicates that the binomial test was not significant, i.e. that Ô* is not more similar to C than Ĝ*.]
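The trial-counting and significance check described above can be sketched with SciPy's binomial test; the trial outcomes below are synthetic placeholders, not experimental data.

```python
from scipy.stats import binomtest

# Sketch of the significance check described above: count the trials where the
# corpus-vs-corpus similarity is <= the corpus-vs-generated similarity, then test
# whether that count departs from the 50/50 split expected under equivalence.
# The trial outcomes below are synthetic placeholders.

outcomes = [True] * 47 + [False] * 53         # e.g. 47 of 100 trials satisfy the condition
result = binomtest(k=sum(outcomes), n=len(outcomes), p=0.5)
print(result.pvalue)                          # large p-value -> no significant difference
```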
The results indicate that when generating with a temperature of 1.0, infilled generations are equivalent to the original preexisting material in terms of musical style (as quantified by StyleRank). Our results also show that when the temperature is greater than 1.0, the generated material is more frequently considered less similar to the dataset (C) than Ô, an effect that increases along with the number of generated bars. This demonstrates that our measurement instrument is sensitive enough to detect the small increases in the entropy of the music generated by the model that are the byproduct of slightly changing the temperature.

This result not only serves as an analytical evaluation of the model (human evaluations are covered elsewhere (Bougueng Tchemeube et al. 2023)); future work is to replicate these experiments when conditioning on a set of files representing a given musical style. MIDI-GPT includes a categorical style control based on the style metadata extracted from the MetaMIDI dataset (a subset of GigaMIDI) and aligned with the MusicMap (Crauwels 2022) ontology (Disco, Rock, Jazz, ...).

7.3 Evaluating the Effectiveness of Attribute Controls

MIDI-GPT allows the user to condition generation not only on the existing musical content at the time of generation, but also on various control attributes such as the instrument, musical style, note density, polyphony level, and note duration. To evaluate how effective these control mechanisms are, we focus on the last three for brevity. We conduct 100 trials where we generate 8-bar segments from scratch using a particular attribute control method, and measure the difference between the anticipated outcome and the actual outcome. For note density control, we measure the absolute difference between the density level the generation was conditioned on and the density level of the generated material. For polyphony level and note duration, we compute the distribution of values (either polyphony level or note duration level) from the generated material, and count the percentage of values that fall within the specified range. If attribute control is successful, we would expect at least 70% of the values to be within this range, as we used the 15th and 85th percentiles during training.

The results for note density, shown in Figure 5, demonstrate that the absolute difference between the anticipated and actual density level is most often 0, and rarely exceeds 1. This indicates that this control method is effective.

[Figure 5: The percentage of trials for each absolute difference between the requested and actual note density.]

The results for note duration, shown in Figure 6a, demonstrate that this attribute control method is quite effective, as the median outcome (in terms of the percentage of note durations within the specified range) is at or above 70% in all cases except for $[\frac{1}{4}, \frac{1}{2})$, $[\frac{1}{4}, \frac{1}{1})$ and $[\frac{1}{2}, \frac{1}{1})$. In contrast, the polyphony level control is less effective, with the median outcome lying below the 70% threshold in many cases. Calculating the polyphony level at a single time-step is inherently more difficult than note duration, as the former requires knowledge of where multiple notes start and end, while the latter only requires knowing where one note starts and ends. This difference in difficulty seems to be reflected in the results, as MIDI-GPT is better at controlling note duration than polyphony.

[Figure 6: The percentage of note durations (a) and polyphony levels (b) within the specified range over 100 trials.]

In addition to these "soft controls", we can also implement hard controls through rule-based sampling. This involves some bookkeeping. For example, for practical reasons, we also provide a hard polyphony limit $l_{poly}$. In places where it would otherwise be valid for a note to be inserted in the token sequence, we first check that the size of the set of currently sounding pitches $n_{pitch}$ satisfies $n_{pitch} < l_{poly}$. If $n_{pitch} = l_{poly}$, we mask note tokens.
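In code, the hard polyphony limit amounts to a small bookkeeping check before new note tokens are allowed; the sketch below is illustrative and simplifies the token handling.

```python
# Sketch of the hard polyphony limit described above: track the currently sounding
# pitches while decoding and disallow new NOTE_ON tokens once the limit is reached.
# The helper is illustrative and simplifies the token handling.

def note_on_allowed(sounding_pitches: set[int], hard_limit: int) -> bool:
    """NOTE_ON tokens are masked whenever the polyphony limit is already reached."""
    return len(sounding_pitches) < hard_limit

sounding = {60, 64, 67}
print(note_on_allowed(sounding, hard_limit=4))  # True: one more note may start
print(note_on_allowed(sounding, hard_limit=3))  # False: mask all NOTE_ON tokens
```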
8 Conclusion

We present MIDI-GPT, a style-agnostic generative system released as an Open RAIL-M licensed MMM model (Ens and Pasquier 2020). MIDI-GPT builds on an alternative approach to representing musical material, resulting in increased control over the generated output. We provided experimental evidence demonstrating the effectiveness of the system and outlined several ongoing real-world applications. The system runs on most personal computers with an attention window of 2048 tokens, corresponding to 8-16 bars depending on the number of tracks and their density. However, using an auto-regressive approach and sliding windows, longer parts can be generated by repeatedly conditioning on portions of preexisting material along with newly generated material. Future work involves optimizing the model for real-time generation in musical agents, training larger models to expand the attention window and attend to larger musical structures, expanding the set of attribute controls, and continuing the integration of MIDI-GPT into real-world products and practices.

Ethical Statement

While MIDI-GPT is style-agnostic, and does include more musical styles and content than any human can know, it still inherits the dataset's biases, which only encompass the musical styles afforded by MIDI notation and available online. Arguably, there are still many musical styles that are under-represented (mostly non-Western styles) or even absent from the dataset (i.e. not representable or available in MIDI), and therefore from the model (although some users have noticed its generalization capabilities). Conversely, some musical styles are over-represented (e.g., pop). Further work is needed to qualify and quantify these biases.
Regarding copyright and intellectual property, MetaMIDI and GigaMIDI were acquired under the Fair Dealing provisions of Canadian copyright law, which limit their use to research and non-commercial purposes. We abide by these limitations.

Regarding MIDI-GPT: at the time of writing, it is unclear what the legal status of such a model is, as it does not contain the data itself and produces original content for any non-over-constrained request (as shown above). So far, we have restricted its use to non-commercial purposes. It is either released as free-for-use (Calliope), used for research purposes with collaborating companies, or used for research-creation purposes with selected artists. Responsibility for the ethical or legal implications of current creative use by other artists, and for any potential commercial use of the released model (under an Open RAIL-M license) by other parties, does not rest with the authors of this research. We thus decline any responsibility for misuse.

Acknowledgments

We would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council (SSHRC), the Canada Council for the Arts (CCA), and Steinberg Media Technologies. We also thank Griffin Page and the anonymous reviewers for their edits and feedback.

References

Artist Rights Alliance. 2024. 200+ Artists Urge Tech Platforms: Stop Devaluing Music. https://t.ly/-LbHx. Accessed: 2024-12-19.

Bougueng Tchemeube, R.; Ens, J.; Plut, C.; Pasquier, P.; Safi, M.; Grabit, Y.; and Rolland, J.-B. 2023. Evaluating Human-AI Interaction via Usability, User Experience and Acceptance Measures for MMM-C: A Creative AI System for Music Composition. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, 5769–5778.

Briot, J.-P.; Hadjeres, G.; and Pachet, F.-D. 2019. Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing.

Cancino-Chacón, C. E.; and Grachten, M. 2016. The Basis Mixer: A Computational Romantic Pianist. In Proceedings of the Late Breaking/Demo Session, 17th International Society for Music Information Retrieval Conference (ISMIR).

Chang, C.-J.; Lee, C.-Y.; and Yang, Y.-H. 2021. Variable-Length Music Score Infilling via XLNet and Musically Specialized Positional Encoding. In Proceedings of the 22nd ISMIR Conference, 97–104.

Chen, K.; Wang, C.-i.; Berg-Kirkpatrick, T.; and Dubnov, S. 2020. Music SketchNet: Controllable Music Generation via Factorized Representations of Pitch and Rhythm. In Proceedings of the 21st ISMIR Conference, 77–84.

Chi, W.; Kumar, P.; Yaddanapudi, S.; Rahul, S.; and Isik, U. 2020. Generating Music with a Self-Correcting Non-Chronological Autoregressive Model. In Proceedings of the 21st ISMIR Conference, 893–900.

Collins, T.; and Barthet, M. 2023. Expressor: A Transformer Model for Expressive MIDI Performance. In Proceedings of the 16th International Symposium on Computer Music Multidisciplinary Research, 740–743. Zenodo.

Crauwels, K. 2022. MusicMap. https://musicmap.info/. Accessed: 2024-12-19.

Donahue, C.; Mao, H. H.; Li, Y. E.; Cottrell, G.; and McAuley, J. 2019. LakhNES: Improving Multi-instrumental Music Generation with Cross-domain Pre-training. In Proceedings of the 20th ISMIR Conference, 685–692.

Dong, H.-W.; Chen, K.; Dubnov, S.; McAuley, J.; and Berg-Kirkpatrick, T. 2023. Multitrack Music Transformer. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5.

Ens, J.; and Pasquier, P. 2019. Quantifying Musical Style: Ranking Symbolic Music based on Similarity to a Style. In Proceedings of the 20th ISMIR Conference, 870–877.

Ens, J.; and Pasquier, P. 2020. MMM: Exploring Conditional Multi-Track Music Generation with the Transformer. CoRR, abs/2008.06048.

Ens, J.; and Pasquier, P. 2021. Building the MetaMIDI Dataset: Linking Symbolic and Audio Musical Data. In Proceedings of the 22nd ISMIR Conference, 182–188.

Fradet, N.; Briot, J.-P.; Chhel, F.; El Fallah Seghrouchni, A.; and Gutowski, N. 2021. MidiTok: A Python Package for MIDI File Tokenization. In Extended Abstracts for the Late-Breaking Demo Session of the 22nd ISMIR Conference.

Fradet, N.; Gutowski, N.; Chhel, F.; and Briot, J.-P. 2023a. Byte Pair Encoding for Symbolic Music. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2001–2020.

Fradet, N.; Gutowski, N.; Chhel, F.; and Briot, J.-P. 2023b. Impact of Time and Note Duration Tokenizations on Deep Learning Symbolic Music Modeling. In Proceedings of the 24th ISMIR Conference.

Gillick, J.; Roberts, A.; Engel, J.; Eck, D.; and Bamman, D. 2019. Learning to Groove with Inverse Sequence Transformations. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, 2269–2279.

Guo, R.; Simpson, I.; Kiefer, C.; Magnusson, T.; and Herremans, D. 2022. MusIAC: An Extensible Generative Framework for Music Infilling Applications with Multi-Level Control. In Proceedings of Artificial Intelligence in Music, Sound, Art and Design: 11th International Conference, EvoMUSART 2022 (Held as Part of EvoStar 2022, Madrid, Spain, April 20–22, 2022), 341–356. Springer.

Haki, B.; Nieto, M.; Pelinski, T.; and Jordà, S. 2022. Real-Time Drum Accompaniment Using Transformer Architecture. In Proceedings of the 3rd Conference on AI Music Creativity. AIMC.

Hawthorne, C.; Huang, A.; Ippolito, D.; and Eck, D. 2018. Transformer-NADE for Piano Performances. In NeurIPS 2018 Workshop on Machine Learning for Creativity and Design.

Hong, Y.; Hwang, U.; Yoo, J.; and Yoon, S. 2019. How Generative Adversarial Networks and Their Variants Work: An Overview. ACM Computing Surveys, 52(1).

Hsiao, W.-Y.; Liu, J.-Y.; Yeh, Y.-C.; and Yang, Y.-H. 2021. Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs. Proceedings of the 21st AAAI Conference on Artificial Intelligence, 178–186.

Huang, C. A.; Vaswani, A.; Uszkoreit, J.; Simon, I.; Hawthorne, C.; Shazeer, N.; Dai, A. M.; Hoffman, M. D.; Dinculescu, M.; and Eck, D. 2019. Music Transformer: Generating Music with Long-Term Structure. In 7th International Conference on Learning Representations (ICLR).

Huang, C.-Z. A.; Cooijmans, T.; Roberts, A.; Courville, A. C.; and Eck, D. 2018. Counterpoint by Convolution. In Proceedings of the 18th ISMIR Conference, 211–218.

Huang, Y.-S.; and Yang, Y.-H. 2020. Pop Music Transformer: Beat-Based Modeling and Generation of Expressive Pop Piano Compositions. In Proceedings of the 28th ACM International Conference on Multimedia, MM 2020, 1180–1188.

Lee, K. J. M.; Ens, J.; Adkins, S.; Sarmento, P.; Barthet, M.; and Pasquier, P. 2024. The GigaMIDI Dataset with Features for Expressive Music Performance Detection. Forthcoming.

Liu, J.; Dong, Y.; Cheng, Z.; Zhang, X.; Li, X.; Yu, F.; and Sun, M. 2022. Symphony Generation with Permutation Invariant Language Model. In Proceedings of the 23rd ISMIR Conference, 551–558.

Maezawa, A.; Yamamoto, K.; and Fujishima, T. 2019. Rendering Music Performance With Interpretation Variations Using Conditional Variational RNN. In Proceedings of the 20th ISMIR Conference, 855–861.

Malik, I.; and Ek, C. H. 2017. Neural Translation of Musical Style. In NeurIPS 2017 Workshop on Machine Learning for Creativity and Design.

Mittal, G.; Engel, J.; Hawthorne, C.; and Simon, I. 2021. Symbolic Music Generation with Diffusion Models. In Proceedings of the 22nd ISMIR Conference, 468–475.

Nuttall, T.; Haki, B.; and Jorda, S. 2021. Transformer Neural Networks for Automated Rhythm Generation. In Proceedings of the International Conference on New Interfaces for Musical Expression.

Oore, S.; Simon, I.; Dieleman, S.; Eck, D.; and Simonyan, K. 2020. This Time with Feeling: Learning Expressive Musical Performance. Neural Computing and Applications, 32(4): 955–967.

Papadopoulos, A.; Roy, P.; and Pachet, F. 2014. Avoiding Plagiarism in Markov Sequence Generation. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2731–2737.

Pati, A.; Lerch, A.; and Hadjeres, G. 2019. Learning to Traverse Latent Spaces for Musical Score Inpainting. In Proceedings of the 20th ISMIR Conference, 343–351.

Payne, C. 2019. MuseNet. openai.com/blog/musenet. Accessed: 2024-12-19.

Plut, C.; Pasquier, P.; Ens, J.; and Tchemeube, R. 2022. PreGLAM-MMM: Application and Evaluation of Affective Adaptive Generative Music in Video Games. In Proceedings of the 17th International Conference on the Foundations of Digital Games, FDG 2022.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Accessed: 2024-12-19.

Ren, Y.; He, J.; Tan, X.; Qin, T.; Zhao, Z.; and Liu, T.-Y. 2020. PopMAG: Pop Music Accompaniment Generation. In Proceedings of the 28th ACM International Conference on Multimedia, 1198–1206.

Roberts, A.; Engel, J.; Raffel, C.; Hawthorne, C.; and Eck, D. 2018. A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of PMLR, 4364–4373.

Sturm, B. L.; Ben-Tal, O.; Monaghan, Ú.; Collins, N.; Herremans, D.; Chew, E.; Hadjeres, G.; Deruty, E.; and Pachet, F. 2019. Machine Learning Research that Matters for Music Creation: A Case Study. Journal of New Music Research, 48(1): 36–55.

Tan, H. H.; and Herremans, D. 2020. Music FaderNets: Controllable Music Generation Based on High-Level Features via Low-Level Feature Modelling. In Proceedings of the 21st ISMIR Conference, 109–116.

von Rütte, D.; Biggio, L.; Kilcher, Y.; and Hofmann, T. 2023. FIGARO: Controllable Music Generation using Learned and Expert Features. In 11th ICLR.

Wang, Z.; Wang, D.; Zhang, Y.; and Xia, G. 2020. Learning Interpretable Representation for Controllable Polyphonic Music Generation. In Proceedings of the 21st ISMIR Conference, 662–669.

Wang, Z.; and Xia, G. 2021. MuseBERT: Pre-training Music Representation for Music Understanding and Controllable Generation. In Proceedings of the 22nd ISMIR Conference, 722–729.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45.

Wu, S.-L.; and Yang, Y.-H. 2020. The Jazz Transformer on the Front Line: Exploring the Shortcomings of AI-Composed Music through Quantitative Measures. In Proceedings of the 21st ISMIR Conference, 142–149.

Wu, Y.; Manilow, E.; Deng, Y.; Swavely, R.; Kastner, K.; Cooijmans, T.; Courville, A.; Huang, C.-Z. A.; and Engel, J. 2022. MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling. In 10th ICLR.
