RESEARCH ARTICLE
An automatic music generation and evaluation method based on transfer learning
Abstract
In recent years, deep learning has seen remarkable progress in many fields, especially with many excellent pre-training models emerging in Natural Language Processing (NLP). However, these pre-training models cannot be used directly in music generation tasks because music symbols and text are represented differently. Compared with the traditional representation of a musical melody, which only includes the pitch relationship between single notes, the text-like representation method proposed in this paper contains more melody information, including pitch, rhythm and pauses; it expresses the melody in a form similar to text and makes it possible to use existing pre-training models in symbolic melody generation. In this paper, based on the generative pre-training-2 (GPT-2) text generation model and transfer learning, we propose the MT-GPT-2 (music textual GPT-2) model for music melody generation. Then, a symbolic music evaluation method (MEM) is proposed through the combination of mathematical statistics, music theory knowledge and signal processing methods, which is more objective than manual evaluation. Based on this evaluation method and music theory, the music generation model in this paper is compared with other models (such as the long short-term memory (LSTM) model, the Leak-GAN model and Music SketchNet). The results show that the melody generated by the proposed model is closer to real music.
Some methods such as [5] generate melody and rhythm separately. Such a representation cannot capture the characteristics of the melody well.
Referring to the way text is generated, in theory we can combine the melody pitch of the notes with the melody rhythm and express a musical melody in a text-like form, so as to integrate the pitch, rhythm and other information into the text.
At the same time, we found that there are various problems in existing melody generation. For example, in the LSTM model [6], the length of melody that can be learned is limited, and it is usually impossible to generate a long-term music melody. To solve this problem, we turned to several large-scale pre-training models that work well in text generation (such as BERT [7], GPT-2 [8], etc.), but there is a data representation problem when applying these pre-training models to automatic music generation. Given the expression limitations of musical melodies, the musical melody representation method proposed above comes in handy.
We propose a textual music melody generation method based on the GPT-2 model, which transfers the large-scale pre-trained GPT-2 model to the task of music generation; it can generate long-term time-series melodies and simplifies music generation. The method can also be quickly transplanted to other large text generation models, so that they can be trained and used to generate melodies quickly.
Secondly, objective melody evaluation methods have always been a gap in automatic music generation. For example, in [3, 5, 9], the evaluation relies heavily on people's feedback. However, people's differing perceptions of music make such evaluation methods less practical and objective in reality. Alternatively, some papers such as [10] only analyzed the accuracy of the notes and scarcely evaluated the musicality of the melody. Some mathematical-statistical evaluation methods are used in [11], but they are still imperfect. Therefore, there is an urgent need for a more objective and comprehensive evaluation method.
To solve this problem, we propose a music evaluation method that combines mathematical statistics and music theory knowledge to evaluate the generated melody objectively through the degree of note change and the wavy nature of the melody. The results show that the melody generated by the textual music melody generation method based on GPT-2 is closer to real music than the LSTM model [6], the Leak-GAN model [12] and Music SketchNet [13].
Related work
Melody generation with neural networks
Many researchers have used traditional probabilistic generation methods for automatic music generation, mainly N-gram or Markov models [9, 14, 15]. Later, Bretan et al. [16] proposed a music fragment selection method, which creates new music by calculating a similarity ranking between music fragments. On the other hand, Pachet et al. [17] used chords to select melodies. These probabilistic modeling methods are feasible to some extent but limited. First, because of the diversity and development of music, the probability model cannot be updated in time. Second, establishing a good probability model requires deep knowledge of music theory. Moreover, traditional methods need to design and extract a large number of manual features, which takes a lot of manpower and time.
With the development of deep learning in recent years, deep neural networks have been applied to automatic music generation, which has solved some of the above problems. Franklin [18] proposed using recurrent neural networks (RNNs) to represent multiple notes at the same time, so as to generate more complex music sequences. Goel [19] proposed a polyphonic music generation method based on RNNs, in which multiple parts are generated using Gibbs sampling. Different from the RNN model, Sabathé et al. [4] used a memory-enhanced variational autoencoder for musical score composition.
Music evaluation
A sound music evaluation method has always been missing in the study of automatic music generation. In some papers, such as [11], volunteers were invited to test the quality of the generated music, but since everyone has different tastes in music, this method is rather subjective. It is necessary to establish an objective music evaluation index to assess the quality of the generated music. In other papers, such as [10], only the model accuracy (the accuracy of predicting the next note from previous notes, etc.) was analyzed, but such tests are not accurate measures of the generated music, which values "quality" over "quantity" and "new" over "old". Other works use the Wilcoxon signed-rank test, the Mann-Whitney U test and other mathematical statistical methods to test whether the data come from the same population [22]. Most of the existing methods are subjective, and the mathematical statistical methods, while objective, are not sufficient on their own, because they are not integrated with the characteristics of the music itself. Therefore, there is an urgent need for a more objective method with reasonable indicators to test the quality of the generated music.
Music representation
We first briefly introduce some basic forms of music data to familiarize readers who do not have a music background, and then explain how we express the music melody.
Melody representation
To simplify the problem, we first extract the melody tracks and represent them as text. Traditional music generation rarely takes note duration and pauses into account when the melody is extracted. For example, in Fig 2, the duration of the two notes framed in red is very small, and directly extracting pitch information cannot represent the difference between the notes, so it is necessary to reduce the unit duration and use a small unit length to represent all MIDI notes. Similarly, the extraction method in DeepBach [23] is rigid and does not consider the pauses between notes, which makes the song lack rhythmic information. With reference to music theory [24], we propose the preprocessing methods described in part IV.
Model
In this section, we first introduce the data preprocessing, then the music textual GPT-2 (MT-GPT-2) model, and finally the music evaluation method (MEM). The system flowchart of the model is shown in Fig 3.
Fig 2. Example of a melody in a MIDI file (FL Studio DAW is used for presentation).
https://doi.org/10.1371/journal.pone.0283103.g002
The preprocessing makes the music representation consistent with the input of most text generation models, which alleviates the current problems of inconsistent model inputs in automatic music generation and the resulting difficulties in model design. We first transpose the musical mode and adjust the tempo; then we extract the pitch and rhythm information of the notes to obtain the musical melody text. The steps are as follows:
Transpose. The purpose of this step is to convert all MIDI music to the same mode and the same tempo, because such data normalization is beneficial for model learning. First, by calculating the interval between the key of each MIDI file and the key of C, we transpose all the notes to the key of C. After that, the tonic of every music file is C, and the rest of the melody changes around the tonic in a regular manner. Then we set the tempo of all MIDI music to 90 BPM, so that all MIDI files play at the same speed.
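As a concrete illustration, here is a minimal sketch of this step using music21; the file names are placeholders, and inserting a single metronome mark (rather than removing any existing tempo marks) is a simplification of ours, not the authors' exact procedure.

```python
from music21 import converter, interval, pitch, tempo

def to_c_and_90bpm(midi_in, midi_out):
    """Transpose a MIDI file so that its tonic becomes C and set the tempo to 90 BPM."""
    score = converter.parse(midi_in)
    key = score.analyze('key')                                  # estimate the key of the piece
    shift = interval.Interval(key.tonic, pitch.Pitch('C'))
    score = score.transpose(shift)                              # move every note so the tonic is C
    score.parts[0].insert(0, tempo.MetronomeMark(number=90))    # prepend a uniform 90 BPM mark
    score.write('midi', fp=midi_out)

to_c_and_90bpm('song.mid', 'song_c90.mid')
```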
Extract note length information. We obtain the duration T0 of the shortest note in the melody score and use T0 as the basic unit length to extract the duration information of all notes, which solves the problem of missing duration information in the conversion process. The specific steps for extracting note duration information are as follows (a code sketch follows the list):
• Specify the MIDI music file and extract the melody track from Track 1;
• Traverse all the notes in the track to obtain the duration list L1, and take the smallest duration in it as T0;
• After obtaining T0, divide the duration of every note by T0 to obtain the multiple Ni of each note relative to the smallest note, which facilitates the textual representation below.
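A sketch of these steps, assuming (as above) that the melody lies in the first part of the MIDI file:

```python
from music21 import converter

def note_multiples(midi_path):
    """Return the unit duration T0 and the multiple Ni of every note with respect to T0."""
    melody = converter.parse(midi_path).parts[0].flat.notes      # melody track (Track 1)
    durations = [n.duration.quarterLength for n in melody]       # duration list L1
    t0 = min(d for d in durations if d > 0)                      # shortest duration T0
    multiples = [round(d / t0) for d in durations]               # Ni for every note
    return t0, multiples

t0, ni = note_multiples('song_c90.mid')
print(t0, ni[:10])
```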
Extract pitch and rhythm information. Using the previously extracted unit length T0 as the minimum length, we can extract notes, note durations and pause durations at the same time. An extracted sample is shown in Fig 5. For notes, we directly use letters to represent them, starting from middle C of the piano, which is called C4. Moving to the right in turn gives C4 D4 E4 F4 G4 A4 B4 (the white keys); after B4 come C5 D5 E5 F5 G5 A5 B5. E-4 is the black key between D4 and E4; this key can be written as E-4 ("-": flat) or D#4 ("#": sharp).
Rhythm information indicates durations and pauses. The symbol "-" indicates that the previous note is extended according to our unit time (the shortest note), and the symbol "^" indicates the duration of a pause. In this way, no information is missing from the representation of the melody.
The steps of textualizing a music melody are as follows (a code sketch is given after the list):
• Specify the MIDI music file and extract the melody track from Track 1;
• Starting from the first note, as shown in Fig 5, extract the pitch of the first note, C4;
Fig 5. Example of textual conversion; the upper part of the figure is the musical melody, and the bottom is the corresponding melody text.
https://doi.org/10.1371/journal.pone.0283103.g005
• Determine whether the duration of the note is one unit length. If yes, move on to the next note. If not, fill in the duration information based on the multiple Ni obtained above, using the symbol "-" to indicate the duration of the note. For example, for the following C4 the obtained Ni is 4 units of duration, so 4 "-" symbols are appended;
• If the next event is a pause, divide the pause duration by T0 to get the number of unit durations. As shown in Fig 5, the pause after "D3 - -" lasts 2 units, so two "^" symbols are filled in.
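Putting these steps together, the following is a hedged sketch of the textual conversion; the exact token layout and spacing follow our reading of Fig 5 rather than the authors' code.

```python
from music21 import converter, note

def melody_to_text(midi_path):
    """Convert a monophonic melody (first part of a MIDI file) into the textual
    representation described above: pitch names, '-' for note extension, '^' for rests."""
    events = list(converter.parse(midi_path).parts[0].flat.notesAndRests)
    t0 = min(e.duration.quarterLength for e in events if e.duration.quarterLength > 0)
    tokens = []
    for e in events:
        n_units = max(1, round(e.duration.quarterLength / t0))    # multiple Ni
        if isinstance(e, note.Note):
            tokens.append(e.nameWithOctave)                       # e.g. 'C4', 'E-4'
            if n_units > 1:
                tokens.extend(['-'] * n_units)                    # the paper appends Ni '-' symbols
        elif isinstance(e, note.Rest):
            tokens.extend(['^'] * n_units)                        # one '^' per unit of pause
    return ' '.join(tokens)

print(melody_to_text('song_c90.mid'))
```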
This approach makes it convenient to estimate P(x) as well as any conditional of the form P(s_{n-k}, \ldots, s_n \mid s_1, \ldots, s_{n-k-1}). In simple terms, the preceding data are used to predict the following data. A breakthrough of GPT-2 over other pre-training models is that it performs well in downstream tasks such as reading comprehension, machine translation, question answering and text summarization without training for the specific downstream task, and its unsupervised language modeling can learn the features needed for supervised tasks.
For an unsupervised corpus of tokens U = (u_1, \ldots, u_n), we use the model to maximize the following language modeling objective:

L(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta) \quad (2)

In the above formula, k is the size of the context window, and the conditional probability P is modeled by a neural network with parameters \theta. These parameters are trained using stochastic gradient descent.
In our experiment, we use a multi-layer Transformer decoder as the language model, a variant of the Transformer [25]. This model applies multi-headed self-attention and then a position-wise feed-forward network to produce the output distribution over target tokens:

h_0 = U W_e + W_p \quad (3)

where U = (u_{i-k}, \ldots, u_{i-1}) is the context vector of tokens, n is the number of layers, W_e is the token embedding matrix, and W_p is the position embedding matrix.
The self-attention mechanism finds the influence of the other words on the current input word in a passage. For example, in "I have a dog, and it is very good", when dealing with "it", the word "dog" has the greatest impact on "it". This attention mechanism was proposed with the Transformer [27].
The Q (query) vector is the representation of the current word and is used to score all other words (using their keys). The K (key) vectors are like labels for all the words in the segment; they are the objects we match against when searching for related words. The V (value) vector is the actual word representation.
Masked self-attention means that when we process the input sentence and compute the attention weights, we only attend to the current word and the words before it, and mask the words after it. The advantage is that the attention weights are extracted in one direction, which removes the influence of subsequent words and prevents exactly the same information from being reproduced during generation. This method reduces the accuracy of the model, but in the case of music generation it avoids producing identical music.
Masked multi-head self-attention means that, when performing the matrix operations, the extracted Q, K and V vectors are split into several heads that are processed separately, and the results are then combined. The advantage is that the attentional influence between words can be captured more comprehensively.
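To make the mechanism concrete, here is a minimal NumPy sketch of masked (causal) multi-head self-attention; the random weights and dimensions are purely illustrative, not the authors' implementation.

```python
import numpy as np

def masked_multihead_self_attention(x, num_heads=4):
    """Causal multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # causal mask: position i may only attend to positions <= i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)            # this head's slice of Q, K, V
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        scores[future] = -1e9                               # mask out future positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
        heads.append(weights @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ Wo              # combine heads, project back

out = masked_multihead_self_attention(np.random.randn(16, 64))
print(out.shape)   # (16, 64)
```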
The adjustment part is shown in Fig 7.
As shown in Fig 7, the MT-GPT-2 generation model is based on our specific preprocessing method. The construction of a note dictionary is added before model training: the notes are converted into melody text files and all the elements in them are extracted to build a dictionary, which is then used for model training. The input and output interfaces of the model are also adapted.
To transfer the GPT-2 model to automatic music generation, we use the simplest and most direct method, which is to textualize the music data so that the GPT-2 model can use them directly. However, because music text differs from natural language text, the model needs to be adjusted. For the textual data we produced, we constructed a dictionary of notes by sorting and de-duplicating all the note tokens that appear. The tokenizer method of the BERT model [7] is used to obtain the note dictionary, and the data set is divided into 100 small subsets.
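A hedged sketch of the dictionary construction and data splitting; the file layout, token separator and special tokens are assumptions of ours, not the authors' exact format.

```python
from collections import Counter
from pathlib import Path

def build_note_vocab(text_dir, n_subsets=100):
    """Collect all tokens from the melody text files, sort and de-duplicate them,
    and split the corpus into subsets for training."""
    counter = Counter()
    lines = []
    for path in sorted(Path(text_dir).glob('*.txt')):
        for line in path.read_text().splitlines():
            counter.update(line.split())
            lines.append(line)
    vocab = ['[PAD]', '[UNK]'] + sorted(counter)                 # special tokens + sorted note tokens
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    subsets = [lines[i::n_subsets] for i in range(n_subsets)]    # 100 small data subsets
    return token_to_id, subsets

token_to_id, subsets = build_note_vocab('melody_texts/')
print(len(token_to_id), len(subsets))
```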
Compared with the GPT-2 model, the parameters of the MT-GPT-2 model need to be adjusted. First of all, the original word embedding dimension of 1024 is appropriate for articles but too large for musical notes: there are only 144 distinct note symbols, far fewer than the words in a text vocabulary, so the dimension has to be reduced. Secondly, we increase the number of attention heads, because a musical melody has more features than text, and more heads are needed to learn the relationships between notes in multiple ways. For each layer of the neural network we increase the width, because a song is longer than a typical text, and a wider layer enhances the model's ability to learn long musical sequences.
For the output melodies, some adjustments are made. Any note or melody fragment can be input as the driving notes, based on which the subsequent notes are generated by the model. The length of the generated sequence is set to 1024, so that the duration of the generated melody is about 150 seconds. The model generates one note at a time, and one of the 8 most probable notes is selected, which increases the randomness and novelty of the melody.
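A minimal sketch of the top-8 sampling step described here; `next_token_probs` is a hypothetical model interface, not the authors' API.

```python
import numpy as np

def sample_top_k(probs, k=8):
    """Select one token among the k most probable ones (k = 8 in the paper)."""
    probs = np.asarray(probs)
    rng = np.random.default_rng()
    top_ids = np.argsort(probs)[-k:]                   # indices of the k most probable tokens
    top_p = probs[top_ids] / probs[top_ids].sum()      # renormalize their probabilities
    return int(rng.choice(top_ids, p=top_p))

def generate(next_token_probs, seed_tokens, length=1024, k=8):
    """next_token_probs(tokens) -> probability vector over the note dictionary (assumed)."""
    tokens = list(seed_tokens)
    while len(tokens) < length:
        tokens.append(sample_top_k(next_token_probs(tokens), k))
    return tokens
```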
i) Wilcoxon test. The Wilcoxon signed-rank test sums the ranks of the absolute differences between the observations and the center specified by the null hypothesis. It is suitable for the pairwise comparisons of the paired T test, but it does not require the differences between paired data to follow a normal distribution, only a symmetric one. It tests whether the differences between paired observations come from a population with a mean of 0 (i.e., whether the populations producing the data have the same mean).
Both the sign test and the Wilcoxon signed-rank test can be regarded as substitutes for the parametric paired T test. Nonparametric tests have the advantage of not requiring assumptions about the population distribution, whereas the parametric paired T test assumes that the differences are normally distributed.
The steps of the method are as follows:
• We assume H0: there is no significant difference between the generated music and the real music, and H1: there is a significant difference between the generated music and the real music.
• Taking into account that musical modes vary greatly from song to song, we subtract the pitch number of the preceding note from that of each note to obtain a sequence Di that records the variation of the notes.
• We use the Wilcoxon test function in Python's scientific computing package to compare the generated melodic difference sequence D1 with the real musical difference sequence D2 and check whether there is a significant difference.
• Obtain the p value.
We then specify the significance level α (generally 0.05). When p > α, we accept H0 and reject H1, concluding that the generated music is not significantly different from the real music.
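As an illustration, here is a minimal sketch of the pitch-difference computation and the Wilcoxon test with music21 and SciPy; the file names are placeholders, and truncating the two sequences to equal length is a simplification of ours.

```python
import numpy as np
from music21 import converter
from scipy import stats

def pitch_diff_sequence(midi_path):
    """MIDI melody -> sequence Di of pitch changes (successive differences of MIDI numbers)."""
    melody = converter.parse(midi_path).parts[0].flat.notes
    midi_numbers = [n.pitch.midi for n in melody if n.isNote]
    return np.diff(midi_numbers)

d_gen = pitch_diff_sequence('generated.mid')
d_real = pitch_diff_sequence('real.mid')

# The signed-rank test needs paired samples of equal length, so truncate to the shorter one.
m = min(len(d_gen), len(d_real))
stat, p = stats.wilcoxon(d_gen[:m], d_real[:m])
print(f'Wilcoxon p-value: {p:.4f}')    # p > 0.05 -> no significant difference
```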
ii) Mann-Whitney U test. The Mann-Whitney U test, also known as the Mann-Whitney rank-sum test, can be regarded as a nonparametric substitute for the two-sample T test for the difference between two means, or for the corresponding large-sample normal test. Since the Mann-Whitney rank-sum test explicitly considers the rank of every measured value in each sample, it uses more information than the sign test.
The steps of the test are as follows:
• Similar to the Wilcoxon test, we test whether the generated music is significantly different from the real music, that is, hypothesis H0: there is no significant difference between the two pieces of music, and H1: there is a significant difference between them.
• Preprocess the obtained music sequences to get the generated melody difference sequence D1 and the real melody difference sequence D2.
• Use the Mann-Whitney U test function in Python's scientific computing package to obtain the U value.
Calculate U according to the above steps and compare it with the critical value Uα (α is generally 0.05; Uα can be obtained from a table). When U ≥ Uα, we reject H1 and accept H0, that is, there is no significant difference between the generated music and the real music.
iii) Kruskal-Wallis H test. In practice, it is often necessary to compare the differences between the means of several independent groups of data, and the results are sometimes affected by more than one factor. In this case, combined experiments with repeated sampling of the factors at different levels are needed; when the test error is too large, it is not conducive to comparing the differences, because too many samples cannot be accommodated in one combination.
In addition, the data within a group should be homogeneous, and when extracting the data it is necessary to consider how to design the experiment according to the randomness of the data source. The Kruskal-Wallis test extends the Wilcoxon-Mann-Whitney test from two independent samples to three or more groups.
We assume H0: there is no significant difference between the generated music and the real music, and H1: there is a significant difference between the generated music and the real music.
We calculate the p value with Python's scientific computing package and then specify the significance level α (generally 0.05). When p > α, we accept H0 and reject H1, concluding that the generated music is not significantly different from the real music.
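The two remaining tests can be run on the same pitch-difference sequences with SciPy; a brief sketch under the same assumptions as the Wilcoxon example above:

```python
from scipy import stats

# Reuses pitch_diff_sequence and the placeholder file names from the Wilcoxon sketch above.
d_gen = pitch_diff_sequence('generated.mid')
d_real = pitch_diff_sequence('real.mid')

u_stat, p_u = stats.mannwhitneyu(d_gen, d_real)    # Mann-Whitney U test
h_stat, p_h = stats.kruskal(d_gen, d_real)         # Kruskal-Wallis H test
print(f'Mann-Whitney p = {p_u:.4f}, Kruskal-Wallis p = {p_h:.4f}')
```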
Music theory evaluation. The current evaluation of automatic music generation does not take music theory into account; in other words, evaluating only the model accuracy and mathematical statistics is not comprehensive. Therefore, several music-theory indicators are proposed in this paper, including the smooth-saltatory progression comparison, the wave test and the note-level mode test.
i) Smooth-saltatory progression comparison. In musical harmony there is the term "smooth progression": progressions of one, two or three degrees are called "smooth progression", while progressions of four, five or more degrees are called "saltatory progression".
Progression is a form of melodic movement; it refers to the ascending or descending motion between two adjacent notes of a melody in the order of the scale. For example, if the first note is C4 and the second note is D4, the interval between them is a major second, and such a transition between two notes is referred to in this paper as "smooth progression". If the first note is C4 and the second is F4, the interval between them is a perfect fourth, and we call it "saltatory progression". The upward direction generally grows stronger and emphasizes the theme of the song, while the downward direction is more peaceful.
We represent the generated music digitally and then perform a difference operation. First we obtain the note sequence a_1, a_2, a_3, \ldots, a_n and compute the difference d_n = a_{n+1} - a_n between adjacent notes (n = 1, 2, 3, \ldots), which gives the difference sequence D(a_n).
After obtaining the sequence D(a_n), we count the values whose absolute value is 4 or more (representing four degrees or larger) and compare this count with the number of values whose absolute value is less than 4, to get the smooth-saltatory progression ratio:

q = \frac{\#\{|D(a_n)| \ge 4\}}{\#\{|D(a_n)| < 4\}} \quad (8)
In addition, the degree of progression can also be analyzed through the spectrum obtained after a discrete Fourier transform.
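A small sketch of the ratio in Eq (8); the threshold of 4 is taken at face value from the paper, and the example input is an illustrative difference sequence rather than real data.

```python
import numpy as np

def smooth_saltatory_ratio(diffs):
    """q = #(|D(a_n)| >= 4) / #(|D(a_n)| < 4), as in Eq (8)."""
    diffs = np.abs(np.asarray(diffs))
    saltatory = np.count_nonzero(diffs >= 4)     # jumps of a fourth or more
    smooth = np.count_nonzero(diffs < 4)         # stepwise (smooth) progressions
    return saltatory / smooth if smooth else float('inf')

print(smooth_saltatory_ratio([2, -1, 5, -7, 1, 2]))   # example difference sequence -> 0.5
```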
ii) Wave test. A musical melody moves up and down, and its outline appears as a flowing waveform. In the progression of a melody, the notes always start from a steady note, then go up or down, and finally return to a steady note. Therefore, we detect the waveform of the generated melodies and compare it with the waveform of real melodies to see whether there is an obvious difference, so as to judge the quality of the generated melody.
According to its shape, a melody can be roughly divided into seven types [28].
• Big mountain type (big wave type): composed of large intervals and a wide range of ups and downs; it is often associated with lofty feelings, broad singing, and a magnificent, heroic character.
• Rising type: a continuously ascending melody often expresses high, agitated emotions.
• Falling type: a continuously descending melody is often used to express emotions that change from tension to relaxation.
• Repeating type: a horizontal melody of repeated notes often plays a background role, or, like recitative in an opera, sets a specific atmosphere.
• Stress type: an interrogative melody with the highest note at the end often produces a "questioning" effect similar to a questioning tone of voice.
• Zigzag type: the melody has a small range of ups and downs with a short crest period, fluctuating rapidly over small intervals within a narrow range; it can make the emotions appear vivid and lively.
• Surround type: the melody moves up and down around a central tone.
The detection of the waveform is slightly simplified in this paper. The difference sequence D(a_n) obtained above is judged by its signs: a positive number represents a rise in pitch, and a negative number a fall. Starting from the first element, we check whether three consecutive differences have the same sign, meaning the melody moves in one direction for three notes; we then check whether the next three differences share the opposite sign, meaning the melody turns and moves in the other direction. If both conditions hold, the round of testing ends, it is counted as one wave, and the scan continues from that position.
The waviness test steps are as follows:
• Convert the generated MIDI melody file into a sequence Ai, i = 1, 2, \ldots, n.
• Compute Di = Ai+1 − Ai to obtain the difference sequence Di.
• Starting from i = 1, examine each element of Di: verify that the current element and the two elements after it have the same sign (the corresponding three notes move in the same direction), and then verify that the following three elements share the opposite sign (the next three notes move in the opposite direction).
• Set an initial counter k = 0, and increment k by 1 whenever the above conditions are met.
Then we compare the number of waves in the generated music with that in real music to make an evaluation.
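A hedged sketch of the wave counting; this is one reading of the simplified rule above (three same-sign differences followed by three opposite-sign differences), with an illustrative example input.

```python
import numpy as np

def count_waves(diffs):
    """Count waves: three consecutive differences with the same sign followed by
    three differences with the opposite sign."""
    signs = np.sign(diffs)
    k = 0
    i = 0
    while i + 6 <= len(signs):
        first, second = signs[i:i + 3], signs[i + 3:i + 6]
        if (first != 0).all() and (first == first[0]).all() and (second == -first[0]).all():
            k += 1
            i += 6            # continue scanning after the detected wave
        else:
            i += 1
    return k

print(count_waves([1, 2, 1, -1, -2, -1, 3, 1]))   # example difference sequence -> 1 wave
```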
iii) Note-level mode test. The note-level mode test checks whether the generated notes all fall in the C major key that we stipulated, from which the following ratio is obtained:

d = \frac{Note_{in}}{Note_{all}} \quad (9)

where Note_in is the number of notes in the mode and Note_all is the total number of notes. The higher the ratio of in-mode notes to all notes, the more consistent the generated melody is with the mode.
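A brief sketch of the ratio in Eq (9) for C major, again assuming a monophonic melody in the first part of the MIDI file and a placeholder file name:

```python
from music21 import converter

C_MAJOR = {'C', 'D', 'E', 'F', 'G', 'A', 'B'}    # pitch classes of the C major scale

def in_mode_ratio(midi_path):
    """d = Note_in / Note_all for C major, as in Eq (9)."""
    notes = [n for n in converter.parse(midi_path).parts[0].flat.notes if n.isNote]
    in_key = sum(1 for n in notes if n.pitch.name in C_MAJOR)   # 'C#', 'E-', ... fall outside
    return in_key / len(notes) if notes else 0.0

print(in_mode_ratio('generated.mid'))
```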
Experiments
In this section, we introduce the selected data set and its preprocessing. We then describe the model parameters and the computer configuration and generate melodies. The generated melodies are analyzed with different starting characters and compared with the LSTM model, and finally MEM is used to evaluate them.
Dataset
The data set we used is POP909 [30], which contains 909 Chinese pop songs. We used the MIDI files in the data set and processed them with music21 [31] in Python. The music in this data set is clean, so we can directly extract the melody part through track selection, which greatly reduces the preprocessing time and lets us focus on model construction, training and evaluation.
Preprocessing
All the songs were shifted to the key of C major. We then processed the 909 pop songs with the preprocessing method described in part IV to obtain a series of melody texts, which we divided into one hundred subsets to facilitate model training. Fig 9 shows samples of the resulting melody text.
Model training
We implemented our model using TensorFlow [32] and Keras [33], and used music21 [31] to process our data. The parameters are set as follows (summarized in the sketch after the list):
• The GPT-2 model used in the experiment has 8 hidden layers;
• The width of each layer is 1024;
• The word embedding dimension is 64;
• The position vector length is 1024;
• The number of attention heads is 4;
• The initial learning rate is 0.0001;
• The input size is 1024;
• We used an RTX 2060 GPU for training, and the number of training rounds was set to 1000.
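For reference, a hypothetical configuration dictionary that summarizes the listed hyperparameters; the names are illustrative and do not correspond to the authors' actual code.

```python
# Hypothetical summary of the hyperparameters listed above.
MT_GPT2_CONFIG = {
    'n_layers': 8,          # hidden (Transformer decoder) layers
    'layer_width': 1024,    # width of each layer
    'embedding_dim': 64,    # word (note) embedding dimension
    'n_positions': 1024,    # position vector length / maximum sequence length
    'n_heads': 4,           # multi-head attention heads
    'learning_rate': 1e-4,  # initial learning rate
    'input_size': 1024,     # input sequence length
    'epochs': 1000,         # training rounds
}
```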
After several rounds of experiments, we found that the melody generated by the model trained with the above parameters is more in line with our auditory habits and also achieves good results in the MEM test. The change of the loss value with the number of training rounds is shown in Fig 10.
Fig 9. Sample music melody text after preprocessing; each line represents the melody of one song.
https://doi.org/10.1371/journal.pone.0283103.g009
Fig 10. The change of the loss value during training; the abscissa is the training round and the ordinate is the loss.
https://doi.org/10.1371/journal.pone.0283103.g010
Melody generation
Before generating a melody, we specify the length of the notes to be generated and some parameters (length of the generated sequence, randomness of the generated notes, etc.), and then enter some driving characters m1, m2, \ldots, mn.
When generating notes, we first input the initial notes; the length of this string of notes can be customized. The model is then set to select one of eight notes or symbols (the eight with the highest probability predicted by the model) as the next token. MT-GPT-2 generates one note at a time, and the generated length is 1024. Converted to MIDI music at a tempo of 90, the music lasts about 2 minutes and 30 seconds. The generated notes, converted into a melody, are shown in Fig 11.
We designed three experiments to evaluate the generated music, divided into the following three parts: comparison with different initial notes, comparison with other models, and evaluation using MEM. The comparison of different initial notes confirms the performance of MT-GPT-2 and verifies whether some characteristics of pop melodies have been learned. The comparison with the LSTM model verifies that the large pre-trained MT-GPT-2 model can achieve better results in music generation. Finally, through MEM verification, we compare the generated music with real music and with the LSTM model to verify, from a more objective perspective, whether the MT-GPT-2 model can meet the requirements of real music melodies.
Fig 11. Examples of music generated with the MT-GPT-2 model. The note in the red box is the starting note of the input; the notes framed in blue in (a) and (b) are rhythms and pitches similar to the starting notes, while the notes framed in green in (c) and (d) are rhythms and pitches unrelated to the starting notes.
https://doi.org/10.1371/journal.pone.0283103.g011
generate the relatively complex rhythmicity, which shows that the rhythmic text is useful in
music generation.
From the music generated above, we can see that the music we generated has a series of
characteristics of real music.
Fig 12. MIDI files (a) and (b), generated using the LSTM model, compared with MIDI files (c) and (d), generated using the MT-GPT-2 model (FL Studio DAW is used for presentation).
https://doi.org/10.1371/journal.pone.0283103.g012
showed better performance for notes with a long interval, and could notice the tonic after
several notes, and change around the tonic, which met the requirements of popular music
creation.
In terms of rhythm, in the preceding (a) and (b) the rhythm formed by the durations of notes and pauses is somewhat disorganized and does not sound rhythmic to the listener. In (c) and (d), the music generated by MT-GPT-2, the rhythm is relatively obvious and shows a certain repetition, which is in line with listeners' auditory habits.
For melodic pitch change, we use pink arrows in the figure to mark the changing direction of the notes. In (a) and (b) we cannot see intuitive waves in the melody: the melody of (a) moves downward all the time, and the melody of (b) starts downward and ends rising, but the time interval is too long and the range of notes too wide, which does not conform to the characteristics of pop music. The melodies in (c) and (d) are obviously wavy, and the rising and falling notes have a certain momentum and trend of change, which meets the requirements of pop melodies in music theory.
The music shown in (c) and (d) is generated by the MT-GPT-2 model. The melodic fragments framed in yellow show that the melodies have obvious repetition, a musical characteristic that the LSTM model cannot generate.
All these aspects of the analysis show that using a large pre-trained model to generate music is better than using some simpler models.
Because the generated melodies are sometimes not in the same key and sometimes differ by an octave, we subtract the previous pitch from each note's pitch to obtain a sequence of pitch variations, so that the comparison is not restricted by the tonic or the tonality of the music; comparing the similarities between such change sequences eliminates the differences caused by the pitch of the tonic. Starting from the direction and size of the note changes, the comparison between the generated music and real music can be made through the above three statistical tests.
We can see that, among the various mathematical statistics, our MT-GPT-2 model is closer to real music. In the U test and the K test, our model reaches an evaluation of 0.5 or more. Compared with the LSTM model [6], the Leak-GAN model [12] and Music SketchNet [13], our model achieves better results.
Table 1 shows the comparison results obtained with traditional mathematical statistics methods. In the mathematical model, we use our new music representation to convert the melody into a one-dimensional signal, which greatly reduces the difficulty of music comparison and is very convenient when applying the comparison methods.
Music theory statistics. The smooth-saltatory progression comparison and the wave test are novel in music evaluation. We use our new data representation to compare one-dimensional melody signals and find out whether there is an objective gap between the generated melody and the real melody.
For the smooth-saltatory progression comparison, we offer the following explanation to make its meaning more intuitive.
In Fig 13, the change between the two notes in the blue frame is less than 4 intervals, which is called "smooth progression", while the change between the notes in the red frame is more than 4 intervals, which is called "saltatory progression". The changes of the notes in music have a great impact on the listening experience: music should be gradual rather than jumping between highs and lows. If there is a sudden change of more than 4 degrees, it can sound very obtrusive, so it is necessary to detect the changes of the notes.
For the wave test analysis, we offer the following explanation to make its meaning more intuitive.
In the melody section shown in Fig 14, the notes under the blue arrows conform to the rule above: three notes in a row show the same trend and then change in the opposite direction, which is called a wave. The wave-shape test is an important aspect of identifying whether a melody is lively. This kind of orderly, undulating change is very common in a large number of famous songs; our purpose here is to find out whether the model has learned this characteristic.
We use the MT-GPT-2 model and the LSTM model [6] to randomly generate ten pieces of music each, compare them with the real music in the data set, and take the average of each evaluation index, as shown in Table 2.
As shown in Table 2, in terms of music theory, the music generated by the MT-GPT-2 model is very similar to real music in all aspects, has more variability, and stays entirely in C major.
From Table 2 we can also see that, for our MT-GPT-2 model, the note changes and the wave test are relatively active, which shows that our model has learned something about music: its output is more musical than that of the LSTM [6], Leak-GAN [12] and Music SketchNet [13] models. Secondly, our music is very accurate, falling entirely within the range of C major, so the generated melody is more in line with the aesthetics of hearing.
Conclusion
In this paper, we described the MT-GPT-2 model, a GPT-2 music generation model based on textualized music data. Thanks to this data preprocessing method, we have achieved consistency between text generation and music generation. The use of special symbols to indicate the extension of notes and the pauses in the music is often overlooked in data processing.
The new textualized melody approach we propose is only one of many possible representations. We could have combined the features in different ways (for example, separating information such as the pitch of a note and its duration and putting them in different vectors rather than together) and represented the various kinds of musical information differently. We chose this approach because it seemed the most straightforward: to apply a large model to music generation in a convenient and simple way, we experimented with this method to ensure the integrity and representativeness of the music data. The purpose is to be able to use other large language models to train and generate music quickly.
At the same time, we have established a music melody evaluation model, MEM, which is described through mathematical statistics and music theory. In this way we can evaluate music as objectively as possible and judge the quality of the model and the generated music. Although the proposed evaluation method gives a more objective view of the model's ability to generate musical melodies, it does not reveal anything about the inner workings of the model. The interpretability of neural networks is a difficult problem that is worth further study.
Finally, the entire study was conducted in the context of MIDI data. However, this representation has some limitations. For example, the representation of some notes becomes very long when the extracted unit time is too small, which dilutes the data. There is also still no good solution for the text representation of multiple tracks. Studying the generation of chords under the melody and the generation of musical sections will be the major research directions in the future.
Author Contributions
Formal analysis: Yi Guo, Yangcheng Liu.
Methodology: Yangcheng Liu.
Resources: Yi Guo.
Software: Yangcheng Liu.
Validation: Yangcheng Liu.
Visualization: Yangcheng Liu.
Writing – original draft: Yangcheng Liu.
Writing – review & editing: Yi Guo, Ting Zhou, Liang Xu, Qianxue Zhang.
References
1. Hao X, Zhang G, Ma S. Deep Learning. International Journal of Semantic Computing. 2016; 10
(03):417–439. https://doi.org/10.1142/S1793351X16500045
2. Fukuda K, Mori N, Matsumoto K. A Novel Sentence Vector Generation Method Based on Autoencoder and Bi-directional LSTM. In: International Symposium on Distributed Computing and Artificial Intelligence; 2018.
3. Zhu H, Liu Q, Yuan NJ, Zhang K, Chen E. Pop Music Generation: From Melody to Multi-style Arrange-
ment. ACM Transactions on Knowledge Discovery from Data. 2020; 14(5):1–31. https://doi.org/10.
1145/3374915
4. Sabathé R, Coutinho E, Schuller B. Deep recurrent music writer: Memory-enhanced variational autoen-
coder-based musical score composition and an objective measure. In: 2017 International Joint Confer-
ence on Neural Networks (IJCNN). IEEE; 2017. p. 3467–3474.
5. Zhu H. Research on Automatic composition and Arrangement based on deep Learning. University of
Science and Technology of China;.
6. Mangal S, Modak R, Joshi P. Lstm based music generation system. arXiv preprint arXiv:190801080.
2019;.
7. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for lan-
guage understanding. arXiv preprint arXiv:181004805. 2018;.
8. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. Language models are unsupervised
multitask learners. OpenAI blog. 2019; 1(8):9.
9. Pachet F, Roy P. Markov constraints: steerable generation of Markov sequences. Constraints. 2011; 16
(2):p.148–172. https://doi.org/10.1007/s10601-010-9101-4
10. Keerti G, Vaishnavi A, Mukherjee P, Vidya AS, Sreenithya GS, Nayab D. Attentional networks for music
generation. arXiv preprint arXiv:200203854. 2020;.
11. Zhang N. Learning adversarial transformer for symbolic music generation. IEEE Transactions on Neural
Networks and Learning Systems. 2020;.
12. Guo J, Lu S, Han C, Zhang W, Wang J. Long Text Generation via Adversarial Training with Leaked
Information. 2017;.
13. Chen K, Wang CI, Berg-Kirkpatrick T, Dubnov S. Music SketchNet: Controllable Music Generation via
Factorized Representations of Pitch and Rhythm; 2020.
14. Chordia P, Sastry A, Şentürk S. Predictive tabla modelling using variable-length markov and hidden
markov models. Journal of New Music Research. 2011; 40(2):105–118. https://doi.org/10.1080/
09298215.2011.576318
15. Van Der Merwe A, Schulze W. Music generation with markov models. IEEE MultiMedia. 2010; 18
(3):78–85. https://doi.org/10.1109/MMUL.2010.44
16. Bretan M, Weinberg G, Heck L. A unit selection methodology for music generation using deep neural
networks. arXiv preprint arXiv:161203789. 2016;.
17. Pachet F, Papadopoulos A, Roy P. Sampling Variations of Sequences for Structured Music Generation.
In: ISMIR; 2017. p. 167–173.
18. Franklin JA. Recurrent neural networks for music computation. INFORMS Journal on Computing. 2006;
18(3):321–338. https://doi.org/10.1287/ijoc.1050.0131
19. Goel K, Vohra R, Sahoo JK. Polyphonic Music Generation by Modeling Temporal Dependencies Using
a RNN-DBN. Springer International Publishing. 2014;.
20. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial
nets. Advances in neural information processing systems. 2014; 27.
21. Roberts A, Engel J, Raffel C, Hawthorne C, Eck D. A Hierarchical Latent Vector Model for Learning
Long-Term Structure in Music; 2018.
22. Huang CZA, Vaswani A, Uszkoreit J, Shazeer N, Hawthorne C, Dai A, et al. Music transformer: Gener-
ating music with long-term structure (2018). arXiv preprint arXiv:180904281. 2018;.
23. Hadjeres G, Pachet F, Nielsen F. Deepbach: a steerable model for bach chorales generation. In: Inter-
national Conference on Machine Learning. PMLR; 2017. p. 1362–1371.
24. Li C. Music Theory Foundation. People’s Music Publishing House; 1962.
25. Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R. Transformer-xl: Attentive language mod-
els beyond a fixed-length context. arXiv preprint arXiv:190102860. 2019;.
26. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. Xlnet: Generalized autoregressive pre-
training for language understanding. Advances in neural information processing systems. 2019; 32.
27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In:
Advances in neural information processing systems; 2017. p. 5998–6008.
28. Schoenberg A, Stein L, Strang G. Fundamentals of musical composition. Faber & Faber London; 1967.
29. Hogg RV, McKean J, Craig AT. Introduction to mathematical statistics. Pearson Education; 2005.
30. Wang Z, Chen K, Jiang J, Zhang Y, Xu M, Dai S, et al. Pop909: A pop-song dataset for music arrange-
ment generation. arXiv preprint arXiv:200807142. 2020;.
31. Cuthbert MS, Ariza C. music21: A toolkit for computer-aided musicology and symbolic music data.
2010;.
32. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. Tensorflow: Large-scale machine
learning on heterogeneous distributed systems. arXiv preprint arXiv:160304467. 2016;.
33. Ketkar N. Introduction to keras. In: Deep learning with Python. Springer; 2017. p. 97–111.