
2023 International Conference on New Trends in Computational Intelligence (NTCI 2023)

Music Transcription Based on Deep Learning


Hongzhuo Chen
School of Computer Science and Engineering, Southeast University, Nanjing, China
Henry Samueli School of Engineering, University of California, Irvine, USA
[email protected]

Abstract—Automatic Music Transcription (AMT), which facilitates music composition, editing, and performance, aims at transcribing music into scores that can be read and modified. It also plays a part in improving efficiency in music work, in music education, and in protecting music copyrights. In this paper, we reviewed two deep learning-based AMT models: MT3, based on the T5 transformer, and DiffRoll, based on the diffusion model. We also analyzed the performance of the two models on music datasets such as MAESTRO and MAPS and explored their unique features.

Index Terms—Automatic Music Transcription, Deep Learning, Diffusion, Transformer

I. INTRODUCTION

Automatic music transcription is an important and challenging task in deep learning. It requires multi-pitch estimation, onset and offset detection, instrument recognition, tempo tracking, score arrangement, and other subtasks. AMT models can transcribe music into readable and editable scores (e.g. MIDI, piano roll); therefore they are widely used in music composition, performance, and editing. AMT also contributes to improving efficiency in music-related tasks, music education, and music copyright protection. With the development of deep learning, neural network-based models have been adapted to music tasks, including AMT.

Music transcription can be viewed as a music-sequence-to-note-sequence, i.e. sequence-to-sequence, problem. The Transformer [13], a model built on the attention mechanism, has been applied to music transcription [4]. That work adopted the T5 transformer [9] framework with small modifications and can be used for piano music transcription. Based on that model, Gardner et al. proposed MT3 [3], also built on the T5 transformer, which can transcribe both piano and multi-instrument music. Both models are trained in a supervised manner, unlike Raffel et al. [9], who first trained the T5 transformer in an unsupervised way to learn the features of the data and then used supervised training so that the model could adapt to different tasks (e.g. classification, translation). Supervised learning needs labels to learn representations, but labeling data can be labor intensive.

Most AMT models treat the task as a discriminative task, which means the model needs to detect and describe different properties of the music. An AMT model first analyzes the input music to extract musical notes, pitches, and the start and end times of notes, which involves signal processing and spectrogram analysis. Afterward, it translates the extracted properties into musical symbols for representation.

As a generative model inspired by nonequilibrium thermodynamics, the diffusion model [10] is also used in music transcription [1]. It first gradually adds noise to the mel spectrograms of the input music to destroy the data distribution, then trains a model to reverse the process and restore the music. Apart from music transcription, it can also be used for music generation and inpainting.

In this paper, we aimed at comparing two different types of deep learning models for AMT. We analyzed the network structures of the Transformer-based MT3 and the diffusion-based DiffRoll. We trained both models on the MAESTRO [5] dataset and compared their F1 scores. We also explored the unique properties of these two models on other music-related tasks, including MT3's performance in multi-instrument music transcription and DiffRoll's performance in music generation and music inpainting. In addition, we trained the DiffRoll model in an unsupervised way.

The main contributions of this work can be summarized as follows:

• We compared the performance of two different models (DiffRoll, MT3) by training them on the MAESTRO dataset and comparing their F1 scores.
• We used MusicNet [12], a multi-instrument music dataset, to train MT3 and tested its performance on multi-instrument music transcription.
• We trained the DiffRoll model in an unsupervised way. In addition, we trained the model to complete other tasks, including music generation and music inpainting, which are the model's unique features thanks to the diffusion model's generative ability.

II. RELATED WORK

A. T5 Transformer

As a model based on attention, the Transformer outperforms most neural network models in many tasks, especially sequence information retrieval tasks. Among the various Transformer models, the T5 Transformer contains an encoder and a decoder. The input is encoded into context vectors by the encoder, and the vectors are decoded by the decoder into target text. In natural language processing tasks, for example, the input of T5 is a task description in textual form, such as "Summarize this text". The input is encoded into embedding vectors. The training of T5 includes unsupervised pretraining and supervised fine-tuning. During unsupervised pretraining, the model encodes and decodes an unlabeled dataset to learn the features of the original data. After that, fine-tuning helps the model adapt to specific tasks like classification and translation; fine-tuning enables T5 to adjust itself to different tasks.
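To make this text-to-text interface concrete, the following minimal sketch prompts a pretrained T5 checkpoint with a task prefix using the Hugging Face transformers library. The checkpoint name and prompt are only illustrative and are not part of the models studied in this paper.

```python
# Minimal sketch of T5's text-to-text interface (illustrative only; the
# transcription models in this paper operate on spectrograms, not text prompts).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is specified inside the input text, e.g. a summarization prefix.
text = "summarize: Automatic music transcription converts audio recordings into symbolic scores."
input_ids = tokenizer(text, return_tensors="pt").input_ids

# The encoder maps the prompt to context vectors; the decoder generates the target text.
output_ids = model.generate(input_ids, max_new_tokens=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```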

Hawthorne et al. [4] proposed a T5-based music transcription model. As shown in Fig. 1, the model's input is the spectrogram of the music, and the output is a sequence of MIDI-like tokens; it is therefore a sequence-to-sequence model for music transcription. Unlike the original T5, the work of Hawthorne et al. trained the model in a supervised way. Based on this model, Gardner et al. proposed MT3 [3], which can transcribe multi-instrument music. The MIDI-like output tokens of MT3 contain words indicating different instruments, so the model can transcribe multi-instrument music.

Fig. 1: Architecture of [4]
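Neither [4] nor [3] is reproduced here; as a rough sketch under our own assumptions, a T5-style encoder-decoder can be driven by spectrogram frames through its inputs_embeds argument and trained against MIDI-like token targets. The projection layer, vocabulary size, and tensor shapes below are placeholders, not the actual configuration of those models.

```python
# Sketch of a sequence-to-sequence transcription setup in the spirit of [4]/[3]:
# spectrogram frames go in as encoder embeddings, MIDI-like tokens come out.
# Sizes and the projection layer are assumptions made for illustration.
import torch
import torch.nn as nn
from transformers import T5Config, T5ForConditionalGeneration

n_mels, d_model, vocab_size = 128, 512, 1536        # illustrative sizes
config = T5Config(d_model=d_model, vocab_size=vocab_size)
model = T5ForConditionalGeneration(config)

# Project each mel-spectrogram frame to the model dimension.
frame_proj = nn.Linear(n_mels, d_model)

mel = torch.randn(2, 256, n_mels)                   # (batch, frames, mel bins), dummy data
target_tokens = torch.randint(0, vocab_size, (2, 128))  # MIDI-like token ids, dummy data

# T5 accepts precomputed encoder embeddings via `inputs_embeds`.
outputs = model(inputs_embeds=frame_proj(mel), labels=target_tokens)
outputs.loss.backward()                             # one supervised training step
```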
B. Diffusion and DiffRoll

Inspired by nonequilibrium thermodynamics, the diffusion model [10] iteratively destroys the data distribution of the training dataset by adding noise in the diffusion process, and trains a model that can restore the data through the reverse diffusion process. The diffusion process and its reverse help the model fully learn the probabilistic information of the data. The diffusion model is a generative model and can be used for generation, inpainting, etc.

DiffRoll [1] is the first model to introduce diffusion into music transcription. It treats music transcription as a generative task, which means the model is trained to generate posteriorgrams from the spectrograms of the music. During the training process, the model iteratively adds Gaussian noise to the input x_roll to obtain x_t using Eq. (1), in which ε ∼ N(0, I), the diffusion rate ᾱ_t = ∏_{s=1}^{t} α_s, and α_0 = 1. In the case of DiffRoll, α_t ∈ [0.98, 0.9999].

    x_t = √(ᾱ_t) · x_roll + √(1 − ᾱ_t) · ε    (1)

Fig. 2: The diffusion and reverse process of DiffRoll [1]

During the reverse process, DiffRoll uses Algorithm 1 to restore the posteriorgram, which is then binarized into a MIDI file (MIDI is a binarized format). In every iteration of the reverse process, noise ε is sampled from a normal distribution N(0, I). The output x̂_0 is calculated from the conditional output x̂_0(c_mel) (the output when a spectrogram is given) and x̂_0(−1) (the output when there is no spectrogram input, which is defined by c_mel = −1). The value of σ_t is calculated as in DDPM (Denoising Diffusion Probabilistic Models) [6].

Algorithm 1: Sampling in the reverse diffusion process
    x_T ∼ N(0, I)
    for t = T, ..., 1 do
        if t > 1 then ε ∼ N(0, I) else ε = 0
        x̂_0 = (1 + w) · x̂_0(c_mel) − w · x̂_0(−1)
        ε_θ(t) = (x_t − √(ᾱ_t) · x̂_0) / √(1 − ᾱ_t)
        x_{t−1} = √(ᾱ_{t−1}) · x̂_0 + √(1 − ᾱ_{t−1} − σ_t²) · ε_θ(t) + σ_t · ε
    end for

The training process of DiffRoll is shown in Fig. 3. When training DiffRoll, we used Classifier-Free Guidance (CFG) to train the same model in both supervised and unsupervised ways. As shown in Fig. 3, we used a dropout layer to randomly mask data from the input batch with a probability of p, which means setting the mel spectrogram c_mel = −1. Since c_mel ∈ [0, 1], setting the masked value to −1 prevents the model from confusing masked inputs with real spectrogram values.

Fig. 3: The diffusion and reverse process of DiffRoll [1]
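A minimal sketch of the three ingredients just described (the forward noising of Eq. (1), the CFG spectrogram dropout, and one guided step of Algorithm 1) is given below. It assumes a generic denoising network model(x_t, t, c_mel) that predicts x̂_0; this placeholder, the noise schedule, and all shapes are our own assumptions, not the released DiffRoll implementation.

```python
# Sketch of DiffRoll-style building blocks. `model` is a hypothetical denoiser
# that predicts x̂_0 from (x_t, t, c_mel); schedule and shapes are illustrative.
import torch

T = 200
alphas = torch.linspace(0.9999, 0.98, T)      # alpha_t kept within [0.98, 0.9999]
alpha_bar = torch.cumprod(alphas, dim=0)      # running product, i.e. \bar{alpha}_t

def diffuse(x_roll, t):
    """Forward process, Eq. (1): x_t = sqrt(abar_t)·x_roll + sqrt(1-abar_t)·eps."""
    eps = torch.randn_like(x_roll)
    ab = alpha_bar[t]
    return ab.sqrt() * x_roll + (1.0 - ab).sqrt() * eps, eps

def cfg_dropout(c_mel, p=0.1):
    """Classifier-free guidance: with probability p, replace the spectrogram by -1."""
    drop = torch.rand(c_mel.shape[0]) < p
    c_mel = c_mel.clone()
    c_mel[drop] = -1.0
    return c_mel

@torch.no_grad()
def guided_reverse_step(model, x_t, t, c_mel, w=0.5, sigma_t=0.0):
    """One iteration of Algorithm 1 with guidance weight w."""
    eps = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    x0_cond = model(x_t, t, c_mel)                              # x̂_0(c_mel)
    x0_uncond = model(x_t, t, torch.full_like(c_mel, -1.0))     # x̂_0(-1)
    x0 = (1 + w) * x0_cond - w * x0_uncond
    eps_theta = (x_t - alpha_bar[t].sqrt() * x0) / (1 - alpha_bar[t]).sqrt()
    ab_prev = alpha_bar[t - 1]
    return ab_prev.sqrt() * x0 + (1 - ab_prev - sigma_t ** 2).sqrt() * eps_theta + sigma_t * eps
```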
III. METHODOLOGY

In this paper, we study the performance of DiffRoll and MT3 in piano music transcription. We trained the DiffRoll and MT3 models and compared their performance on the MAESTRO v3 dataset [5]. The datasets used are introduced below.

• MAESTRO v3: MAESTRO v3 (MIDI and Audio Edited for Synchronous TRacks and Organization) was collected from performances on pianos that capture MIDI. It contains 1184 piano pieces totaling 172.3 hours. Its training split contains 954 pieces, the validation split 105 pieces, and the test split 125 pieces.
• MusicNet: MusicNet contains multi-instrument music with manually labelled annotations (MIDI files), which means the labels of this dataset are less accurate.

This work adopted the F1 score to evaluate the accuracy of the models. mir_eval [8] is a Python library that calculates the different F1 scores; a small computation sketch follows this list.

• Frame F1: Every second of the music is divided into 62.5 frames and the sequence of notes is encoded into a [frames × 128] matrix. Frame F1 estimates whether frames are aligned.
• Note F1: Note F1 compares the onset, offset, and pitch of a note. For a note to be estimated correctly, the estimated note ρ̂ must have the same pitch as the corresponding original note ρ, and the onset times of the two notes must differ by no more than 50 ms.
• Onset F1: This F1 score considers an estimation to be correct if the estimated note and the original note have the same pitch and the difference between their onset times is smaller than 50 ms.
• Onset-Offset F1: Apart from the conditions in Onset F1, the onset-offset F1 score also considers the end time of the note.
• Multi-instrument F1: This is used by MT3 as a score for multi-instrument music transcription. For an estimate to be correct, the estimated instrument must also be the same as the original one.
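Below is the computation sketch referenced above: frame F1 computed directly from binary piano rolls with NumPy, and a note-level onset F1 computed with the mir_eval transcription module. The note intervals and pitches are dummy values for illustration.

```python
# Sketch: frame-level F1 from binary piano rolls, and note onset F1 via mir_eval [8].
import numpy as np
import mir_eval

def frame_f1(ref_roll, est_roll):
    """ref_roll, est_roll: binary [frames x 128] matrices (62.5 frames per second)."""
    tp = np.logical_and(ref_roll, est_roll).sum()
    precision = tp / max(est_roll.sum(), 1)
    recall = tp / max(ref_roll.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# Note-level score: onsets matched within 50 ms; offsets ignored (offset_ratio=None).
ref_intervals = np.array([[0.50, 1.00], [1.20, 1.80]])   # [onset, offset] in seconds
ref_pitches = np.array([440.0, 493.88])                  # pitches in Hz
est_intervals = np.array([[0.52, 0.95], [1.21, 1.70]])
est_pitches = np.array([440.0, 493.88])

precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)
print("Onset F1:", f1)
```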
In addition, we also tested MT3's multi-instrument music transcription ability. We trained the model on the MusicNet dataset [12] and used the onset, onset+offset, and multi-instrument F1 scores to show its performance. For DiffRoll, we tested unsupervised training for music transcription, as well as music generation and music inpainting.

IV. EXPERIMENT

In this paper we trained MT3 [3] and DiffRoll [1] on the MAESTRO dataset [5] to compare their performance. We also trained MT3 on MusicNet [12] to study its multi-instrument music transcription. Furthermore, we trained DiffRoll in an unsupervised way and performed music generation and inpainting with it.

A. Performance of supervised DiffRoll

When testing the performance of supervised DiffRoll, we designed a discriminative model with the same network architecture as DiffRoll. Fig. 4 shows the performance of DiffRoll and the discriminative model. As shown in the figure, the note F1 score of the DiffRoll model decreases as the dropout rate p increases, because the model has a lower probability of seeing data paired with labels. Also, when p = 0.1 and the value w in Algorithm 1 is set to -0.5, the model reaches its peak score of 0.71.

Fig. 4: Comparison between the diffused spectrogram x_{t−1} and Gaussian noise at diffusion step t = 200

B. Comparison between MT3 and DiffRoll

We trained MT3 and supervised DiffRoll on the training set of MAESTRO and tested both models on the test set of the same dataset. MT3 achieved a higher frame F1 score (0.88) compared to DiffRoll (0.71). The dropout rate p of DiffRoll was set to 0.1.

C. DiffRoll trained in different ways

Tab. I reports DiffRoll's note F1 under different training methods. p denotes the dropout rate of the model, i.e. the probability of dropping the input spectrogram (setting c_mel = −1), and w is the guidance weight in Algorithm 1. In the following settings, "p = 0.1 unsupervised" means the model is pre-trained on MAESTRO with a dropout rate of 0.1 and then trained in a supervised way on the MAPS [2] dataset. "p = 0 + 1" means the model is trained on MAPS with p set to 0 and on MAESTRO with p set to 1.

TABLE I: DiffRoll's note F1 using different training methods

    Model                      Note F1, w = 0    Note F1, w = 0.5
    p = 0.1, supervised        0.56              0.68
    p = 0.1, unsupervised      0.59              0.63
    p = 0 + 1, unsupervised    0.61              0.64

D. MT3 in multi-instrument music transcription

We tested MT3 on the test data of MusicNet, a dataset with multi-instrument music. It achieves an onset F1 of 0.50, an onset+offset F1 of 0.33, and a multi-instrument F1 of 0.34.

E. DiffRoll's generation and inpainting

We also tested DiffRoll's music generation. The total number of diffusion steps is 200. As shown in Fig. 5, x_{t−1} at t = 200 is similar to Gaussian noise. After the reverse diffusion process reaches t = 1, the spectrogram is as shown in Fig. 6, with clear chords.

DiffRoll can also be used for music inpainting. As shown in Fig. 7, the part of the spectrogram circled in red is covered with c_mel = −1, and the model inpainted the circled part with notes.
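As a sketch of this inpainting setup, the region to be regenerated is simply masked out of the conditioning spectrogram with c_mel = −1 before running the usual reverse process. The function sample_posteriorgram below is a hypothetical placeholder for that sampler, and the tensor shape is illustrative.

```python
# Sketch of inpainting with DiffRoll-style conditioning: mask a time region of
# the conditioning spectrogram with -1, then run the reverse diffusion as usual.
import torch

def mask_region(c_mel, start_frame, end_frame):
    """Replace the chosen time region of the conditioning spectrogram with -1."""
    c_masked = c_mel.clone()
    c_masked[:, start_frame:end_frame, :] = -1.0
    return c_masked

c_mel = torch.rand(1, 640, 229)              # (batch, frames, mel bins), dummy values
c_inpaint = mask_region(c_mel, 200, 320)     # the model must fill in frames 200-320
# posteriorgram = sample_posteriorgram(c_inpaint)   # placeholder reverse-process sampler
```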

Fig. 5: Comparison between the diffused spectrogram x_{t−1} and Gaussian noise at diffusion step t = 200

Fig. 6: Comparison between the diffused spectrogram x_{t−1} and Gaussian noise at diffusion step t = 1

Fig. 7: Music inpainting by DiffRoll

V. CONCLUSION

In this paper, we tested the performance of two different models on music transcription tasks and tested their unique applications. MT3 outperforms DiffRoll by 24% in frame F1 and can be used for multi-instrument music transcription.

As a diffusion model, DiffRoll can be used for multiple tasks, including generation and inpainting. As the first work to apply the diffusion model to music transcription, it proved the possibility of using the diffusion model to learn music representations.

VI. FUTURE WORK

When listening to the results of MT3's multi-instrument transcription, it can be noticed that they lack instrument dynamics information. This can be caused by the lack of instrument dynamics information in the output tokens of the MT3 training dataset. Therefore, we can add more details to multi-instrument dataset labels so that transcriptions can be more accurate. It is also possible to apply different Transformer models (e.g. Swin [7]) to music understanding.

Considering the development of generative models in natural language processing and computer vision, generative models may play a more important role in music information retrieval. We can explore other generative models for music representation learning. For example, consistency models, proposed by Song et al. in 2023 [11], can be trained by distilling pre-trained diffusion models or as standalone generative models. Consistency models support fast one-step generation and zero-shot data editing. We can also use neural networks with optimized structures (e.g. [14]) for music transcription tasks.

REFERENCES

[1] Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, and Yuki Mitsufuji. DiffRoll: Diffusion-based generative music transcription with unsupervised pretraining capability. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[2] Valentin Emiya, Nancy Bertin, Bertrand David, and Roland Badeau. MAPS - a piano database for multipitch estimation and automatic transcription of music. 2010.
[3] Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, and Jesse Engel. MT3: Multi-task multitrack music transcription. arXiv preprint arXiv:2111.03017, 2021.
[4] Curtis Hawthorne, Ian Simon, Rigel Swavely, Ethan Manilow, and Jesse Engel. Sequence-to-sequence piano transcription with transformers. arXiv preprint arXiv:2107.09142, 2021.
[5] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv preprint arXiv:1810.12247, 2018.
[6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[7] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[8] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In ISMIR, pages 367–372, 2014.
[9] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[10] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[11] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
[12] John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch. arXiv preprint arXiv:1611.09827, 2016.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[14] Xuetao Xie, Huaqing Zhang, Junze Wang, Qin Chang, Jian Wang, and Nikhil R. Pal. Learning optimized structure of neural networks by hidden node pruning with l1 regularization. IEEE Transactions on Cybernetics, 50(3):1333–1346, 2019.

