A survey of deep learning audio generation methods


Matej Božić∗, Marko Horvat†
Department of Applied Computing
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia
[email protected], [email protected]

arXiv:2406.00146v1 [cs.SD] 31 May 2024

Abstract—This article presents a review of typical techniques used in three distinct aspects of deep learning model development for audio generation. In the first part of the article, we provide an explanation of audio representations, beginning with the fundamental audio waveform. We then progress to the frequency domain, with an emphasis on the attributes of human hearing, and finally introduce a relatively recent development. The main part of the article focuses on explaining basic and extended deep learning architecture variants, along with their practical applications in the field of audio generation. The following architectures are addressed: 1) Autoencoders 2) Generative adversarial networks 3) Normalizing flows 4) Transformer networks 5) Diffusion models. Lastly, we will examine four distinct evaluation metrics that are commonly employed in audio generation. This article aims to offer novice readers and beginners in the field a comprehensive understanding of the current state of the art in audio generation methods as well as relevant studies that can be explored for future research.

Index Terms—Deep learning, Audio representations, Audio generation, Generative models, Sound synthesis

Figure 1. The main sections of the survey: IV. Audio features (raw waveform, mel-spectrogram, neural codecs); V. Architectures (auto-encoders, generative adversarial networks, normalizing flows, transformer networks, diffusion models); VI. Evaluation metrics (human evaluation, inception score, Fréchet distance, Kullback-Leibler divergence).

I. INTRODUCTION

The trend towards deep learning in Computer Vision (CV) and Natural Language Processing (NLP) has also reached the field of audio generation [1]. Deep learning has allowed us to move away from the complexity of hand-crafted features towards simple representations by letting the depth of the model create more complex mappings. We define audio generation as any method whose outputs are audio and cannot be derived solely from the inputs. Even though tasks such as text-to-speech involve translation from the text domain to the speech domain, there are many unknowns, such as the speaker's voice. This means that the models have to invent or generate information for the translation to work. There are many applications for audio generation. We can create human-sounding voice assistants, generate ambient sounds for games or movies based on the current visual input, create various music samples to help music producers with ideas or composition, and much more. The structure of the presented survey on deep learning audio generation methods is illustrated in Figure 1.

This article will mainly focus on deep learning methods, as the field seems to be developing in this direction. Nevertheless, section III will examine the development of audio generation methods over the years, starting around the 1970s. We consider this section important because, just as deep learning methods have re-emerged, there may be a time when audio generation methods that are now obsolete become state-of-the-art again. The goal is to take a broad but shallow look at the field of audio generation. Some areas, such as text-to-speech, will be more heavily represented as they have received more attention, but an attempt has been made to include many different subfields. This article does not attempt to present all possible methods but only introduces the reader to some of the popular methods in the field. Each listing of works on a topic is sorted so that the most recent articles are at the end.

The article is structured as follows: section II presents previous work dealing with deep learning in audio, section III gives a brief overview of previous audio generation methods in text-to-speech and music generation, section IV deals with the two most prominent features and a recent advancement, section V discusses five deep learning architectures and some of their popular extensions, and finally, section VI looks at measuring the performance of generation models, some of which are specific to audio generation, while others are more generally applicable.

II. RELATED WORK

In this section, we will mention some of the works that are good sources for further research in the field of audio generation. Some of them investigate only a specific model architecture or sub-area, while others, like this work, show a broader view.

In Zhao, Xia, and Togneri [2], deep learning discriminative and generative architectures are discussed, along with their applications in speech and music synthesis. The article covers discriminative neural networks such as Multi-Layer Perceptron
(MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), as well as generative neural networks like Variational Autoencoders (VAE) and Deep Belief Networks (DBN). They also describe generative adversarial networks (GAN), their flaws, and enhancement strategies (with Wasserstein GAN as a standout). The study mainly focuses on speech generation and doesn't focus much on different hybrid models.

In contrast, Purwins et al. [3] emphasizes other areas of modeling, including feature representations, loss functions, data, and evaluation methods. It also investigates a variety of additional application fields, including enhancement as well as those outside of audio generation, such as source separation, audio classification, and tagging. They describe various audio aspects that are not covered here, such as the mel frequency cepstral coefficients (MFCC) and the constant-Q spectrogram. They do not cover as many architectures, but they do provide domain-specific datasets and evaluation methods.

Unlike previous works, Briot, Hadjeres, and Pachet [4] attempts to comprehensively examine a specific field of audio generation. This study considers five dimensions of music generation: objective, representation, architecture, challenge, and strategy. It looks at a variety of representations, both domain-specific and more general. It explains the fundamentals of music theory, including notes, rhythm, and chords. It introduces various previously established architectures such as MLP, VAE, RNN, CNN, and GAN, as well as some new ones like the Restricted Boltzmann Machine (RBM). Finally, it discusses the many challenges of music generation and ways of overcoming them. The work is quite extensive; however, some sections may benefit from a more detailed explanation.

Huzaifah and Wyse [5] is another work that explores the subject of music generation and includes music translation. It discusses data representation, generative neural networks, and two popular DNN-based synthesizers. It discusses the issue of long-term dependence and how conditioning might alleviate it. It explains the autoregressive (AR) and normalizing flow (NF) models, as well as VAE and GAN.

Peeters and Richard [1] provides an overview of deep learning techniques for audio. It distinguishes architectures from meta-architectures. The architectures include MLP, CNN, Temporal Convolutional Networks (TCN), and RNN, while the meta-architectures are Auto-Encoders (AE), VAE, GAN, Encoder/Decoder, Attention Mechanism, and Transformers. It divides audio representations into three categories: time-frequency, waveform, and knowledge-driven. Time-frequency representations include the Short-Time Fourier Transform (STFT), MFCC, Log-Mel-Spectrogram (LMS), and Constant-Q-Transform (CQT). The article concludes with a list of applications for audio deep learning algorithms, including music content description, environmental sound description, and content processing. It also briefly discusses semi-supervised and self-supervised learning.

Tan et al. [6] provides a comprehensive overview of TTS methods, including their history. It explains the basic components of TTS systems, such as text analysis, acoustic models, and vocoders, and includes a list of models in each area. Finally, it discusses advanced methods for implementing TTS systems in certain use situations, such as Fast TTS, Low-Resource TTS, and Robust TTS.

Shi [7] discusses TTS, music generation, audiovisual multimodal processing, and datasets. This effort differs from earlier ones in that it organizes relevant articles by category rather than explaining subjects in depth.

Natsiou and O'Leary [8] is the closest work to this one. It follows a similar structure, starting with input representations including raw waveforms, spectrograms, acoustic characteristics, embeddings, and symbolic representations, followed by conditioning representations used to guide audio synthesis. It includes audio synthesis techniques such as AR, NF, GAN, and VAE. The article concludes with the following evaluation methods: perceptual evaluation, number of statistically different bins, inception score, distance-based measurements, spectral convergence, and log likelihood.

Latif et al. [9] provides an overview of transformer architectures used in the field of speech processing. The article provides a description of the transformer, a list of popular transformers for speech, and a literature review on its applications.

Zhang et al. [10] surveys TTS and speech enhancement, with a focus on diffusion models. Although the emphasis is on diffusion models, they also discuss the stages of TTS, pioneering work, and specialized models for distinct speech enhancement tasks.

Mehrish et al. [11] conducted a comprehensive survey of deep learning techniques in speech processing. It begins with speech features and traditional speech processing models. It addresses the following deep learning architectures: RNN, CNN, Transformer, Conformer, Sequence-to-Sequence models (Seq2seq), Reinforcement learning, Graph neural networks (GNN), and diffusion probabilistic networks. It explains supervised, unsupervised, semi-supervised, and self-directed speech representation learning. Finally, it discusses a variety of speech processing tasks, including neural speech synthesis, speech-to-speech translation, speech enhancement, and audio super resolution, as well as transfer learning techniques.

III. BACKGROUND

The main purpose of this section is to show how audio generation has developed over the years up to this point. Since audio generation is a broad field that encompasses many different areas, such as text-to-speech synthesis, voice conversion, speech enhancement, etc. [12], we will only focus on two different areas of audio generation: text-to-speech synthesis and music generation. There is no particular reason for this choice, except that they are among the more popular ones. The trend we want to show is how domain-specific knowledge is shifting towards general-purpose methods and how feature engineering is turning into feature recognition.

A. Text-to-Speech

Text-to-speech (TTS) is a task with numerous applications, ranging from phone assistants to GPS navigators. The desire to construct a machine that can communicate with a human has historically fueled growth in this subject. Conventional speech
synthesis technologies include rule-based concatenative speech synthesis (CSS) and statistical parametric speech synthesis (SPSS) [13]. CSS and SPSS, which employ speech data, may be considered corpus-based speech synthesis approaches [14].

Until the late 1980s, the field was dominated by rule-based systems [15]. They were heavily reliant on domain expertise such as phonological theory, necessitating the collaboration of many experts to develop a comprehensive rule set that would generate speech parameters. There are numerous works like Liberman et al. [16], Holmes, Mattingly, and Shearme [17], Coker, Umeda, and Browman [18], and Klatt [19].

CSS methods try to achieve the naturalness and intelligibility of speech by combining chunks of recorded speech. They can be divided into two categories: fixed inventory and unit-selection approaches [13]. Fixed inventory uses only one instance of each concatenative unit, which goes through signal processing before being combined into a spoken word. An example of this might be Dixon and Maxey [20], which uses the diphone method of segment assembly. On the other hand, unit-selection employs a large number of concatenative units, which can result in a better match between adjacent units, potentially boosting speech quality. There are two fundamental concepts: target cost and concatenation cost. The target cost determines how well an element from a database of units fits the desired unit, whereas the concatenation cost indicates how well a pair of selected units combine. The goal is to minimize both costs for the entire sequence of units; this is commonly done using a Viterbi search [13]. Although it is always possible to minimize costs, the resulting speech may still contain errors. This can arise owing to a lack of units to choose from, an issue that can be mitigated by increasing the database size. It sounds straightforward; however, doing so increases unit creation costs and search times due to the increased number of possible concatenations. All of this requires CSS techniques to pick between speech quality and synthesis speed. Works in this domain include Sagisaka [21], where they employ non-uniform synthesis units, and Hunt and Black [22], which treats units of a unit-selection database as states in a state transition network.

SPSS models speech parameters using statistical methods depending on the desired phoneme sequence. This differs from CSS techniques in that we are not maintaining natural, unaltered speech but rather teaching the model how to recreate it. In a typical SPSS system, this is done by first extracting parametric representations of speech and then modeling them using generative models, commonly by applying the maximum likelihood criterion [23]. The primary advantage of SPSS over CSS is its ability to generalize to unknown data [13]. This enables us to adjust the model to generate different voice characteristics [15]. It also requires orders of magnitude less memory because we use model parameters instead of a speech database. Although there are other SPSS techniques, the majority of research has centered on hidden Markov models (HMM) [15].

Some HMM works include Toda and Tokuda [14], which considers not only the output probability of static and dynamic feature vectors but also the global variance (GV). Nakamura et al. [24] directly models speech waveforms with a trajectory HMM. Yoshimura et al. [25] and Tokuda, Zen, and Black [26] use decision-tree-based context clustering to represent spectrum, pitch, and HMM state duration simultaneously. Commonly used contexts include the current phoneme, preceding and succeeding phonemes, the position of the current syllable within the current word or phrase, etc. [27].

The notion that the human speech system has a layered structure in its transformation of the linguistic level to the waveform level has stimulated the adoption of deep neural network speech synthesis [28]. Burniston and Curtis [29] employs an artificial neural network alongside a rule-based method to model speech parameters. Ling, Deng, and Yu [30] employs restricted Boltzmann machines and deep belief networks to predict speech parameters for each HMM state. Some other methods worth noting are multi-layer perceptron [28, 31–34], time-delay neural network [35, 36], long short-term memory [37–40], gated recurrent unit [41], attention-based recurrent network [42], and mixture density network [41, 43].

The TTS system consists of four major components: the first converts text to a linguistic representation, the second determines the duration of each speech segment, the third converts the linguistic and timing representations into speech parameters, and the fourth is the vocoder, which generates the speech waveform based on the speech parameters [35]. The majority of the works we presented focused on converting the linguistic representation into speech parameters, but there are also models focusing on, for example, grapheme-to-phoneme conversion [44, 45] to allow TTS without knowledge of linguistic features. Examples of vocoders include MLSA [46], STRAIGHT [47], and Vocaine [48]. Finally, there have also been attempts to construct a fully end-to-end system, which means integrating text analysis and acoustic modeling into a single model [42].

B. Music generation

Music has been a part of human life since long before the invention of the electronic computer, and people have developed many guidelines for how beautiful-sounding music should be made. For this reason alone, the discipline of music generation has placed a heavy emphasis on rule-based systems that use music theory to create logical rules. Unlike text, musical vocabulary is rather tiny, consisting of at most several hundred discrete note symbols [49]. Music creation is classified into six categories: grammars, knowledge-based, Markov chains, artificial neural networks, evolutionary methods, and self-similarity [50]. Specific methods include discrete nonlinear maps [51, 52], rule-based [53, 54], genetic algorithm [55–58], recurrent neural network [57, 59], long short-term memory [60–63], Markov chain [52, 64], context-free grammars [58, 64], context-sensitive grammars [65, 66], cellular automaton [67], random fields [49], L-systems [68], knowledge base [69], and restricted Boltzmann machines [70]. Unlike language, music employs a significantly smaller number of acoustic features. These include MIDI representation [56, 62], encoded sheet music [57], binary vector of an octave [49], and piano roll [62, 70].
IV. AUDIO FEATURES

Numerous audio features have been used throughout the history of audio generation. Here we will describe the two most popular features, the raw waveform and the log-mel spectrogram, but also mention features that have recently gained traction. Keep in mind that there are too many features to describe them all, especially if we take into account the many hand-crafted features that were created before the rise of deep learning methods.

A. Raw waveform

The term "raw audio" typically refers to a waveform recorded using pulse code modulation (PCM) [8]. In PCM, a continuous waveform is sampled at uniform intervals, known as the sampling frequency. According to the sampling principle, if a signal is sampled at regular intervals at a rate slightly higher than twice the highest signal frequency, then it will contain all of the original signal information [71]. The most common sampling frequency for audio applications is 44.1 kHz [8]; hence, we cannot represent frequencies equal to or greater than 22.05 kHz. Computers cannot store real numbers with absolute precision; thus, each sample value is approximated by assigning it an element from a set of finite values, a technique known as quantization [8]. The most common quantization levels are kept in 8 bits (256 levels), 16 bits (65,536 levels), and 24 bits (16.8 million levels) [8].

The advantage of using raw audio waveforms is that they can be easily transformed into actual sound. In certain tasks, the disadvantages appear to outweigh the benefits, as raw waveforms are still not universally used. The issue is that raw audio synthesis at higher bit depths becomes problematic due to the sheer number of states involved [72]. For example, 24-bit audio signals have more than 16 million states. High sampling rates create exceptionally long sequences, making raw audio synthesis more challenging [73]. µ-law companding is frequently employed in speech generative models like WaveNet [74] to compress the range of integer sample values. The method can quantize each timestep to 256 values and still reconstruct high-quality audio [73]. According to Dieleman, van den Oord, and Simonyan [75], increased bit depth representation can lead to models learning undesirable aspects, such as the calm background of the surroundings. It should be emphasized that this issue was only observed in older publications and is not discussed in current ones.

The most common models that use raw waveforms as their representation of choice are text-to-speech models called vocoders. In section III, we mentioned vocoders, which are used to translate intermediate representations, such as mel-spectrograms, to raw audio waveforms. Examples include WaveNet [74], SampleRNN [76], and Deep Voice 3 [77].
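To make the µ-law idea concrete, the sketch below shows a minimal, illustrative encoder/decoder pair in Python (NumPy only). It assumes floating-point samples in [−1, 1] and the common µ = 255 companding constant used in 8-bit telephony and WaveNet-style models; the function and variable names are our own, not taken from any particular implementation.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand samples in [-1, 1] and quantize them to mu+1 discrete codes."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] -> integer codes {0, ..., mu}
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, mu=255):
    """Invert the quantization and the companding curve."""
    compressed = 2.0 * codes.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * ((1.0 + mu) ** np.abs(compressed) - 1.0) / mu

# Example: a 440 Hz tone at 16 kHz, quantized to 256 codes instead of
# 65,536 16-bit levels, yet reconstructed with only a small error.
t = np.arange(16000) / 16000.0
signal = 0.5 * np.sin(2 * np.pi * 440 * t)
codes = mu_law_encode(signal)
restored = mu_law_decode(codes)
print(codes.min(), codes.max(), np.abs(signal - restored).max())
```

Because the companding curve allocates more of the 256 codes to low-amplitude samples, the quantization noise is kept perceptually small even at 8 bits.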
B. Mel-spectrogram

Before we can talk about mel-spectrograms, we must first understand the Short-Time Fourier Transform (STFT). To represent audio frequencies, we use the Discrete Fourier Transform (DFT), which transforms the original waveform into a sum of weighted complex exponentials [78]. The problem emerges when we attempt to analyze complex audio signals; because the content of most audio signals changes over time, we can't use the DFT alone to figure out how frequencies change. Instead, we use the STFT to apply the DFT to overlapping sections of the audio waveform [8]. Most techniques that use the STFT to represent audio consider just its amplitude [1], which results in a lossy representation. By removing the phase of the STFT, we can arrange it in a time/frequency visual, creating a spectrogram.

A mel-spectrogram compresses the STFT in the frequency axis by projecting it onto a scale known as the mel-scale [79]. The mel-scale divides the frequency range into a set of mel-frequency bands, with higher frequencies having lower resolution and lower frequencies having higher resolution [11]. The scale was inspired by the non-linear frequency perception of human hearing [11]. Applying the logarithm to the amplitude results in the log-mel-spectrogram [1]. Finally, applying the discrete cosine transform yields the mel frequency cepstral coefficients (MFCC) [3]. MFCC is a popular representation in speech applications [13], but it was shown to be unnecessary with deep learning models [1, 3].

While representations such as the STFT and raw waveform are invertible, the spectrogram is not, so we must use some approach to approximate the missing values. The algorithms used for this were already mentioned in the previous section. In addition to neural-based vocoders, other algorithms include Griffin-Lim [80], gradient-based inversion [81], single-pass spectrogram inversion (SPSI) [82], and phase gradient heap integration (PGHI) [83]. Mel-spectrograms have been frequently utilized as intermediate features in text-to-speech pipelines [7]. Tacotron 1/2 [84, 85] and FastSpeech 1/2 [86, 87] are examples of such models. To better illustrate the compression of the mel-spectrogram, take, for example, a 5-minute audio clip sampled at 44.1 kHz. With 16-bit depth, the raw waveform will take up ≈25 MB, while the mel-spectrogram with a common configuration¹ of 80 bins, 256 hop size, 1024 window size, and 1024 points of Fourier transform takes up ≈8 MB at the same bit depth. In a 2018 article, it was shown that the mel-spectrogram is preferable over the STFT because it achieves the same performance while having a more compact representation [88]. Given the field's progress, it should be emphasized that only recurrent and convolutional models were examined in that article. Another advantage of the mel-spectrogram, and spectrograms in general, is that they can be displayed as images. Because it ignores phase information, it can be shown with one dimension being frequency and the other being time. This is useful since images have been widely employed in computer vision tasks, allowing us to borrow models for usage in the audio domain. There is a concern that spectrograms aren't the same as images due to the different meaning of the axes. This does not appear to have a substantial effect, since many works implement mel-spectrograms in their convolutional models [1].

¹ Based on a small sample of articles observed using mel-spectrograms.
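As a concrete illustration of the configuration quoted above, the following sketch computes a log-mel-spectrogram with the librosa library. The 80 mel bins, 256-sample hop, and 1024-point window/FFT mirror the "common configuration" mentioned in the text; the file path, duration, and dB conversion are our own illustrative choices.

```python
import librosa
import numpy as np

# Load five minutes of audio at 44.1 kHz (mono); the path is a placeholder.
y, sr = librosa.load("example.wav", sr=44100, mono=True, duration=300)

# STFT magnitude projected onto 80 mel bands, matching the configuration above.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, win_length=1024, hop_length=256, n_mels=80, power=2.0
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # logarithm of the amplitude

# Rough size comparison from the text: 16-bit raw samples vs. 16-bit mel frames.
raw_bytes = y.shape[0] * 2
mel_bytes = log_mel.size * 2
print(f"raw ≈ {raw_bytes / 1e6:.1f} MB, log-mel ≈ {mel_bytes / 1e6:.1f} MB")
print("log-mel shape (bins, frames):", log_mel.shape)
```

With these settings, the five-minute clip yields roughly 51,700 frames of 80 bins each, which reproduces the ≈25 MB versus ≈8 MB comparison given above.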
C. Neural codecs

An audio codec is a signal processing technique that compresses an audio signal into discrete codes before using those codes to reconstruct the audio signal, which is not always possible with complete accuracy. A typical audio codec system consists of three components: an encoder, a quantizer, and a decoder. The function of each component is explained in section V-A. The goal of an audio codec is to use as little information as possible to store or transmit an audio signal while ensuring that the decoded audio quality is not significantly reduced by eliminating redundant or irrelevant information from the audio signal. Traditionally, this is accomplished by changing the signal and trading off the quality of specific signal components that are less likely to influence the quality [89]. Audio codecs have been utilized for a wide range of applications, including mobile and internet communication. There are numerous types of audio codecs; some are utilized in real-time applications like streaming, while others may be used for audio production. Whereas in streaming, latency is a larger concern, which means sacrificing quality for speed, in production, we want to retain as much detail as possible while maintaining a compact representation.

Audio codecs can be separated into two categories: waveform codecs and parametric codecs. Waveform codecs make little to no assumptions about the nature of audio, allowing them to work with any audio signal. This universality makes them well-suited for creating high-quality audio at low compression, but they tend to produce artifacts when operating at high compression [90]. Furthermore, because they do not operate well at high compression, they tend to increase storage and transmission costs. In contrast to waveform codecs, parametric codecs make assumptions about the source audio being encoded and introduce strong priors in the form of a parametric model that characterizes the audio synthesis process. The goal is not to achieve a faithful reconstruction on a sample-by-sample basis but rather to generate audio that is perceptually comparable to the original [90]. Parametric codecs offer great compression but suffer from low decoded audio quality and noise susceptibility [91].

On the way to the neural codec, we first encountered hybrid codecs, which substituted some parametric codec modules with neural networks. This type of codec improves performance by leveraging neural networks' adaptability. Following that came vocoder-based approaches, which could leverage previously introduced neural vocoders to reconstruct audio signals by conditioning them on parametric coder codes or quantized acoustic features. However, their performance and compression were still dependent on the handcrafted features received at the input [92]. The observation that separating models into modules prevents them from functioning effectively has inspired end-to-end auto-encoders (E2E AE) that accept raw waveforms as input and output. A standard E2E AE is made up of four basic components: an encoder, a projector, a quantizer, and a decoder [92]. The basic use case is to take the raw waveform and use the encoder to construct a representation with reduced temporal resolution, which is then projected into a multidimensional space by the projector component. To make the representations suitable for transmission and storage, we further quantize the projections into codes. These codes make up a lookup table, which is used at the other end by the decoder to transform the quantized representations back into a raw waveform. Shen et al. [93] defines the neural codec as a kind of neural network model that converts an audio waveform into compact representations with a codec encoder and reconstructs the audio waveform from these representations with a codec decoder. The core idea is to use the audio codec to compress the speech or sound into a set of discrete tokens, and then a generation model is used to generate these tokens [94]. They have been shown to allow for cross-modal tasks [95].

Neural codec methods include SoundStream [90], EnCodec [89], HiFi-Codec [94], AudioDec [92], and APCodec [91]. All of the said methods use residual vector quantization (RVQ), while HiFi-Codec also introduced an extension called group-RVQ. VQ methods will be talked about in section V-A. SoundStream [90] is used by AudioLM [96], MusicLM [97], SingSong [98] and SoundStorm [99], while EnCodec [89] is used by VALL-E [73], VALL-E X [100], Speech-X [101], and VioLA [95].

Finally, despite the fact that neural codec approaches are relatively new, they have not been without criticism. Shen et al. [93] noted that although RVQ can achieve acceptable reconstruction quality and a low bitrate, it is meant for compression and transmission; therefore, it may not be suited as an intermediate representation for audio production jobs. This is because the sequence of discrete tokens created by RVQ can be very long, approximately N times longer when N residual quantizers are utilized. Because language models cannot handle extremely long sequences, we will encounter inaccurate predictions of discrete tokens, resulting in word skipping, word repetition, or speech collapse issues while attempting to reconstruct the speech waveform from these tokens.

V. ARCHITECTURES

As models become more advanced, they start utilizing many different architectures in unison, making it impossible to categorize them efficiently. Therefore, each subsection will contain models that fit into many subsections but have been divided up in the way the author thought made the most sense. Unlike the audio features, there are many different architectures. Here we will mention the architectures that have been most commonly used in the field of audio generation.

A. Auto-encoders

The majority of this section was taken from works by Goodfellow, Bengio, and Courville [102] and Kingma and Welling [103].

The auto-encoder's objective is to duplicate the input into the output. It consists of two parts: an encoder and a decoder. The intersection of the two components depicts a code that attempts to represent both the input and, by extension, the output. The encoder receives input data and changes it into a code that the decoder then uses to approximate the original
input. If we allowed arbitrary values in the encoder and decoder, we would obtain no meaningful code because it would simply simulate an identity function. To obtain meaningful code, we constrain both the encoder and decoder, preventing them from just passing data through. We can accomplish this by, for example, restricting the dimensionality of the values in the model. The auto-encoder has the advantage of not requiring labeled data because it merely seeks to reconstruct the input, allowing for unsupervised learning. Lee et al. [104] demonstrates a basic use case for a simple auto-encoder, feature extraction, while VITS [105] connects two text-to-speech modules using a VAE, enabling end-to-end learning in an adversarial setting. Figure 2a depicts a simple auto-encoder setup using generic encoder and decoder components.

As this article focuses on generation, we will now introduce one of the most popular forms of the auto-encoder, the Variational Auto-Encoder (VAE) [1]. The VAE has been proposed to enable us to employ auto-encoders as generative models [8]. The VAE components can be considered as a combination of two separately parameterized models, the recognition model and the generative model. The VAE's success was mostly due to the choice of the Kullback-Leibler (KL) divergence as the loss function [8]. KL will also be described in section VI. Unlike the auto-encoder, the VAE learns the parameters of a probability distribution rather than a compressed representation of the data [5]. Modeling the probability distribution allows us to sample from the learned data distribution. The Gaussian distribution is typically used for its generality [4].
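For readers who want the loss spelled out, the standard VAE objective (the evidence lower bound, in our notation; the article itself does not write out the formula) combines a reconstruction term with the KL divergence mentioned above:

```latex
\mathcal{L}(\theta, \phi; x) =
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)}_{\text{regularization}}
```

Here q_φ is the recognition (encoder) model, p_θ the generative (decoder) model, and p(z) the Gaussian prior; training maximizes this bound.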
MusicVAE [106] uses a hierarchical decoder in a recurrent VAE to avoid posterior collapse. BUTTER [107] creates a unified multi-modal representation learning model using VAEs. Music FaderNets [108] introduces a Gaussian Mixture VAE. RAVE [109] employs multi-stage training, initially with representation learning and then with adversarial fine-tuning. Tango [110] and Make-An-Audio 2 [111] generate mel-spectrograms by using a VAE in a diffusion model.

The Vector-Quantized VAE (VQ-VAE) is an extension of the VAE that places the latent representation in a discrete latent space. VQ-VAE changes the auto-encoder structure by introducing a new component called the codebook. The most significant change happens between the encoder and decoder, where the encoder's output is used in a nearest neighbor lookup utilizing the codebook. In other words, the continuous value received from the encoder is quantized and mapped onto a discrete latent space that will be received by the decoder. VQ-VAE replaces the KL divergence loss with negative log likelihood, codebook, and commitment losses. One possible issue with the VQ-VAE is codebook collapse. This occurs when the model stops using a piece of the codebook, indicating that it is no longer at full capacity. It can result in decreased likelihoods and inadequate reconstruction [75]. Dieleman, van den Oord, and Simonyan [75] proposes the argmax auto-encoder as an alternative to VQ-VAE for music generation. MelGAN [112], VQVAE [113], Jukebox [114] with its Hierarchical VQ-VAE, DiscreTalk [115], FIGARO [116], Diffsound [117], and Im2Wav [118] use VQ-VAE to compress the input to a lower-dimensional space, while Dance2Music-GAN [119], SpeechT5 [120], VQTTS [121], and DelightfulTTS 2 [122] only use the vector quantization.

Residual Vector Quantization (RVQ) improves on plain vector quantization by computing the residual after quantization and further quantizing it using a second codebook, a third, and so on [89]. In other words, RVQ cascades N layers of VQ, where the unquantized input vector is passed through a first VQ and quantization residuals are computed; those residuals are then iteratively quantized by an additional sequence of N−1 vector quantizers [90]. Section IV-C provides a list of models that employ RVQ.
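The cascade described above is small enough to sketch directly. The NumPy toy below quantizes a single vector against a stack of random codebooks purely to illustrate the mechanics; real codecs such as SoundStream or EnCodec learn the codebooks and operate on sequences of latent frames, and all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual VQ: quantize x with each codebook in turn, always on the residual."""
    indices, residual = [], x.copy()
    for cb in codebooks:                      # cb has shape (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))           # nearest-neighbor lookup
        indices.append(idx)
        residual = residual - cb[idx]         # what this stage failed to capture
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is simply the sum of the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

dim, codebook_size, n_stages = 8, 256, 4
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_stages)]

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)          # N integer tokens per input vector
x_hat = rvq_decode(codes, codebooks)
print("codes:", codes)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

The sketch also makes the criticism quoted above tangible: each input frame produces N tokens (here 4), so the token sequence handed to a language model grows roughly N times longer than with a single quantizer.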
B. Generative adversarial networks

Another prominent generative architecture is the generative adversarial network (GAN). It is made up of two models that serve different purposes: the generator and the discriminator. The generator's function is to convert a random input vector into a data sample. The random vector is usually smaller because the generator mimics the decoder part of the auto-encoder [1]. Unlike the VAE, which imposes a distribution to generate realistic data, the GAN utilizes a second network called the discriminator [1]. It takes the generator's output or a sample from the dataset and attempts to classify it as either real or fake. The generator is penalized based on the discriminator's ability to tell the difference between real and fake. The opposite is also true: if the discriminator is unable to distinguish between the generator and the actual data points, it is penalized as well. In other words, the two neural networks face off in a two-player minimax game. According to [123], the ideal outcome for network training is for the discriminator to be 50% certain whether the input is real or bogus. In practice, we train the generator through the discriminator by reducing the probability that the sample is fake, while the discriminator does the opposite for fake data and the same for real data. Figure 2b illustrates the generator taking a random vector input and the discriminator attempting to distinguish between real and fake samples.
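The minimax game referred to above is usually written as the following value function (the standard formulation from the GAN literature; the notation is ours, not the article's):

```latex
\min_{G}\max_{D} \; V(D, G) =
\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

The practical trick mentioned in the text, training the generator by reducing the probability that its sample is judged fake, corresponds to maximizing log D(G(z)) instead of minimizing log(1 − D(G(z))), which gives stronger gradients early in training.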
an alternative to VQ-VAE for music generation. MelGAN Common issues with GAN include mode collapse, unstable
[112], VQVAE [113], Jukebox [114] with Hierarchical VQ- training, and a lack of an evaluation metric [2]. Mode collapse
VAE, DiscreTalk [115], FIGARO [116], Diffsound [117], occurs when the generator focuses exclusively on a few out-
and Im2Wav [118] use VQ-VAE to compress the input to puts that can trick the discriminator into thinking they are real.
a lower-dimensional space. While Dance2Music-GAN [119], Even if the generator meets the discriminator requirements,
SpeechT5 [120], VQTTS [121], and DelightfulTTS 2 [122] we cannot use it to produce more than a few examples. This
Figure 2. Two deep learning architectures that appear to have little in common until we look closer: (a) the auto-encoder and (b) the generative adversarial network. The generator mimics the auto-encoder's decoder, whereas the discriminator resembles the encoder.

might happen because the discriminator is unable to force the generator to be diverse [123]. The Wasserstein GAN (WGAN) is a well-known variant for addressing this problem. WGAN shifts the discriminator's job from distinguishing between real and forged data to computing the Wasserstein distance, commonly known as the Earth Mover's distance. In addition, a modification to aid WGAN convergence has been proposed; it uses a gradient penalty rather than weight clipping and is known as WGAN-GP. WGAN was used by MuseGAN [129], WaveGAN [130], TiFGAN [131], and Catch-A-Waveform [132].

Another popular modification to the GAN architecture is the use of deep convolutional networks, known as deep convolutional GANs (DCGAN). Unlike WGAN, DCGAN only requires a change to the model architecture, rather than the entire training procedure, for both the generator and discriminator. It aims to provide a stable learning environment in an unsupervised setting by applying a set of architectural constraints [123]. DCGAN is used in works such as MidiNet [124], WaveGAN [130], and TiFGAN [131].

Furthermore, it is worth noting a simple GAN extension designed to address the issue of vanishing gradients while simultaneously improving training stability. The Least Squares GAN (LSGAN) improves the quality of generated samples by altering the discriminator's loss function. Unlike the regular GAN, LSGAN penalizes correctly classified samples much more, pulling them toward the decision boundary, which allows LSGAN to generate samples that are closer to the real data [133]. Papers using LSGAN include SEGAN [134], Yamamoto, Song, and Kim [135], HiFi-GAN [136], Parallel WaveGAN [137], Fre-GAN [138], VITS [105] and V2RA-GAN [128].

There are many more modifications to GAN that we haven't mentioned, like the Cycle GAN [139] or the Boundary-Equilibrium GAN [140], as we tried to showcase the most prevalent modifications in the field of audio generation. Works like MelGAN [112], GAAE [141], GGAN [142], SEANet [143], EATS [144], Dance2Music-GAN [119] and Musika [145] use yet another type of loss, called the hinge loss.

Finally, we'd like to mention works that were challenging to categorize. They are GANSynth [146], GAN-TTS [147], RegNet [148], Audeo [149], Multi-Band MelGAN [150], Multi-Singer [151] and DelightfulTTS 2 [122].

C. Normalizing flows

Although VAE and GAN were frequently utilized in audio generation, neither allows for an exact evaluation of the probability density of new points [152]. Normalizing Flows (NF) are a family of generative models with tractable distributions that enable exact density evaluation and sampling. The network is made up of two "flows" that move in opposite directions. One flow starts with a base density, which we call noise, and progresses to a more complex density. The opposing flow reverses the direction, transforming the complex density back into the base density. The movement from base to complex is known as the generating direction, whereas the reverse is known as the normalizing direction. The term normalizing flow refers to the notion that the normalizing direction makes a complex distribution more regular, or normal. The normal distribution is typically used for the base density, which is another reason for the name. Similar to how we layer transformations in a deep neural network, we compose several simple functions to generate complex behavior. These functions cannot be chosen arbitrarily because the flow must run in both directions; hence, the functions chosen must be invertible. Using a characteristic of invertible function composition, we can create a likelihood-based estimation of the parameters that we can apply to train the model. In this setup, data generation is simple; utilizing the generative flow, we can input a sample from the base distribution and generate the required complex distribution sample. It has been formally proven that if you can build an arbitrarily complex transformation, you will be able to generate any distribution under reasonable assumptions [152].
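The "characteristic of invertible function composition" used for likelihood training is the change-of-variables formula, given here in its standard form (our notation; the article does not write it out). For an invertible generative map g with normalizing direction f = g⁻¹ and base density p_Z:

```latex
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,
\qquad
f = f_K \circ \cdots \circ f_1 \;\Rightarrow\;
\log \left| \det \frac{\partial f(x)}{\partial x} \right| = \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
```

Composing simple invertible steps therefore only adds log-determinant terms, which is what makes exact maximum-likelihood training tractable.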
Depending on the nature of the function employed in the flow, there can be significant performance differences that impact training or inference time. Inverse Autoregressive Flows (IAF) are a type of NF model with a specialized function that allows for efficient synthesis. The transform is based on an autoregressive network, which means that the current output is only determined by the current and previous input values. The advantage of this transformation is that the generative flow may be computed in parallel, allowing efficient use of resources such as the GPU. Although IAF networks can be run in parallel during inference, training with maximum likelihood estimation requires sequential processing. To allow for parallel training, a probability density distillation strategy is
used [153, 154]. In this method, we try to transfer knowledge from an already-trained teacher to a student model. Parallel WaveNet [153] and ClariNet [154] are two IAF models that employ this method. On the other hand, WaveGlow [155], FloWaveNet [156], and Glow-TTS [157] all utilize an affine coupling layer. Because this layer allows for both parallel training and inference, they can avoid the problems associated with the former. Other efforts that are worth mentioning are WaveNODE [158], which uses a continuous normalizing flow, and WaveFlow [159], which uses a dilated 2-D convolutional architecture.

At the end of this section, we'd like to address a problem that can arise when using flow-based networks with audio data. Because audio is digitally stored in a discrete representation, training a continuous density model on discrete data might lead to poor model performance. Yoon et al. [160] presented audio dequantization methods that can be deployed in flow-based networks and improve audio generation quality.

D. Transformer networks

The majority of the material discussed in this section comes from Zhang et al. [161].

Before we can discuss transformers, we need to talk about attention. Attention has three components: a query, keys, and values. We have a database of (key, value) pairs, and our goal is to locate all values that closely match their key with the query. In the transformer, we improve on this concept by introducing a new type of attention known as self-attention. Self-attention includes three new functions that accept input and have learnable parameters. The functions change the input into one of the three previously mentioned components: query, key, and value. The prefix self refers to the fact that we utilize the same input for both the database and the query. If we were to translate a sentence, we would expect the translation of the nth word to be determined not only by itself but also by the other words in the sentence. For this example, the query would be the nth word and the database would be the sentence itself; if the parameters were learned correctly, we would expect to see relevant values for translation as the output of self-attention. To boost the model's capacity to capture both short- and long-term dependencies, we can concatenate numerous self-attention modules, each with its own set of parameters, resulting in multi-head self-attention. Furthermore, if we want to prevent the model from attending to future entries, we can use masked multi-head self-attention, which employs a mask to specify which future entries we wish to ignore.
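As a minimal sketch of the mechanism just described, the plain NumPy code below implements scaled dot-product attention with an optional causal mask; variable names are our own, and a real transformer adds the learned query/key/value projections per head, multiple heads, and the surrounding layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K, V: (seq_len, d). Returns one attention 'head' worth of outputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
    if causal:
        # Mask future positions so token t only attends to tokens <= t.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)         # attention distribution per query
    return weights @ V                         # weighted sum of values

# Self-attention: queries, keys, and values all come from the same input.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                   # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv, causal=True)
print(out.shape)                               # (5, 16)
```

Multi-head self-attention simply runs several such modules in parallel with separate projection matrices and concatenates their outputs, as the paragraph above describes.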
The second essential element of the transformer is positional encoding. Instead of processing a sequence one element at a time, the self-attention in the transformer provides parallel computing. The consequence is that the sequence's order is not preserved. The prevailing method for preserving order information is to feed the model with additional input associated with each token. These additional inputs are known as positional encodings. Positional encoding can be absolute or relative, and it can be predefined or learned during training.

Finally, the transformer, like an auto-encoder, has both an encoder and a decoder. Both the encoder and decoder are made up of a stack of identical layers, each with two sublayers. The first sublayer is multi-head self-attention, whereas the second is a feed-forward network. In addition, a residual connection surrounds each of the two sublayers, followed by layer normalization. Unlike the encoder, the decoder employs both encoder-decoder attention and masked multi-head self-attention at the input. The encoder-decoder attention is a normal multi-head attention with queries from the decoder and (key, value) pairs from the encoder. Before the input is fed into the network, positional embedding is applied.

Transformers were primarily intended for natural language processing but were then used for images with the vision transformer and lately for audio signals [162]. They have revolutionized modern deep learning by providing the ability to model long-term sequences [72]. On the other hand, transformers are generally referred to as data-hungry, since they require a large amount of training data [162]. The attention mechanism's quadratic complexity makes it difficult to process long sequences [72]. To use transformers with audio, we would convert signals into visual spectrograms and divide them into "patches" that are then treated as separate input tokens, analogous to text [162]. There are many works that use the transformer architecture, including Music Transformer [163], FastSpeech [86], Wave2Midi2Wave [164], Li et al. [165], RobuTrans [166], Jukebox [114], AlignTTS [167], Multi-Track Music Machine [168], JDI-T [169], AdaSpeech [170], FastPitch [171], Verma and Chafe [72], Controllable Music Transformer [172], Lee et al. [173], SpeechT5 [120], CPS [174], FastSpeech 2 [87], FIGARO [116], HAT [175], ELMG [176], AudioLM [96], VALL-E [73], MusicLM [97], SingSong [98], SPEAR-TTS [177], AudioGen [178], VALL-E X [100], dGSLM [179], VioLA [95], MuseCoco [180], Im2Wav [118], AudioPaLM [181], VampNet [182], LM-VC [183], UniAudio [184], and MusicGen [185].

LakhNES [186] and REMI [187] use Transformer-XL, an extension of the Transformer that can, in theory, encode arbitrarily long contexts into fixed-length representations. This is accomplished by providing a recurrence mechanism [188], wherein the preceding segment is cached for later usage as an expanded context for the subsequent segment. Furthermore, to support the recurrence mechanism, it introduces an expanded positional encoding scheme known as relative positional encoding, which keeps positional information coherent when reusing states. In addition to Transformer-XL, Hawthorne et al. [189] and Yu et al. [190] presented Perceiver AR and Museformer as alternatives to tackle problems that require extended contexts.

Finally, another extension to the transformer has been successful for various speech tasks [191]. The Convolution-augmented Transformer (Conformer) extends the Transformer by incorporating convolution and self-attention between two feed-forward modules; this cascade of modules forms a single Conformer block. It integrates a relative positional encoding scheme, a method adopted from the described Transformer-XL, to improve generalization for diverse input lengths [192]. Papers utilizing the Conformer are SpeechNet [193], A3T [191], VQTTS [121], Popuri et al. [194], and SoundStorm [99].
E. Diffusion models

Diffusion models are generative models inspired by non-equilibrium thermodynamics [11]. Diffusion models, like normalizing flows, consist of two processes: forward and reverse. The forward process converts the data to a conventional Gaussian distribution by constructing a Markov chain of diffusion steps with a predetermined noise schedule. The reverse process gradually reconstructs data samples from the noise using an inference noise schedule. Unlike other architectures that change the distribution of the data, such as variational auto-encoders and normalizing flows, diffusion models keep the dimensionality of the latent variables fixed. Because the dimensionality of the latent variables must stay fixed during the iterative generation process, inference can be slow in high-dimensional spaces [195]. A potential solution is to utilize a more compressed representation, such as a mel-spectrogram, instead of a short-time Fourier transform.
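In the usual DDPM-style notation (not spelled out in the article; stated here for completeness), the forward process with noise schedule β₁,…,β_T and its closed-form marginal are:

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right),
\quad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

The reverse process learns to undo these steps, typically by predicting the added noise at each step; note that x_t keeps the same dimensionality as x_0 throughout, which is the fixed-dimensionality property mentioned above.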
Papers employing diffusion models include WaveGrad [196], DiffWave [197], Diff-TTS [198], Grad-TTS [199], DiffuSE [200], FastDiff [201], CDiffuSE [202], Guided-TTS [203], Guided-TTS 2 [204], DiffSinger [205], UNIVERSE [206], Diffsound [117], Noise2Music [207], DiffAVA [208], MeLoDy [209], Tango [110], SRTNet [210], MusicLDM [211], JEN-1 [212], AudioLDM [195], Bai et al. [213], ERNIE-Music [214], and Re-AudioLDM [215].

Transformer and diffusion models were the most popular designs discussed in this article. As a result, at the end of this section, we will list some works that use both the transformer and diffusion models. These works include Hawthorne et al. [216], Make-An-Audio 2 [111], NaturalSpeech 2 [93], Grad-StyleSpeech [217], Make-An-Audio [218], AudioLDM 2 [219], and Moûsai [220].

VI. EVALUATION METRICS

Evaluations could be considered the most important piece of the puzzle. By introducing evaluations, we are able to quantify progress. We can compare, improve, and optimize our models, all thanks to evaluation metrics. We will not mention domain-specific evaluation metrics such as the character error rate used in text-to-speech or the perceptual evaluation of speech quality used in speech enhancement. There are many more widely used metrics that we will not mention in the following sections. Some of them are: nearest neighbor comparisons, number of statistically-different bins, Kernel Inception Distance, and CLIP score.

A. Human evaluation

As humans, we are constantly exposed to various sorts of sounds, which provides us with a wealth of expertise when attempting to distinguish between real and manufactured audio. The audio-generating method we are seeking to construct is intended to trick people into thinking the sound is a recording rather than a synthesis. As a result, who better to judge the success of such systems than the ones we're attempting to fool? Human evaluation is the gold standard for assessing audio quality. The human ear is particularly sensitive to irregularities, which are disruptive for the listener [146]. Intuitively, it is simple to label an audio sample as good or bad, real or fake, but it is much more challenging to document a procedure derived from our thinking that may be used to evaluate future audio samples. The human assessment is often carried out with a defined number of listeners who listen and rate the audio on a 1-5 Likert scale. This type of test is termed the Mean Opinion Score (MOS) [221]. While MOS is used to evaluate naturalness, similarity MOS is used to assess how similar the generated and real samples are [222]. Another metric, known as the comparative MOS, can be used to compare two systems by subtracting their MOS values. We may also calculate it by providing listeners with two audio samples generated by different models and immediately judging which one is better [165]. Oord et al. [153] discovered that preference scores from a paired comparison test, frequently referred to as the A/B test, were more reliable than the MOS score. Many alternative human evaluation metrics have been proposed for music; domain-specific metrics include melody, groove, consonance, coherence, and integrity. The biggest disadvantage of human evaluation is that the findings cannot be replicated exactly. This means that the concrete numbers in the evaluation are unimportant, and only the relationship between them is crucial. This stems from the inherent subjectivity of human evaluation as well as biases or predispositions for specific auditory features.
B. Inception score

The Inception Score (IS) is a perceptual metric that correlates with human evaluation. The Inception score is calculated by applying a pretrained Inception classifier to the generative model's output. The IS is defined as the mean Kullback-Leibler divergence between the conditional output class label distribution and the labels' marginal distribution. It evaluates the diversity and quality of the audio outputs and prefers generated samples that the model can confidently classify. With N classes, the measure ranges from 1 to N. The IS is maximized when the classifier is confident in every classification of the generated samples and each label is predicted equally often [130]. Normally, the Inception classifier is trained using the ImageNet dataset, which may not be compatible with audio spectrograms. This will cause the classifier to be unable to separate the data into meaningful categories, resulting in a low IS score. An extension to the IS called the Modified Inception Score (mIS) measures the within-class diversity of samples in addition to the IS, which favors sharp and clear samples [197].
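Written out (standard definition from the literature, our notation), with p(y | x) the classifier's label distribution for a generated sample x and p(y) the marginal over generated samples:

```latex
\mathrm{IS} = \exp\!\Bigl( \mathbb{E}_{x \sim p_g}\bigl[ D_{\mathrm{KL}}\bigl( p(y \mid x) \,\|\, p(y) \bigr) \bigr] \Bigr)
```

Confident per-sample predictions make p(y | x) peaked, while diverse outputs make p(y) close to uniform; both push the KL term, and hence the score, up.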
C. Fréchet distance

The inception score is based solely on the generated samples, not taking into consideration the real samples [141]. The Fréchet Inception Distance (FID) score addresses this issue by comparing the generated samples with the real ones. The FID calculates the Fréchet distance between two distributions for both generated and real samples, using distribution parameters taken from the intermediate layer of the pretrained Inception Network [141]. The lower the FID score, the higher the perceived generation quality. It is frequently used to assess the fidelity of generated samples in the image generation domain [117]. This metric was found to correlate with perceptual quality and diversity on synthetic distributions [146]. The Inception Network is trained on the ImageNet dataset, which is purpose-built for images, but this does not ensure it will function for spectrograms. It may be unable to classify the spectrograms into any meaningful categories, resulting in unsatisfactory results [142].
Fréchet Audio Distance (FAD) is a perceptual metric work that implements them in the field of audio generation.
adapted from the FID for the audio domain. Unlike reference- Finally, we presented the four most common evaluations in
based metrics, FAD measures the distance between the gener- the works we examined. While the first three architectures
ated audio distribution and the real audio distribution using a mentioned above seem to have lost importance in recent years,
pretrained audio classifier that does not use reference audio the transformer and diffusion models seem to have taken
samples. The VGGish model [223] is used to extract the their place. This could be due to the popularization of large
characteristics of both generated and real audio [219]. As with language models such as ChatGPT or, in the case of the
the FID, the lower the score, the better the audio fidelity. diffusion models, diffusion-based text-to-image generators.
According to Kreuk et al. [178], the FAD correlates well with With the ever-increasing computing power and availability
human perceptions of audio quality. The FAD was found to be of large databases, it looks like the age of deep learning
robust against noise, computationally efficient, consistent with has only just begun. Just as deep learning has allowed us
human judgments, and sensitive to intra-class mode dropping to move from domain-dependent features and methods to a
[79]. Although FAD may indicate good audio quality, it does more universal solution, more recent work has attempted to
not necessarily indicate that the sample is desired or relevant move from a single task or purpose to a multi-task or even
[97]. For instance, in text-to-speech applications, low-FAD multi-modality model.
audio may be generated that does not match the input text.
According to Bińkowski et al. [147], the FAD measure is R EFERENCES
not appropriate for evaluating text-to-speech models since it [1] Geoffroy Peeters and Gaël Richard. “Deep Learning for Audio and Music”. In: Multi-
Faceted Deep Learning. Ed. by Jenny Benois-Pineau and Akka Zemmari. Cham: Springer
was created for music. While according to Zhu et al. [214], International Publishing, 2021, pp. 231–266. ISBN: 978-3-030-74477-9. DOI: 10.1007/978-
calculating the similarity between real and generated samples 3-030-74478-6_10.
[2] Yuanjun Zhao, Xianjun Xia, and Roberto Togneri. “Applications of Deep Learning to Audio
does not account for sample quality. Another similar metric Generation”. In: IEEE Circuits and Systems Magazine 19.4 (2019), pp. 19–38. ISSN: 1558-
0830. DOI: 10.1109/MCAS.2019.2945210.
called the Fréchet DeepSpeech Distance (FDSD) also uses [3] Hendrik Purwins et al. “Deep Learning for Audio Signal Processing”. In: IEEE Journal of
the Fréchet distance on audio features extracted by a speech Selected Topics in Signal Processing 13.2 (May 2019), pp. 206–219. ISSN: 1941-0484. DOI:
10.1109/JSTSP.2019.2908700.
recognition model [8]. Donahue et al. [144] found the FDSD [4] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques
for Music Generation – A Survey. Aug. 7, 2019. DOI: 10.48550/arXiv.1709.01620. arXiv:
to be unreliable in their use case. 1709.01620 [cs]. URL: http://arxiv.org/abs/1709.01620. preprint.
[5] M. Huzaifah and L. Wyse. Deep Generative Models for Musical Audio Synthesis. Nov. 25,
The last Fréchet metric that is important to discuss is 2020. arXiv: 2006.06426 [cs, eess, stat]. URL: http://arxiv.org/abs/2006.06426.
the Fréchet Distance (FD). Unlike the FAD, which extracts [6]
preprint.
Xu Tan et al. A Survey on Neural Speech Synthesis. July 23, 2021. DOI: 10 . 48550/ arXiv.
features using the VGGish [223] model, FD employs the 2106 . 15561. arXiv: 2106 . 15561 [cs, eess]. URL: http : / / arxiv. org / abs / 2106 . 15561.
preprint.
PANN [224] model. The model change enables the FD to [7] Zhaofeng Shi. A Survey on Audio Synthesis and Audio-Visual Multimodal Processing.
Aug. 1, 2021. DOI: 10.48550/arXiv.2108.00443. arXiv: 2108.00443 [cs, eess]. URL:
use different audio representations as input. PANN [224] uses http://arxiv.org/abs/2108.00443. preprint.
mel-spectrogram as input, whereas VGGish [223] uses raw [8] Anastasia Natsiou and Seán O’Leary. “Audio Representations for Deep Learning in Sound
Synthesis: A Review”. In: 2021 IEEE/ACS 18th International Conference on Computer
waveform. FD evaluates audio quality using an audio embed- Systems and Applications (AICCSA). 2021 IEEE/ACS 18th International Conference on
Computer Systems and Applications (AICCSA). Nov. 2021, pp. 1–8. DOI: 10 . 1109 /
ding model to measure the similarity between the embedding AICCSA53542.2021.9686838.
space of generations and that of targets [211]. [9] Siddique Latif et al. Transformers in Speech Processing: A Survey. Mar. 21, 2023. DOI:
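The recipe shared by the FID, FAD, and FD can be summarized in a few lines of code. The sketch below is a minimal illustration using only NumPy and SciPy; the embedding extraction is deliberately left abstract (any pretrained embedding model can be plugged in), and the function name and interface are ours rather than those of an existing toolkit.

import numpy as np
from scipy import linalg

def frechet_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    # real_emb, gen_emb: (num_samples, embedding_dim) arrays produced by a
    # pretrained audio embedding model (e.g., VGGish for FAD, PANN for FD).
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(real_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny imaginary
    # component introduced by numerical error.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Usage: score = frechet_distance(embed(real_audio), embed(generated_audio)),
# where `embed` stands in for the chosen pretrained embedding model.
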
D. Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence is a reference-dependent metric that computes the divergence between the generated and reference audio distributions. A pretrained classifier produces class probabilities for the generated and reference samples, and the KL divergence is then calculated between these two distributions. A low KL divergence may indicate that a generated audio sample shares concepts with the given reference [212]; in music, this could mean that the created audio has similar acoustic characteristics [97]. While the FAD measure is more related to human perception [110], the KL measure reflects the broader audio concepts occurring in the sample [178].
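For a single reference/generated pair, the score reduces to the textbook divergence between the two class-probability vectors returned by the classifier (the symbols below are ours; implementations differ in the direction of the divergence and in how scores are averaged over the evaluation set):

\[
D_{\mathrm{KL}}\big(p_{\mathrm{ref}} \,\|\, p_{\mathrm{gen}}\big) = \sum_{c} p_{\mathrm{ref}}(c)\,\log\frac{p_{\mathrm{ref}}(c)}{p_{\mathrm{gen}}(c)},
\]

where $p_{\mathrm{ref}}(c)$ and $p_{\mathrm{gen}}(c)$ are the classifier's probabilities of class $c$ for the reference and the generated sample, respectively.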
VII. CONCLUSION

The development of deep learning methods has significantly changed the field of audio generation. In this work, we have presented three important parts of building a deep learning model for the task of audio generation. For audio representation, we presented two long-standing champions and a third up-and-comer. We explained five architectures and listed work that implements them in the field of audio generation. Finally, we presented the four evaluation metrics most commonly used in the works we examined. While the first three architectures seem to have lost importance in recent years, transformer and diffusion models appear to have taken their place. This could be due to the popularization of large language models such as ChatGPT or, in the case of diffusion models, of diffusion-based text-to-image generators. With ever-increasing computing power and the availability of large databases, it looks like the age of deep learning has only just begun. Just as deep learning has allowed us to move from domain-dependent features and methods to more universal solutions, more recent work has attempted to move from a single task or purpose to multi-task or even multi-modality models.

REFERENCES

[1] Geoffroy Peeters and Gaël Richard. "Deep Learning for Audio and Music". In: Multi-Faceted Deep Learning. Ed. by Jenny Benois-Pineau and Akka Zemmari. Cham: Springer International Publishing, 2021, pp. 231–266. ISBN: 978-3-030-74477-9. DOI: 10.1007/978-3-030-74478-6_10.
[2] Yuanjun Zhao, Xianjun Xia, and Roberto Togneri. "Applications of Deep Learning to Audio Generation". In: IEEE Circuits and Systems Magazine 19.4 (2019), pp. 19–38. ISSN: 1558-0830. DOI: 10.1109/MCAS.2019.2945210.
[3] Hendrik Purwins et al. "Deep Learning for Audio Signal Processing". In: IEEE Journal of Selected Topics in Signal Processing 13.2 (May 2019), pp. 206–219. ISSN: 1941-0484. DOI: 10.1109/JSTSP.2019.2908700.
[4] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques for Music Generation – A Survey. Aug. 7, 2019. DOI: 10.48550/arXiv.1709.01620. arXiv: 1709.01620 [cs]. URL: http://arxiv.org/abs/1709.01620. preprint.
[5] M. Huzaifah and L. Wyse. Deep Generative Models for Musical Audio Synthesis. Nov. 25, 2020. arXiv: 2006.06426 [cs, eess, stat]. URL: http://arxiv.org/abs/2006.06426. preprint.
[6] Xu Tan et al. A Survey on Neural Speech Synthesis. July 23, 2021. DOI: 10.48550/arXiv.2106.15561. arXiv: 2106.15561 [cs, eess]. URL: http://arxiv.org/abs/2106.15561. preprint.
[7] Zhaofeng Shi. A Survey on Audio Synthesis and Audio-Visual Multimodal Processing. Aug. 1, 2021. DOI: 10.48550/arXiv.2108.00443. arXiv: 2108.00443 [cs, eess]. URL: http://arxiv.org/abs/2108.00443. preprint.
[8] Anastasia Natsiou and Seán O'Leary. "Audio Representations for Deep Learning in Sound Synthesis: A Review". In: 2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA). Nov. 2021, pp. 1–8. DOI: 10.1109/AICCSA53542.2021.9686838.
[9] Siddique Latif et al. Transformers in Speech Processing: A Survey. Mar. 21, 2023. DOI: 10.48550/arXiv.2303.11607. arXiv: 2303.11607 [cs, eess]. URL: http://arxiv.org/abs/2303.11607. preprint.
[10] Chenshuang Zhang et al. A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI. Apr. 2, 2023. DOI: 10.48550/arXiv.2303.13336. arXiv: 2303.13336 [cs, eess]. URL: http://arxiv.org/abs/2303.13336. preprint.
[11] Ambuj Mehrish et al. "A Review of Deep Learning Techniques for Speech Processing". In: Information Fusion 99 (Nov. 1, 2023), p. 101869. ISSN: 1566-2535. DOI: 10.1016/j.inffus.2023.101869.
[12] Zhen-Hua Ling et al. "Deep Learning for Acoustic Modeling in Parametric Speech Generation: A Systematic Review of Existing Techniques and Future Trends". In: IEEE Signal Processing Magazine 32.3 (May 2015), pp. 35–52. ISSN: 1558-0792. DOI: 10.1109/MSP.2014.2359987.
[13] Jacob Benesty, M. Mohan Sondhi, and Yiteng Arden Huang, eds. Springer Handbook of Speech Processing. Springer Handbooks. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. ISBN: 978-3-540-49127-9. DOI: 10.1007/978-3-540-49127-9.
[14] Tomoki Toda and Keiichi Tokuda. "A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis". In: IEICE TRANSACTIONS on Information and Systems E90-D.5 (May 1, 2007), pp. 816–824. ISSN: 1745-1361, 0916-8532.
[15] Paul Taylor. Text-to-Speech Synthesis. Cambridge: Cambridge University Press, 2009. ISBN: 978-0-521-89927-7. DOI: 10.1017/CBO9780511816338.
[16] A. M. Liberman et al. "Minimal Rules for Synthesizing Speech". In: The Journal of the Acoustical Society of America 31.11 (Nov. 1, 1959), pp. 1490–1499. ISSN: 0001-4966. DOI: 10.1121/1.1907654.
[17] J. N. Holmes, Ignatius G. Mattingly, and J. N. Shearme. "Speech Synthesis by Rule". In: Language and Speech 7.3 (July 1, 1964), pp. 127–143. ISSN: 0023-8309. DOI: 10.1177/002383096400700301.
[18] C. Coker, N. Umeda, and C. Browman. "Automatic Synthesis from Ordinary English Text". In: IEEE Transactions on Audio and Electroacoustics 21.3 (June 1973), pp. 293–298. ISSN: 1558-2582. DOI: 10.1109/TAU.1973.1162458.
[19] D. Klatt. “Structure of a Phonological Rule Component for a Synthesis-by-Rule Program”. and Signal Processing (ICASSP). 2015 IEEE International Conference on Acoustics, Speech
In: IEEE Transactions on Acoustics, Speech, and Signal Processing 24.5 (Oct. 1976), and Signal Processing (ICASSP). Apr. 2015, pp. 4225–4229. DOI: 10.1109/ICASSP.2015.
pp. 391–398. ISSN: 0096-3518. DOI: 10.1109/TASSP.1976.1162847. 7178767.
[20] N. Dixon and H. Maxey. “Terminal Analog Synthesis of Continuous Speech Using the Di- [45] Kaisheng Yao and Geoffrey Zweig. Sequence-to-Sequence Neural Net Models for
phone Method of Segment Assembly”. In: IEEE Transactions on Audio and Electroacoustics Grapheme-to-Phoneme Conversion. Aug. 20, 2015. DOI: 10 . 48550 / arXiv . 1506 . 00196.
16.1 (Mar. 1968), pp. 40–50. ISSN: 0018-9278. DOI: 10.1109/TAU.1968.1161948. arXiv: 1506.00196 [cs]. URL: http://arxiv.org/abs/1506.00196. preprint.
[21] Y. Sagisaka. “Speech Synthesis by Rule Using an Optimal Selection of Non-Uniform [46] S. Imai. “Cepstral Analysis Synthesis on the Mel Frequency Scale”. In: ICASSP ’83. IEEE
Synthesis Units”. In: ICASSP-88., International Conference on Acoustics, Speech, and International Conference on Acoustics, Speech, and Signal Processing. ICASSP ’83. IEEE
Signal Processing. IEEE Computer Society, Jan. 1, 1988, pp. 679, 680, 681, 682–679, 680, International Conference on Acoustics, Speech, and Signal Processing. Vol. 8. Apr. 1983,
681, 682. DOI: 10.1109/ICASSP.1988.196677. pp. 93–96. DOI: 10.1109/ICASSP.1983.1172250.
[22] A.J. Hunt and A.W. Black. “Unit Selection in a Concatenative Speech Synthesis System [47] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain de Cheveigné. “Restructuring Speech
Using a Large Speech Database”. In: 1996 IEEE International Conference on Acoustics, Representations Using a Pitch-Adaptive Time–Frequency Smoothing and an Instantaneous-
Speech, and Signal Processing Conference Proceedings. 1996 IEEE International Confer- Frequency-Based F0 Extraction: Possible Role of a Repetitive Structure in Sounds1”. In:
ence on Acoustics, Speech, and Signal Processing Conference Proceedings. Vol. 1. May Speech Communication 27.3 (Apr. 1, 1999), pp. 187–207. ISSN: 0167-6393. DOI: 10.1016/
1996, 373–376 vol. 1. DOI: 10.1109/ICASSP.1996.541110. S0167-6393(98)00085-5.
[23] Heiga Zen, Keiichi Tokuda, and Alan W. Black. “Statistical Parametric Speech Synthesis”. [48] Yannis Agiomyrgiannakis. “Vocaine the Vocoder and Applications in Speech Synthesis”. In:
In: Speech Communication 51.11 (Nov. 1, 2009), pp. 1039–1064. ISSN: 0167-6393. DOI: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
10.1016/j.specom.2009.04.004. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] Kazuhiro Nakamura et al. “Integration of Spectral Feature Extraction and Modeling for Apr. 2015, pp. 4230–4234. DOI: 10.1109/ICASSP.2015.7178768.
HMM-Based Speech Synthesis”. In: IEICE Transactions on Information and Systems [49] Victor Lavrenko and Jeremy Pickens. “Polyphonic Music Modeling with Random Fields”.
E97.D.6 (2014), pp. 1438–1448. ISSN: 0916-8532, 1745-1361. DOI: 10.1587/transinf.E97. In: Proceedings of the Eleventh ACM International Conference on Multimedia. MULTI-
D.1438. MEDIA ’03. New York, NY, USA: Association for Computing Machinery, Nov. 2, 2003,
[25] Takayoshi Yoshimura et al. “Simultaneous Modeling of Spectrum, Pitch and Duration in pp. 120–129. ISBN: 978-1-58113-722-4. DOI: 10.1145/957013.957041.
HMM-based Speech Synthesis”. In: 6th European Conference on Speech Communication [50] J. D. Fernandez and F. Vico. “AI Methods in Algorithmic Composition: A Comprehensive
and Technology (Eurospeech 1999). 6th European Conference on Speech Communication Survey”. In: Journal of Artificial Intelligence Research 48 (Nov. 17, 2013), pp. 513–582.
and Technology (Eurospeech 1999). ISCA, Sept. 5, 1999, pp. 2347–2350. DOI: 10.21437/ ISSN : 1076-9757. DOI : 10.1613/jair.3908.
Eurospeech.1999-513. [51] Jeff Pressing. “Nonlinear Maps as Generators of Musical Design”. In: Computer Music
[26] Keiichi Tokuda, Heiga Zen, and Alan W. Black. “An HMM-based Speech Synthesis System Journal 12.2 (1988), pp. 35–46. ISSN: 0148-9267. DOI: 10.2307/3679940. JSTOR: 3679940.
Applied to English”. In: Proc. IEEE Workshop on Speech Synthesis, 2002 (2002). [52] Charles Dodge and Thomas A. Jerse. Computer Music: Synthesis, Composition, and Perfor-
[27] Keiichi Tokuda et al. “Speech Synthesis Based on Hidden Markov Models”. In: Proceedings mance. 1st print. London: Macmillan, 1985. 383 pp. ISBN: 978-0-02-873100-1.
of the IEEE 101.5 (May 2013), pp. 1234–1252. ISSN: 1558-2256. DOI: 10 . 1109 / JPROC . [53] Francesco Giomi and Marco Ligabue. “Computational Generation and Study of Jazz
2013.2251852. Music”. In: Interface 20.1 (Jan. 1, 1991), pp. 47–64. ISSN: 0303-3902. DOI: 10 . 1080 /
[28] Heiga Zen, Andrew Senior, and Mike Schuster. “Statistical Parametric Speech Synthesis 09298219108570576.
Using Deep Neural Networks”. In: 2013 IEEE International Conference on Acoustics, [54] Charles Ames and Michael Domino. “Cybernetic Composer: An Overview”. In: Under-
Speech and Signal Processing. 2013 IEEE International Conference on Acoustics, Speech standing Music with AI: Perspectives on Music Cognition. Cambridge, MA, USA: MIT
and Signal Processing. May 2013, pp. 7962–7966. DOI: 10.1109/ICASSP.2013.6639215. Press, Aug. 17, 1992, pp. 186–205. ISBN: 978-0-262-52170-3.
[29] J. Burniston and K.M. Curtis. “A Hybrid Neural Network/Rule Based Architecture for [55] John Biles. “GenJam: A Genetic Algorithm for Generating Jazz Solos”. In: ICMC. Vol. 94.
Diphone Speech Synthesis”. In: Proceedings of ICSIPNN ’94. International Conference on Ann Arbor, MI, 1994, pp. 131–137.
Speech, Image Processing and Neural Networks. Proceedings of ICSIPNN ’94. International [56] Artemis Moroni et al. “Vox Populi: An Interactive Evolutionary System for Algorithmic
Conference on Speech, Image Processing and Neural Networks. Apr. 1994, 323–326 vol.1. Music Composition”. In: Leonardo Music Journal 10 (Dec. 1, 2000), pp. 49–54. ISSN: 0961-
DOI : 10.1109/SIPNN.1994.344901. 1215. DOI: 10.1162/096112100570602.
[30] Zhen-Hua Ling, Li Deng, and Dong Yu. “Modeling Spectral Envelopes Using Restricted [57] C.-C.J. Chen and R. Miikkulainen. “Creating Melodies with Evolving Recurrent Neural
Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthe- Networks”. In: IJCNN’01. International Joint Conference on Neural Networks. Proceedings
sis”. In: IEEE Transactions on Audio, Speech, and Language Processing 21.10 (Oct. 2013), (Cat. No.01CH37222). IJCNN’01. International Joint Conference on Neural Networks.
pp. 2129–2139. ISSN: 1558-7924. DOI: 10.1109/TASL.2013.2269291. Proceedings (Cat. No.01CH37222). Vol. 3. July 2001, 2241–2246 vol.3. DOI: 10 . 1109 /
[31] T. Weijters and J. Thole. “Speech Synthesis with Artificial Neural Networks”. In: IEEE IJCNN.2001.938515.
International Conference on Neural Networks. IEEE International Conference on Neural [58] Alfonso Ortega de la Puente, Rafael Sánchez Alfonso, and Manuel Alfonseca Moreno.
Networks. Mar. 1993, 1764–1769 vol.3. DOI: 10.1109/ICNN.1993.298824. “Automatic Composition of Music by Means of Grammatical Evolution”. In: Proceedings of
[32] Heng Lu, Simon King, and Oliver Watts. “Combining a Vector Space Representation of the 2002 Conference on APL: Array Processing Languages: Lore, Problems, and Applica-
Linguistic Context with a Deep Neural Network for Text-To-Speech Synthesis”. In: 8th ISCA tions. APL ’02. New York, NY, USA: Association for Computing Machinery, June 1, 2002,
Speech Synthesis Workshop. 8th ISCA Speech Synthesis Workshop. Aug. 1, 2013, pp. 261– pp. 148–155. ISBN: 978-1-58113-577-0. DOI: 10.1145/602231.602249.
265. [59] MICHAEL C. MOZER. “Neural Network Music Composition by Prediction: Exploring the
[33] Yuchen Fan et al. “Multi-Speaker Modeling and Speaker Adaptation for DNN-based TTS Benefits of Psychoacoustic Constraints and Multi-scale Processing”. In: Connection Science
Synthesis”. In: 2015 IEEE International Conference on Acoustics, Speech and Signal 6.2-3 (Jan. 1, 1994), pp. 247–280. ISSN: 0954-0091. DOI: 10.1080/09540099408915726.
Processing (ICASSP). 2015 IEEE International Conference on Acoustics, Speech and Signal [60] Douglas Eck and Juergen Schmidhuber. “A First Look at Music Composition Using Lstm
Processing (ICASSP). Apr. 2015, pp. 4475–4479. DOI: 10.1109/ICASSP.2015.7178817. Recurrent Neural Networks”. In: Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale
[34] Keiichi Tokuday and Heiga Zen. “Directly Modeling Speech Waveforms by Neural Net- 103.4 (2002), pp. 48–56.
works for Statistical Parametric Speech Synthesis”. In: 2015 IEEE International Conference [61] Hang Chu, Raquel Urtasun, and Sanja Fidler. Song From PI: A Musically Plausible Network
on Acoustics, Speech and Signal Processing (ICASSP). 2015 IEEE International Conference for Pop Music Generation. Nov. 10, 2016. DOI: 10.48550/arXiv.1611.03477. arXiv: 1611.
on Acoustics, Speech and Signal Processing (ICASSP). Apr. 2015, pp. 4215–4219. DOI: 03477 [cs]. URL: http://arxiv.org/abs/1611.03477. preprint.
10.1109/ICASSP.2015.7178765. [62] Allen Huang and Raymond Wu. Deep Learning for Music. June 15, 2016. DOI: 10.48550/
[35] Orhan Karaali, Gerald Corrigan, and Ira Gerson. Speech Synthesis with Neural Networks. arXiv. 1606 . 04930. arXiv: 1606 . 04930 [cs]. URL: http : / / arxiv. org / abs / 1606 . 04930.
Nov. 24, 1998. arXiv: cs/9811031. URL: http://arxiv.org/abs/cs/9811031. preprint. preprint.
[36] Orhan Karaali et al. Text-To-Speech Conversion with Neural Networks: A Recurrent TDNN [63] Olof Mogren. C-RNN-GAN: Continuous Recurrent Neural Networks with Adversarial
Approach. Nov. 24, 1998. DOI: 10 . 48550 / arXiv . cs / 9811032. arXiv: cs / 9811032. URL: Training. Nov. 29, 2016. DOI: 10 . 48550 / arXiv. 1611 . 09904. arXiv: 1611 . 09904 [cs].
http://arxiv.org/abs/cs/9811032. preprint. URL: http://arxiv.org/abs/1611.09904. preprint.
[37] Yuchen Fan et al. “TTS Synthesis with Bidirectional LSTM Based Recurrent Neural [64] Kevin Jones. “Compositional Applications of Stochastic Processes”. In: Computer Music
Networks”. In: Interspeech 2014. Interspeech 2014. ISCA, Sept. 14, 2014, pp. 1964–1968. Journal 5.2 (1981), pp. 45–61. ISSN: 0148-9267. DOI: 10.2307/3679879. JSTOR: 3679879.
DOI : 10.21437/Interspeech.2014-443. [65] Kohonen. “A Self-Learning Musical Grammar, or ’Associative Memory of the Second
[38] Heiga Zen and Haşim Sak. “Unidirectional Long Short-Term Memory Recurrent Neural Kind’”. In: International 1989 Joint Conference on Neural Networks. International 1989
Network with Recurrent Output Layer for Low-Latency Speech Synthesis”. In: 2015 IEEE Joint Conference on Neural Networks. 1989, 1–5 vol.1. DOI: 10.1109/IJCNN.1989.118552.
International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015 IEEE [66] Teuvo Kohonen et al. “A Nonheuristic Automatic Composing Method”. In: Music and
International Conference on Acoustics, Speech and Signal Processing (ICASSP). Apr. 2015, Connectionism. Ed. by Peter M. Todd and Gareth Loy. The MIT Press, Oct. 9, 1991,
pp. 4470–4474. DOI: 10.1109/ICASSP.2015.7178816. pp. 229–242. ISBN: 978-0-262-28503-2. DOI: 10.7551/mitpress/4804.003.0020.
[39] Bo Li and Heiga Zen. “Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN [67] Eduardo Reck Miranda. “Granular Synthesis of Sounds by Means of a Cellular Automaton”.
Based Statistical Parametric Speech Synthesis”. In: Interspeech 2016. Interspeech 2016. In: Leonardo 28.4 (1995), pp. 297–300. ISSN: 0024-094X. DOI: 10.2307/1576193. JSTOR:
ISCA, Sept. 8, 2016, pp. 2468–2472. DOI: 10.21437/Interspeech.2016-172. 1576193.
[40] Keiichi Tokuda and Heiga Zen. “Directly Modeling Voiced and Unvoiced Components [68] Peter Worth and Susan Stepney. “Growing Music: Musical Interpretations of L-Systems”.
in Speech Waveforms by Neural Networks”. In: 2016 IEEE International Conference on In: Applications of Evolutionary Computing. Ed. by Franz Rothlauf et al. Berlin, Heidelberg:
Acoustics, Speech and Signal Processing (ICASSP). 2016 IEEE International Conference Springer, 2005, pp. 545–550. ISBN: 978-3-540-32003-6. DOI: 10.1007/978- 3- 540- 32003-
on Acoustics, Speech and Signal Processing (ICASSP). Mar. 2016, pp. 5640–5644. DOI: 6_56.
10.1109/ICASSP.2016.7472757. [69] Michael Chan, John Potter, and Emery Schubert. “Improving Algorithmic Music Compo-
[41] Wenfu Wang, Shuang Xu, and Bo Xu. “Gating Recurrent Mixture Density Networks for sition with Machine Learning”. In: 9th International Conference on Music Perception and
Acoustic Modeling in Statistical Parametric Speech Synthesis”. In: 2016 IEEE International Cognition. Citeseer, 2006, pp. 1848–1854.
Conference on Acoustics, Speech and Signal Processing (ICASSP). 2016 IEEE International [70] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling Temporal
Conference on Acoustics, Speech and Signal Processing (ICASSP). Mar. 2016, pp. 5520– Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Genera-
5524. DOI: 10.1109/ICASSP.2016.7472733. tion and Transcription. June 27, 2012. DOI: 10.48550/arXiv.1206.6392. arXiv: 1206.6392
[42] Wenfu Wang, Shuang Xu, and Bo Xu. “First Step Towards End-to-End Parametric TTS [cs, stat]. URL: http://arxiv.org/abs/1206.6392. preprint.
Synthesis: Generating Spectral Parameters with Neural Attention”. In: Interspeech 2016. [71] H. S. Black and J. O. Edson. “Pulse Code Modulation”. In: Transactions of the American
Interspeech 2016. ISCA, Sept. 8, 2016, pp. 2243–2247. DOI: 10.21437/Interspeech.2016- Institute of Electrical Engineers 66.1 (Jan. 1947), pp. 895–899. ISSN: 2330-9431. DOI: 10.
134. 1109/T-AIEE.1947.5059525.
[43] Heiga Zen and Andrew Senior. “Deep Mixture Density Networks for Acoustic Modeling [72] Prateek Verma and Chris Chafe. “A Generative Model for Raw Audio Using Transformer
in Statistical Parametric Speech Synthesis”. In: 2014 IEEE International Conference on Architectures”. In: 2021 24th International Conference on Digital Audio Effects (DAFx).
Acoustics, Speech and Signal Processing (ICASSP). 2014 IEEE International Conference 2021 24th International Conference on Digital Audio Effects (DAFx). Sept. 2021, pp. 230–
on Acoustics, Speech and Signal Processing (ICASSP). May 2014, pp. 3844–3848. DOI: 237. DOI: 10.23919/DAFx51585.2021.9768298.
10.1109/ICASSP.2014.6854321. [73] Chengyi Wang et al. Neural Codec Language Models Are Zero-Shot Text to Speech Synthe-
[44] Kanishka Rao et al. “Grapheme-to-Phoneme Conversion Using Long Short-Term Memory sizers. Jan. 5, 2023. DOI: 10.48550/arXiv.2301.02111. arXiv: 2301.02111 [cs, eess].
Recurrent Neural Networks”. In: 2015 IEEE International Conference on Acoustics, Speech URL: http://arxiv.org/abs/2301.02111. preprint.
[74] Aaron van den Oord et al. WaveNet: A Generative Model for Raw Audio. Sept. 19, 2016. ence on Acoustics, Speech and Signal Processing (ICASSP). May 2019, pp. 3956–3960.
DOI : 10.48550/arXiv.1609.03499. arXiv: 1609.03499 [cs]. URL: http://arxiv.org/abs/1609. DOI : 10.1109/ICASSP.2019.8682513.
03499. preprint. [105] Jaehyeon Kim, Jungil Kong, and Juhee Son. “Conditional Variational Autoencoder with Ad-
[75] Sander Dieleman, Aaron van den Oord, and Karen Simonyan. “The Challenge of Realistic versarial Learning for End-to-End Text-to-Speech”. In: Proceedings of the 38th International
Music Generation: Modelling Raw Audio at Scale”. In: Advances in Neural Information Conference on Machine Learning. International Conference on Machine Learning. PMLR,
Processing Systems. Vol. 31. Curran Associates, Inc., 2018. July 1, 2021, pp. 5530–5540.
[76] Soroush Mehri et al. SampleRNN: An Unconditional End-to-End Neural Audio Generation [106] Adam Roberts et al. “A Hierarchical Latent Vector Model for Learning Long-Term Structure
Model. Feb. 11, 2017. DOI: 10.48550/arXiv.1612.07837. arXiv: 1612.07837 [cs]. URL: in Music”. In: Proceedings of the 35th International Conference on Machine Learning.
http://arxiv.org/abs/1612.07837. preprint. International Conference on Machine Learning. PMLR, July 3, 2018, pp. 4364–4373.
[77] Wei Ping et al. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. [107] Yixiao Zhang et al. “BUTTER: A Representation Learning Framework for Bi-directional
Feb. 22, 2018. DOI: 10.48550/arXiv.1710.07654. arXiv: 1710.07654 [cs, eess]. URL: Music-Sentence Retrieval and Generation”. In: Proceedings of the 1st Workshop on NLP
http://arxiv.org/abs/1710.07654. preprint. for Music and Audio (NLP4MusA). NLP4MusA 2020. Ed. by Sergio Oramas et al. Online:
[78] Paolo Prandoni and Martin Vetterli. Signal Processing for Communications. EPFL Press, Association for Computational Linguistics, 2020, pp. 54–58.
Aug. 19, 2008. 400 pp. ISBN: 978-1-4200-7046-0. Google Books: tKUYSrBM4gYC. [108] Hao Hao Tan and Dorien Herremans. Music FaderNets: Controllable Music Generation
[79] Javier Nistal, Stefan Lattner, and Gaël Richard. “Comparing Representations for Audio Syn- Based On High-Level Features via Low-Level Feature Modelling. July 29, 2020. DOI: 10 .
thesis Using Generative Adversarial Networks”. In: 2020 28th European Signal Processing 48550/arXiv.2007.15474. arXiv: 2007.15474 [cs, eess, stat]. URL: http://arxiv.org/
Conference (EUSIPCO). 2020 28th European Signal Processing Conference (EUSIPCO). abs/2007.15474. preprint.
Jan. 2021, pp. 161–165. DOI: 10.23919/Eusipco47968.2020.9287799. [109] Antoine Caillon and Philippe Esling. RAVE: A Variational Autoencoder for Fast and High-
[80] D. Griffin and Jae Lim. “Signal Estimation from Modified Short-Time Fourier Transform”. Quality Neural Audio Synthesis. Dec. 15, 2021. DOI: 10.48550/arXiv.2111.05011. arXiv:
In: IEEE Transactions on Acoustics, Speech, and Signal Processing 32.2 (Apr. 1984), 2111.05011 [cs, eess]. URL: http://arxiv.org/abs/2111.05011. preprint.
pp. 236–243. ISSN: 0096-3518. DOI: 10.1109/TASSP.1984.1164317. [110] Deepanway Ghosal et al. Text-to-Audio Generation Using Instruction-Tuned LLM and Latent
[81] Rémi Decorsière et al. “Inversion of Auditory Spectrograms, Traditional Spectrograms, Diffusion Model. May 29, 2023. DOI: 10.48550/arXiv.2304.13731. arXiv: 2304.13731 [cs,
and Other Envelope Representations”. In: IEEE/ACM Transactions on Audio, Speech, and eess]. URL: http://arxiv.org/abs/2304.13731. preprint.
Language Processing 23.1 (Jan. 2015), pp. 46–56. ISSN: 2329-9304. DOI: 10.1109/TASLP. [111] Jiawei Huang et al. Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation.
2014.2367821. May 29, 2023. DOI: 10 . 48550 / arXiv . 2305 . 18474. arXiv: 2305 . 18474 [cs, eess].
[82] Gerald T. Beauregard, Mithila Harish, and Lonce Wyse. “Single Pass Spectrogram Inver- URL: http://arxiv.org/abs/2305.18474. preprint.
sion”. In: 2015 IEEE International Conference on Digital Signal Processing (DSP). 2015 [112] Kundan Kumar et al. “MelGAN: Generative Adversarial Networks for Conditional Wave-
IEEE International Conference on Digital Signal Processing (DSP). July 2015, pp. 427–431. form Synthesis”. In: Advances in Neural Information Processing Systems. Vol. 32. Curran
DOI : 10.1109/ICDSP.2015.7251907. Associates, Inc., 2019.
[83] Zdeněk Průša, Peter Balazs, and Peter Lempel Søndergaard. “A Noniterative Method for [113] Andros Tjandra et al. VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec
Reconstruction of Phase From STFT Magnitude”. In: IEEE/ACM Transactions on Audio, Inverter for Zerospeech Challenge 2019. May 29, 2019. DOI: 10.48550/arXiv.1905.11449.
Speech, and Language Processing 25.5 (May 2017), pp. 1154–1164. ISSN: 2329-9304. DOI: arXiv: 1905.11449 [cs, eess]. URL: http://arxiv.org/abs/1905.11449. preprint.
10.1109/TASLP.2017.2678166. [114] Prafulla Dhariwal et al. Jukebox: A Generative Model for Music. Apr. 30, 2020. DOI: 10 .
[84] Yuxuan Wang et al. Tacotron: Towards End-to-End Speech Synthesis. Apr. 6, 2017. DOI: 10. 48550/arXiv.2005.00341. arXiv: 2005.00341 [cs, eess, stat]. URL: http://arxiv.org/
48550/arXiv.1703.10135. arXiv: 1703.10135 [cs]. URL: http://arxiv.org/abs/1703.10135. abs/2005.00341. preprint.
preprint. [115] Tomoki Hayashi and Shinji Watanabe. DiscreTalk: Text-to-Speech as a Machine Translation
[85] Jonathan Shen et al. “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Problem. May 11, 2020. DOI: 10 . 48550 / arXiv . 2005 . 05525. arXiv: 2005 . 05525 [cs,
Predictions”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal eess]. URL: http://arxiv.org/abs/2005.05525. preprint.
Processing (ICASSP). 2018 IEEE International Conference on Acoustics, Speech and Signal [116] Dimitri von Rütte et al. “FIGARO: Controllable Music Generation Using Learned and Expert
Processing (ICASSP). Apr. 2018, pp. 4779–4783. DOI: 10.1109/ICASSP.2018.8461368. Features”. In: The Eleventh International Conference on Learning Representations. Sept. 29,
[86] Yi Ren et al. “FastSpeech: Fast, Robust and Controllable Text to Speech”. In: Advances in 2022.
Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., 2019. [117] Dongchao Yang et al. “Diffsound: Discrete Diffusion Model for Text-to-Sound Genera-
[87] Yi Ren et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. Aug. 7, 2022. tion”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023),
DOI : 10.48550/arXiv.2006.04558. arXiv: 2006.04558 [cs, eess]. URL : http://arxiv.org/ pp. 1720–1733. ISSN: 2329-9304. DOI: 10.1109/TASLP.2023.3268730.
abs/2006.04558. preprint. [118] Roy Sheffer and Yossi Adi. “I Hear Your True Colors: Image Guided Audio Generation”.
[88] Keunwoo Choi et al. “A Comparison of Audio Signal Preprocessing Methods for Deep In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal
Neural Networks on Music Tagging”. In: 2018 26th European Signal Processing Conference Processing (ICASSP). ICASSP 2023 - 2023 IEEE International Conference on Acoustics,
(EUSIPCO). 2018 26th European Signal Processing Conference (EUSIPCO). Sept. 2018, Speech and Signal Processing (ICASSP). June 2023, pp. 1–5. DOI: 10.1109/ICASSP49357.
pp. 1870–1874. DOI: 10.23919/EUSIPCO.2018.8553106. 2023.10096023.
[89] Alexandre Défossez et al. High Fidelity Neural Audio Compression. Oct. 24, 2022. DOI: [119] Ye Zhu et al. “Quantized GAN for Complex Music Generation from Dance Videos”.
10 . 48550 / arXiv . 2210 . 13438. arXiv: 2210 . 13438 [cs, eess, stat]. URL: http : In: Computer Vision – ECCV 2022. Ed. by Shai Avidan et al. Cham: Springer Nature
//arxiv.org/abs/2210.13438. preprint. Switzerland, 2022, pp. 182–199. ISBN: 978-3-031-19836-6. DOI: 10 . 1007 / 978 - 3 - 031 -
[90] Neil Zeghidour et al. “SoundStream: An End-to-End Neural Audio Codec”. In: IEEE/ACM 19836-6_11.
Transactions on Audio, Speech, and Language Processing 30 (2022), pp. 495–507. ISSN: [120] Junyi Ao et al. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Lan-
2329-9304. DOI: 10.1109/TASLP.2021.3129994. guage Processing. May 24, 2022. DOI: 10 . 48550 / arXiv. 2110 . 07205. arXiv: 2110 . 07205
[91] Yang Ai et al. APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum [cs, eess]. URL: http://arxiv.org/abs/2110.07205. preprint.
Encoding and Decoding. Feb. 16, 2024. arXiv: 2402 . 10533 [cs, eess]. URL: http : / / [121] Chenpeng Du et al. VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ
arxiv.org/abs/2402.10533. preprint. Acoustic Feature. June 30, 2022. DOI: 10 . 48550 / arXiv. 2204 . 00768. arXiv: 2204 . 00768
[92] Yi-Chiao Wu et al. “AudioDec: An Open-source Streaming High-fidelity Neural Audio [cs, eess]. URL: http://arxiv.org/abs/2204.00768. preprint.
Codec”. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and [122] Yanqing Liu et al. DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-
Signal Processing (ICASSP). June 4, 2023, pp. 1–5. DOI: 10 . 1109 / ICASSP49357 . 2023 . Quantized Auto-Encoders. July 11, 2022. DOI: 10.48550/arXiv.2207.04646. arXiv: 2207.
10096509. arXiv: 2305.16608 [eess]. 04646 [cs, eess]. URL: http://arxiv.org/abs/2207.04646. preprint.
[93] Kai Shen et al. NaturalSpeech 2: Latent Diffusion Models Are Natural and Zero-Shot Speech [123] Norberto Torres-Reyes and Shahram Latifi. “Audio Enhancement and Synthesis Using
and Singing Synthesizers. May 30, 2023. DOI: 10.48550/arXiv.2304.09116. arXiv: 2304. Generative Adversarial Networks: A Survey”. In: International Journal of Computer Appli-
09116 [cs, eess]. URL: http://arxiv.org/abs/2304.09116. preprint. cations 182.35 (Jan. 17, 2019), pp. 27–31. ISSN: 09758887. DOI: 10.5120/ijca2019918334.
[94] Dongchao Yang et al. HiFi-Codec: Group-residual Vector Quantization for High Fidelity [124] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A Convolutional Generative
Audio Codec. May 7, 2023. DOI: 10.48550/arXiv.2305.02765. arXiv: 2305.02765 [cs, Adversarial Network for Symbolic-domain Music Generation. July 18, 2017. DOI: 10.48550/
eess]. URL: http://arxiv.org/abs/2305.02765. preprint. arXiv. 1703 . 10847. arXiv: 1703 . 10847 [cs]. URL: http : / / arxiv. org / abs / 1703 . 10847.
[95] Tianrui Wang et al. VioLA: Unified Codec Language Models for Speech Recognition, preprint.
Synthesis, and Translation. May 25, 2023. DOI: 10 . 48550 / arXiv . 2305 . 16107. arXiv: [125] Daniel Michelsanti and Zheng-Hua Tan. “Conditional Generative Adversarial Networks
2305.16107 [cs, eess]. URL: http://arxiv.org/abs/2305.16107. preprint. for Speech Enhancement and Noise-Robust Speaker Verification”. In: Interspeech 2017.
[96] Zalán Borsos et al. “AudioLM: A Language Modeling Approach to Audio Generation”. In: Aug. 20, 2017, pp. 2008–2012. DOI: 10.21437/Interspeech.2017- 1620. arXiv: 1709.01703
IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), pp. 2523– [cs, eess, stat].
2533. ISSN: 2329-9304. DOI: 10.1109/TASLP.2023.3288409. [126] Lele Chen et al. “Deep Cross-Modal Audio-Visual Generation”. In: Proceedings of the on
[97] Andrea Agostinelli et al. MusicLM: Generating Music From Text. Jan. 26, 2023. DOI: 10 . Thematic Workshops of ACM Multimedia 2017. Thematic Workshops ’17. New York, NY,
48550/arXiv.2301.11325. arXiv: 2301.11325 [cs, eess]. URL: http://arxiv.org/abs/ USA: Association for Computing Machinery, Oct. 23, 2017, pp. 349–357. ISBN: 978-1-
2301.11325. preprint. 4503-5416-5. DOI: 10.1145/3126686.3126723.
[98] Chris Donahue et al. SingSong: Generating Musical Accompaniments from Singing. Jan. 29, [127] Paarth Neekhara et al. Expediting TTS Synthesis with Adversarial Vocoding. July 25, 2019.
2023. DOI: 10 . 48550 / arXiv. 2301 . 12662. arXiv: 2301 . 12662 [cs, eess]. URL: http : DOI : 10.48550/arXiv.1904.07944. arXiv: 1904.07944 [cs, eess]. URL: http://arxiv.org/
//arxiv.org/abs/2301.12662. preprint. abs/1904.07944. preprint.
[99] Zalán Borsos et al. SoundStorm: Efficient Parallel Audio Generation. May 16, 2023. DOI: [128] Shiguang Liu, Sijia Li, and Haonan Cheng. “Towards an End-to-End Visual-to-Raw-
10.48550/arXiv.2305.09636. arXiv: 2305.09636 [cs, eess]. URL: http://arxiv.org/abs/ Audio Generation With GAN”. In: IEEE Transactions on Circuits and Systems for Video
2305.09636. preprint. Technology 32.3 (Mar. 2022), pp. 1299–1312. ISSN: 1558-2205. DOI: 10 . 1109 / TCSVT.
[100] Ziqiang Zhang et al. Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural 2021.3079897.
Codec Language Modeling. Mar. 7, 2023. DOI: 10.48550/arXiv.2303.03926. arXiv: 2303. [129] Hao-Wen Dong et al. “MuseGAN: Multi-track Sequential Generative Adversarial Networks
03926 [cs, eess]. URL: http://arxiv.org/abs/2303.03926. preprint. for Symbolic Music Generation and Accompaniment”. In: Proceedings of the AAAI Confer-
[101] Xiaofei Wang et al. SpeechX: Neural Codec Language Model as a Versatile Speech ence on Artificial Intelligence 32.1 (1 Apr. 25, 2018). ISSN: 2374-3468. DOI: 10.1609/aaai.
Transformer. Aug. 13, 2023. DOI: 10.48550/arXiv.2308.06873. arXiv: 2308.06873 [cs, v32i1.11312.
eess]. URL: http://arxiv.org/abs/2308.06873. preprint. [130] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial Audio Synthesis. Feb. 8,
[102] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge, Mass: 2019. DOI: 10.48550/arXiv.1802.04208. arXiv: 1802.04208 [cs]. URL: http://arxiv.org/
The MIT Press, Nov. 18, 2016. 800 pp. ISBN: 978-0-262-03561-3. abs/1802.04208. preprint.
[103] Diederik P. Kingma and Max Welling. “An Introduction to Variational Autoencoders”. In: [131] Andrés Marafioti et al. “Adversarial Generation of Time-Frequency Features with Ap-
Foundations and Trends® in Machine Learning 12.4 (Nov. 27, 2019), pp. 307–392. ISSN: plication in Audio Synthesis”. In: Proceedings of the 36th International Conference on
1935-8237, 1935-8245. DOI: 10.1561/2200000056. Machine Learning. International Conference on Machine Learning. PMLR, May 24, 2019,
[104] Hu-Cheng Lee et al. “Audio Feature Generation for Missing Modality Problem in Video pp. 4352–4362.
Action Recognition”. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, [132] Gal Greshler, Tamar Shaham, and Tomer Michaeli. “Catch-A-Waveform: Learning to Gen-
Speech and Signal Processing (ICASSP). ICASSP 2019 - 2019 IEEE International Confer- erate Audio from a Single Short Example”. In: Advances in Neural Information Processing
Systems. Vol. 34. Curran Associates, Inc., 2021, pp. 20916–20928.
[133] Xudong Mao et al. Least Squares Generative Adversarial Networks. Apr. 5, 2017. DOI: 10. [162] Khalid Zaman et al. “A Survey of Audio Classification Using Deep Learning”. In: IEEE
48550/arXiv.1611.04076. arXiv: 1611.04076 [cs]. URL: http://arxiv.org/abs/1611.04076. Access 11 (2023), pp. 106620–106649. ISSN: 2169-3536. DOI: 10 . 1109 / ACCESS . 2023 .
preprint. 3318015.
[134] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: Speech Enhancement [163] Cheng-Zhi Anna Huang et al. Music Transformer. Dec. 12, 2018. DOI: 10.48550/arXiv.1809.
Generative Adversarial Network. June 9, 2017. DOI: 10.48550/arXiv.1703.09452. arXiv: 04281. arXiv: 1809.04281 [cs, eess, stat]. URL: http://arxiv.org/abs/1809.04281.
1703.09452 [cs]. URL: http://arxiv.org/abs/1703.09452. preprint. preprint.
[135] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. Probability Density Distillation with [164] Curtis Hawthorne et al. Enabling Factorized Piano Music Modeling and Generation with the
Generative Adversarial Networks for High-Quality Parallel Waveform Generation. Aug. 27, MAESTRO Dataset. Jan. 17, 2019. DOI: 10.48550/arXiv.1810.12247. arXiv: 1810.12247
2019. DOI: 10 . 48550 / arXiv. 1904 . 04472. arXiv: 1904 . 04472 [cs, eess]. URL: http : [cs, eess, stat]. URL: http://arxiv.org/abs/1810.12247. preprint.
//arxiv.org/abs/1904.04472. preprint. [165] Naihan Li et al. “Neural Speech Synthesis with Transformer Network”. In: Proceedings of
[136] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. “HiFi-GAN: Generative Adversarial Net- the AAAI Conference on Artificial Intelligence 33.01 (01 July 17, 2019), pp. 6706–6713.
works for Efficient and High Fidelity Speech Synthesis”. In: Advances in Neural Information ISSN : 2374-3468. DOI : 10.1609/aaai.v33i01.33016706.
Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 17022–17033. [166] Naihan Li et al. “RobuTrans: A Robust Transformer-Based Text-to-Speech Model”. In:
[137] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. “Parallel Wavegan: A Fast Waveform Proceedings of the AAAI Conference on Artificial Intelligence 34.05 (05 Apr. 3, 2020),
Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spec- pp. 8228–8235. ISSN: 2374-3468. DOI: 10.1609/aaai.v34i05.6337.
trogram”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech [167] Zhen Zeng et al. “Aligntts: Efficient Feed-Forward Text-to-Speech System Without Explicit
and Signal Processing (ICASSP). ICASSP 2020 - 2020 IEEE International Conference on Alignment”. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech
Acoustics, Speech and Signal Processing (ICASSP). May 2020, pp. 6199–6203. DOI: 10 . and Signal Processing (ICASSP). ICASSP 2020 - 2020 IEEE International Conference on
1109/ICASSP40776.2020.9053795. Acoustics, Speech and Signal Processing (ICASSP). May 2020, pp. 6714–6718. DOI: 10 .
[138] Ji-Hoon Kim et al. Fre-GAN: Adversarial Frequency-consistent Audio Synthesis. June 14, 1109/ICASSP40776.2020.9054119.
2021. DOI: 10 . 48550 / arXiv. 2106 . 02297. arXiv: 2106 . 02297 [cs, eess]. URL: http : [168] Jeff Ens and Philippe Pasquier. MMM : Exploring Conditional Multi-Track Music Gener-
//arxiv.org/abs/2106.02297. preprint. ation with the Transformer. Aug. 20, 2020. DOI: 10 . 48550 / arXiv . 2008 . 06048. arXiv:
[139] Wangli Hao, Zhaoxiang Zhang, and He Guan. “CMCGAN: A Uniform Framework for 2008.06048 [cs]. URL: http://arxiv.org/abs/2008.06048. preprint.
Cross-Modal Visual-Audio Mutual Generation”. In: Proceedings of the AAAI Conference [169] Dan Lim et al. JDI-T: Jointly Trained Duration Informed Transformer for Text-To-Speech
on Artificial Intelligence 32.1 (1 Apr. 27, 2018). ISSN: 2374-3468. DOI: 10.1609/aaai.v32i1. without Explicit Alignment. Oct. 4, 2020. DOI: 10.48550/arXiv.2005.07799. arXiv: 2005.
12329. 07799 [cs, eess]. URL: http://arxiv.org/abs/2005.07799. preprint.
[140] Jen-Yu Liu et al. “Unconditional Audio Generation with Generative Adversarial Networks [170] Mingjian Chen et al. AdaSpeech: Adaptive Text to Speech for Custom Voice. Mar. 1, 2021.
and Cycle Regularization”. In: Interspeech 2020. Oct. 25, 2020, pp. 1997–2001. DOI: 10 . DOI : 10.48550/arXiv.2103.00993. arXiv: 2103.00993 [cs, eess]. URL: http://arxiv.org/
21437/Interspeech.2020-1137. arXiv: 2005.08526 [cs, eess]. abs/2103.00993. preprint.
[141] Kazi Nazmul Haque, Rajib Rana, and Björn W. Schuller. “High-Fidelity Audio Generation [171] Adrian Łańcucki. “Fastpitch: Parallel Text-to-Speech with Pitch Prediction”. In: ICASSP
and Representation Learning With Guided Adversarial Autoencoder”. In: IEEE Access 8 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing
(2020), pp. 223509–223528. ISSN: 2169-3536. DOI: 10.1109/ACCESS.2020.3040797. (ICASSP). ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and
[142] Kazi Nazmul Haque et al. “Guided Generative Adversarial Neural Network for Represen- Signal Processing (ICASSP). June 2021, pp. 6588–6592. DOI: 10.1109/ICASSP39728.2021.
tation Learning and Audio Generation Using Fewer Labelled Audio Data”. In: IEEE/ACM 9413889.
Transactions on Audio, Speech, and Language Processing 29 (2021), pp. 2575–2590. ISSN: [172] Shangzhe Di et al. “Video Background Music Generation with Controllable Music Trans-
2329-9304. DOI: 10.1109/TASLP.2021.3098764. former”. In: Proceedings of the 29th ACM International Conference on Multimedia. MM ’21.
[143] Marco Tagliasacchi et al. SEANet: A Multi-modal Speech Enhancement Network. Oct. 1, New York, NY, USA: Association for Computing Machinery, Oct. 17, 2021, pp. 2037–2045.
2020. DOI: 10 . 48550 / arXiv. 2009 . 02095. arXiv: 2009 . 02095 [cs, eess]. URL: http : ISBN : 978-1-4503-8651-7. DOI : 10.1145/3474085.3475195.
//arxiv.org/abs/2009.02095. preprint. [173] Ann Lee et al. Direct Speech-to-Speech Translation with Discrete Units. Mar. 21, 2022. DOI:
[144] Jeff Donahue et al. End-to-End Adversarial Text-to-Speech. Mar. 17, 2021. DOI: 10.48550/ 10.48550/arXiv.2107.05604. arXiv: 2107.05604 [cs, eess]. URL: http://arxiv.org/abs/
arXiv. 2006 . 03575. arXiv: 2006 . 03575 [cs, eess]. URL: http : / / arxiv. org / abs / 2006 . 2107.05604. preprint.
03575. preprint. [174] Weipeng Wang et al. “CPS: Full-Song and Style-Conditioned Music Generation with Linear
[145] Marco Pasini and Jan Schlüter. Musika! Fast Infinite Waveform Music Generation. Aug. 18, Transformer”. In: 2022 IEEE International Conference on Multimedia and Expo Work-
2022. DOI: 10 . 48550 / arXiv. 2208 . 08706. arXiv: 2208 . 08706 [cs, eess]. URL: http : shops (ICMEW). 2022 IEEE International Conference on Multimedia and Expo Workshops
//arxiv.org/abs/2208.08706. preprint. (ICMEW). July 2022, pp. 1–6. DOI: 10.1109/ICMEW56448.2022.9859286.
[146] Jesse Engel et al. GANSynth: Adversarial Neural Audio Synthesis. Apr. 14, 2019. DOI: 10. [175] Xueyao Zhang et al. “Structure-Enhanced Pop Music Generation via Harmony-Aware
48550/arXiv.1902.08710. arXiv: 1902.08710 [cs, eess, stat]. URL: http://arxiv.org/ Learning”. In: Proceedings of the 30th ACM International Conference on Multimedia.
abs/1902.08710. preprint. MM ’22. New York, NY, USA: Association for Computing Machinery, Oct. 10, 2022,
[147] Mikołaj Bińkowski et al. High Fidelity Speech Synthesis with Adversarial Networks. pp. 1204–1213. ISBN: 978-1-4503-9203-7. DOI: 10.1145/3503161.3548084.
Sept. 26, 2019. DOI: 10 . 48550 / arXiv . 1909 . 11646. arXiv: 1909 . 11646 [cs, eess]. [176] Chunhui Bao and Qianru Sun. “Generating Music With Emotions”. In: IEEE Transactions
URL: http://arxiv.org/abs/1909.11646. preprint. on Multimedia 25 (2023), pp. 3602–3614. ISSN: 1941-0077. DOI: 10 . 1109 / TMM . 2022 .
[148] Peihao Chen et al. “Generating Visually Aligned Sound From Videos”. In: IEEE Transac- 3163543.
tions on Image Processing 29 (2020), pp. 8292–8302. ISSN: 1941-0042. DOI: 10.1109/TIP. [177] Eugene Kharitonov et al. Speak, Read and Prompt: High-Fidelity Text-to-Speech with
2020.3009820. Minimal Supervision. Feb. 7, 2023. DOI: 10.48550/arXiv.2302.03540. arXiv: 2302.03540
[149] Kun Su, Xiulong Liu, and Eli Shlizerman. “Audeo: Audio Generation for a Silent Perfor- [cs, eess]. URL: http://arxiv.org/abs/2302.03540. preprint.
mance Video”. In: Advances in Neural Information Processing Systems. Vol. 33. Curran [178] Felix Kreuk et al. AudioGen: Textually Guided Audio Generation. Mar. 5, 2023. DOI: 10 .
Associates, Inc., 2020, pp. 3325–3337. 48550/arXiv.2209.15352. arXiv: 2209.15352 [cs, eess]. URL: http://arxiv.org/abs/
[150] Geng Yang et al. “Multi-Band Melgan: Faster Waveform Generation For High-Quality Text- 2209.15352. preprint.
To-Speech”. In: 2021 IEEE Spoken Language Technology Workshop (SLT). 2021 IEEE [179] Tu Anh Nguyen et al. “Generative Spoken Dialogue Language Modeling”. In: Transactions
Spoken Language Technology Workshop (SLT). Jan. 2021, pp. 492–498. DOI: 10 . 1109 / of the Association for Computational Linguistics 11 (Mar. 14, 2023), pp. 250–266. ISSN:
SLT48900.2021.9383551. 2307-387X. DOI: 10.1162/tacl_a_00545.
[151] Rongjie Huang et al. “Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large- [180] Peiling Lu et al. MuseCoco: Generating Symbolic Music from Text. May 31, 2023. DOI:
Scale Corpus”. In: Proceedings of the 29th ACM International Conference on Multimedia. 10.48550/arXiv.2306.00110. arXiv: 2306.00110 [cs, eess]. URL: http://arxiv.org/abs/
MM ’21. New York, NY, USA: Association for Computing Machinery, Oct. 17, 2021, 2306.00110. preprint.
pp. 3945–3954. ISBN: 978-1-4503-8651-7. DOI: 10.1145/3474085.3475437. [181] Paul K. Rubenstein et al. AudioPaLM: A Large Language Model That Can Speak and Listen.
[152] Ivan Kobyzev, Simon J. D. Prince, and Marcus A. Brubaker. “Normalizing Flows: An June 22, 2023. DOI: 10 . 48550 / arXiv . 2306 . 12925. arXiv: 2306 . 12925 [cs, eess,
Introduction and Review of Current Methods”. In: IEEE Transactions on Pattern Analysis stat]. URL: http://arxiv.org/abs/2306.12925. preprint.
and Machine Intelligence 43.11 (Nov. 1, 2021), pp. 3964–3979. ISSN: 0162-8828, 2160- [182] Hugo Flores Garcia et al. VampNet: Music Generation via Masked Acoustic Token Modeling.
9292, 1939-3539. DOI: 10.1109/TPAMI.2020.2992934. arXiv: 1908.09257 [cs, stat]. July 12, 2023. DOI: 10.48550/arXiv.2307.04686. arXiv: 2307.04686 [cs, eess]. URL:
[153] Aaron Oord et al. “Parallel WaveNet: Fast High-Fidelity Speech Synthesis”. In: Proceedings http://arxiv.org/abs/2307.04686. preprint.
of the 35th International Conference on Machine Learning. International Conference on [183] Zhichao Wang et al. LM-VC: Zero-shot Voice Conversion via Speech Generation Based on
Machine Learning. PMLR, July 3, 2018, pp. 3918–3926. Language Models. Aug. 20, 2023. DOI: 10.48550/arXiv.2306.10521. arXiv: 2306.10521
[154] Wei Ping, Kainan Peng, and Jitong Chen. ClariNet: Parallel Wave Generation in End-to-End [cs, eess]. URL: http://arxiv.org/abs/2306.10521. preprint.
Text-to-Speech. Feb. 21, 2019. DOI: 10.48550/arXiv.1807.07281. arXiv: 1807.07281 [cs, [184] Dongchao Yang et al. UniAudio: An Audio Foundation Model Toward Universal Audio
eess]. URL: http://arxiv.org/abs/1807.07281. preprint. Generation. Dec. 11, 2023. DOI: 10 . 48550 / arXiv. 2310 . 00704. arXiv: 2310 . 00704 [cs,
[155] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. “Waveglow: A Flow-based Generative eess]. URL: http://arxiv.org/abs/2310.00704. preprint.
Network for Speech Synthesis”. In: ICASSP 2019 - 2019 IEEE International Conference on [185] Jade Copet et al. “Simple and Controllable Music Generation”. In: Advances in Neural
Acoustics, Speech and Signal Processing (ICASSP). ICASSP 2019 - 2019 IEEE International Information Processing Systems 36 (Dec. 15, 2023), pp. 47704–47720.
Conference on Acoustics, Speech and Signal Processing (ICASSP). May 2019, pp. 3617– [186] Chris Donahue et al. LakhNES: Improving Multi-Instrumental Music Generation with Cross-
3621. DOI: 10.1109/ICASSP.2019.8683143. Domain Pre-Training. July 10, 2019. DOI: 10.48550/arXiv.1907.04868. arXiv: 1907.04868
[156] Sungwon Kim et al. FloWaveNet : A Generative Flow for Raw Audio. May 20, 2019. DOI: [cs, eess, stat]. URL: http://arxiv.org/abs/1907.04868. preprint.
10.48550/arXiv.1811.02155. arXiv: 1811.02155 [cs, eess]. URL: http://arxiv.org/abs/ [187] Yu-Siang Huang and Yi-Hsuan Yang. “Pop Music Transformer: Beat-based Modeling and
1811.02155. preprint. Generation of Expressive Pop Piano Compositions”. In: Proceedings of the 28th ACM
[157] Jaehyeon Kim et al. “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic International Conference on Multimedia. MM ’20. New York, NY, USA: Association for
Alignment Search”. In: Advances in Neural Information Processing Systems. Vol. 33. Curran Computing Machinery, Oct. 12, 2020, pp. 1180–1188. ISBN: 978-1-4503-7988-5. DOI: 10.
Associates, Inc., 2020, pp. 8067–8077. 1145/3394171.3413671.
[158] Hyeongju Kim et al. WaveNODE: A Continuous Normalizing Flow for Speech Synthesis. [188] Zihang Dai et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length
July 2, 2020. DOI: 10.48550/arXiv.2006.04598. arXiv: 2006.04598 [cs, eess]. URL: Context. June 2, 2019. arXiv: 1901 . 02860 [cs, stat]. URL: http : / / arxiv . org / abs /
http://arxiv.org/abs/2006.04598. preprint. 1901.02860. preprint.
[159] Wei Ping et al. “WaveFlow: A Compact Flow-based Model for Raw Audio”. In: Proceedings [189] Curtis Hawthorne et al. “General-Purpose, Long-Context Autoregressive Modeling with
of the 37th International Conference on Machine Learning. International Conference on Perceiver AR”. In: Proceedings of the 39th International Conference on Machine Learning.
Machine Learning. PMLR, Nov. 21, 2020, pp. 7706–7716. International Conference on Machine Learning. PMLR, June 28, 2022, pp. 8535–8558.
[160] Hyun-Wook Yoon et al. Audio Dequantization for High Fidelity Audio Generation in Flow- [190] Botao Yu et al. “Museformer: Transformer with Fine- and Coarse-Grained Attention for
based Neural Vocoder. Aug. 16, 2020. DOI: 10.48550/arXiv.2008.06867. arXiv: 2008.06867 Music Generation”. In: Advances in Neural Information Processing Systems 35 (Dec. 6,
[cs, eess]. URL: http://arxiv.org/abs/2008.06867. preprint. 2022), pp. 1376–1388.
[161] Aston Zhang et al. Dive into Deep Learning. Aug. 22, 2023. DOI: 10 . 48550 / arXiv. 2106 . [191] He Bai et al. “A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthe-
11342. arXiv: 2106.11342 [cs]. URL: http://arxiv.org/abs/2106.11342. preprint. sis and Editing”. In: Proceedings of the 39th International Conference on Machine Learning.
International Conference on Machine Learning. PMLR, June 28, 2022, pp. 1399–1411.
[192] Anmol Gulati et al. Conformer: Convolution-augmented Transformer for Speech Recogni- [221] Soumi Maiti et al. “Speechlmscore: Evaluating Speech Generation Using Speech Language
tion. May 16, 2020. arXiv: 2005 . 08100 [cs, eess]. URL: http : / / arxiv. org / abs / 2005 . Model”. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech
08100. preprint. and Signal Processing (ICASSP). ICASSP 2023 - 2023 IEEE International Conference on
[193] Yi-Chen Chen et al. SpeechNet: A Universal Modularized Model for Speech Processing Acoustics, Speech and Signal Processing (ICASSP). June 2023, pp. 1–5. DOI: 10 . 1109 /
Tasks. May 31, 2021. DOI: 10.48550/arXiv.2105.03070. arXiv: 2105.03070 [cs, eess]. ICASSP49357.2023.10095710.
URL : http://arxiv.org/abs/2105.03070. preprint. [222] Sercan Arik et al. “Neural Voice Cloning with a Few Samples”. In: Advances in Neural
[194] Sravya Popuri et al. Enhanced Direct Speech-to-Speech Translation Using Self-supervised Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018.
Pre-training and Data Augmentation. Sept. 13, 2022. DOI: 10 . 48550 / arXiv. 2204 . 02967. [223] Shawn Hershey et al. “CNN Architectures for Large-Scale Audio Classification”. In: 2017
arXiv: 2204.02967 [cs, eess]. URL: http://arxiv.org/abs/2204.02967. preprint. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017
[195] Haohe Liu et al. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. Sept. 9, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Mar.
2023. DOI: 10 . 48550 / arXiv. 2301 . 12503. arXiv: 2301 . 12503 [cs, eess]. URL: http : 2017, pp. 131–135. DOI: 10.1109/ICASSP.2017.7952132.
//arxiv.org/abs/2301.12503. preprint. [224] Qiuqiang Kong et al. “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio
[196] Nanxin Chen et al. WaveGrad: Estimating Gradients for Waveform Generation. Oct. 9, 2020. Pattern Recognition”. In: IEEE/ACM Transactions on Audio, Speech, and Language Pro-
DOI : 10.48550/arXiv.2009.00713. arXiv: 2009.00713 [cs, eess, stat]. URL : http: cessing 28 (2020), pp. 2880–2894. ISSN: 2329-9304. DOI: 10.1109/TASLP.2020.3030497.
//arxiv.org/abs/2009.00713. preprint.
[197] Zhifeng Kong et al. DiffWave: A Versatile Diffusion Model for Audio Synthesis. Mar. 30, 2021. DOI: 10.48550/arXiv.2009.09761. arXiv: 2009.09761 [cs, eess, stat]. URL: http://arxiv.org/abs/2009.09761. preprint.
[198] Myeonghun Jeong et al. Diff-TTS: A Denoising Diffusion Model for Text-to-Speech. Apr. 3, 2021. DOI: 10.48550/arXiv.2104.01409. arXiv: 2104.01409 [cs, eess]. URL: http://arxiv.org/abs/2104.01409. preprint.
[199] Vadim Popov et al. “Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech”. In: Proceedings of the 38th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, July 1, 2021, pp. 8599–8608.
[200] Yen-Ju Lu, Yu Tsao, and Shinji Watanabe. “A Study on Speech Enhancement Based on Diffusion Probabilistic Model”. In: 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Dec. 2021, pp. 659–666.
[201] Rongjie Huang et al. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. Apr. 21, 2022. DOI: 10.48550/arXiv.2204.09934. arXiv: 2204.09934 [cs, eess]. URL: http://arxiv.org/abs/2204.09934. preprint.
[202] Yen-Ju Lu et al. “Conditional Diffusion Probabilistic Model for Speech Enhancement”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). May 2022, pp. 7402–7406. DOI: 10.1109/ICASSP43922.2022.9746901.
[203] Heeseung Kim, Sungwon Kim, and Sungroh Yoon. “Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance”. In: Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, June 28, 2022, pp. 11119–11133.
[204] Sungwon Kim, Heeseung Kim, and Sungroh Yoon. Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data. May 30, 2022. DOI: 10.48550/arXiv.2205.15370. arXiv: 2205.15370 [cs, eess]. URL: http://arxiv.org/abs/2205.15370. preprint.
[205] Jinglin Liu et al. “DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism”. In: Proceedings of the AAAI Conference on Artificial Intelligence 36.10 (June 28, 2022), pp. 11020–11028. ISSN: 2374-3468. DOI: 10.1609/aaai.v36i10.21350.
[206] Joan Serrà et al. Universal Speech Enhancement with Score-based Diffusion. Sept. 16, 2022. DOI: 10.48550/arXiv.2206.03065. arXiv: 2206.03065 [cs, eess]. URL: http://arxiv.org/abs/2206.03065. preprint.
[207] Qingqing Huang et al. Noise2Music: Text-conditioned Music Generation with Diffusion Models. Mar. 6, 2023. DOI: 10.48550/arXiv.2302.03917. arXiv: 2302.03917 [cs, eess]. URL: http://arxiv.org/abs/2302.03917. preprint.
[208] Shentong Mo, Jing Shi, and Yapeng Tian. DiffAVA: Personalized Text-to-Audio Generation with Visual Alignment. May 22, 2023. DOI: 10.48550/arXiv.2305.12903. arXiv: 2305.12903 [cs]. URL: http://arxiv.org/abs/2305.12903. preprint.
[209] Max W. Y. Lam et al. Efficient Neural Music Generation. May 25, 2023. DOI: 10.48550/arXiv.2305.15719. arXiv: 2305.15719 [cs, eess]. URL: http://arxiv.org/abs/2305.15719. preprint.
[210] Zhibin Qiu et al. “SRTNET: Time Domain Speech Enhancement via Stochastic Refinement”. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). June 2023, pp. 1–5. DOI: 10.1109/ICASSP49357.2023.10095850.
[211] Ke Chen et al. MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies. Aug. 3, 2023. DOI: 10.48550/arXiv.2308.01546. arXiv: 2308.01546 [cs, eess]. URL: http://arxiv.org/abs/2308.01546. preprint.
[212] Peike Li et al. JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models. Aug. 9, 2023. DOI: 10.48550/arXiv.2308.04729. arXiv: 2308.04729 [cs, eess]. URL: http://arxiv.org/abs/2308.04729. preprint.
[213] Yatong Bai et al. Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation. Sept. 19, 2023. DOI: 10.48550/arXiv.2309.10740. arXiv: 2309.10740 [cs, eess]. URL: http://arxiv.org/abs/2309.10740. preprint.
[214] Pengfei Zhu et al. ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models. Sept. 21, 2023. DOI: 10.48550/arXiv.2302.04456. arXiv: 2302.04456 [cs, eess]. URL: http://arxiv.org/abs/2302.04456. preprint.
[215] Yi Yuan et al. Retrieval-Augmented Text-to-Audio Generation. Jan. 5, 2024. DOI: 10.48550/arXiv.2309.08051. arXiv: 2309.08051 [cs, eess]. URL: http://arxiv.org/abs/2309.08051. preprint.
[216] Curtis Hawthorne et al. Multi-Instrument Music Synthesis with Spectrogram Diffusion. Dec. 12, 2022. DOI: 10.48550/arXiv.2206.05408. arXiv: 2206.05408 [cs, eess]. URL: http://arxiv.org/abs/2206.05408. preprint.
[217] Minki Kang, Dongchan Min, and Sung Ju Hwang. “Grad-StyleSpeech: Any-Speaker Adaptive Text-to-Speech Synthesis with Diffusion Models”. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). June 2023, pp. 1–5. DOI: 10.1109/ICASSP49357.2023.10095515.
[218] Rongjie Huang et al. “Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models”. In: Proceedings of the 40th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, July 3, 2023, pp. 13916–13932.
[219] Haohe Liu et al. AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. Sept. 9, 2023. DOI: 10.48550/arXiv.2308.05734. arXiv: 2308.05734 [cs, eess]. URL: http://arxiv.org/abs/2308.05734. preprint.
[220] Flavio Schneider et al. Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion. Oct. 23, 2023. DOI: 10.48550/arXiv.2301.11757. arXiv: 2301.11757 [cs, eess]. URL: http://arxiv.org/abs/2301.11757. preprint.
[221] Soumi Maiti et al. “Speechlmscore: Evaluating Speech Generation Using Speech Language Model”. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). June 2023, pp. 1–5. DOI: 10.1109/ICASSP49357.2023.10095710.
[222] Sercan Arik et al. “Neural Voice Cloning with a Few Samples”. In: Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018.
[223] Shawn Hershey et al. “CNN Architectures for Large-Scale Audio Classification”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Mar. 2017, pp. 131–135. DOI: 10.1109/ICASSP.2017.7952132.
[224] Qiuqiang Kong et al. “PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 2880–2894. ISSN: 2329-9304. DOI: 10.1109/TASLP.2020.3030497.