Voice Conversion with Deep Learning


Miguel Varela Ramos, Instituto Superior Técnico, Universidade de Lisboa

Abstract—Voice conversion techniques aim to modify a subject's voice characteristics in order to sound like someone else. In this paper, we mainly focus on the spectral mapping component of voice conversion. Most existing deep learning based voice conversion systems rely on a previous frame alignment step using, for instance, dynamic time warping before the conversion of the speaker's spectra. We propose a sequence to sequence model with attention which completely avoids this explicit alignment, mapping spectral features with different lengths. In order to establish a baseline, we implemented a deep bidirectional long short-term memory recurrent neural network based voice conversion system, which is considered to achieve some of the current state of the art results and resorts to the feature pre-alignment step. The experiments conducted revealed that our model is capable of achieving state of the art results if allowed to have a peek into the target features, but struggles with prediction if prevented from accessing this information. Both models are evaluated using an objective metric and the poor performance during prediction is discussed, as well as some possible solutions that can be explored.

Index Terms—voice conversion, deep learning, attention models, sequence to sequence, recurrent neural networks, speech processing

I. INTRODUCTION

This work aims to perform an operation of voice conversion between two different subjects. Voice Conversion (VC) is defined as the process of, given an utterance with the source speaker S's voice, reproducing the same utterance in the target speaker T's voice. Speaker S's utterances should acquire qualities inherent to speaker T, like timbre and pitch, while saying a completely different utterance. This is possible due to the physical characteristics of each individual, associated with one's vocal tract and glottal source, which differ from person to person.

Ideally, a voice conversion technology would convert both timbral and prosodic characteristics from a speaker. In this paper we focus on the conversion of timbral characteristics, more specifically spectral features, while leaving some prosodic characteristics of the source unaltered. Spectral features are believed to convey more speaker individuality and are easier to extract and model [1], thus being the focus of our work. Although the success of voice conversion systems is still not at desired levels regarding speech quality and naturalness, this research area has been making relevant breakthroughs in recent years.

Applications such as identity change in a text-to-speech (TTS) system are possible with the use of a voice conversion system, without resorting to a huge number of parameters and the inherent limitations of the current methods. Other potential applications include security-related usage (e.g. hiding the identity of the speaker), vocal restoration, speech-to-speech translation and movie dubbing, among others.

Many techniques have been developed to tackle the VC problem. One of the most popular approaches is the Joint Density Gaussian Mixture Model (JD-GMM) based system [2], [3], which models the joint density between data of the source and target speakers using a GMM and finds local linear transformations for each Gaussian used. Depending on the criteria used to optimize the conversion function, the implementation of JD-GMM will vary. The most popular criteria are the Mean Square Error (MSE) [2] and Maximum Likelihood (ML) trajectory generation [3], which has been reported to produce better results. Nevertheless, JD-GMM based VC systems have two main problems: overfitting the data and over-smoothing of the converted spectra. In order to mitigate these problems, techniques such as Global Variance (GV) [4] and a mutual information criterion [5] were proposed.

Other techniques have reported some degree of success in performing voice conversion, and even outperforming the JD-GMM approaches. Among these techniques, some of the most relevant are Dynamic Kernel Partial Least Squares Regression (DKPLS) [6], exemplar based approaches [7] and deep learning approaches with neural networks [8], [9]. With the deep learning approaches, a more efficient way of effectively modelling temporal dependencies in the spectral features was introduced. In [8] the authors use a Recurrent Temporal Restricted Boltzmann Machine (RTRBM) to model sequences. This concept is similar to the operation of a Recurrent Neural Network (RNN) and, as such, in [9] the authors introduced a Deep Bidirectional Long Short-Term Memory (DBLSTM) RNN to improve the learning process of the temporal dependencies.

The above models resort to a step of feature alignment before the conversion process. This can be a source of errors because the alignment algorithms used, such as Dynamic Time Warping (DTW) [10], align the features by adding or removing time steps in order to make the lengths match. When training a voice conversion system with aligned data, the model will learn altered time dependencies, which can result in worse performance.

In order to bypass the alignment process and be able to map different length sequences, we propose the use of sequence to sequence models [11] in combination with an attention mechanism [12]. These techniques have been very successful in machine translation tasks and have shown great potential in tasks involving sequences with different lengths.

The organization of this paper is as follows: the two different RNN architectures and the baseline DBLSTM model are described in Section II. Sequence to sequence and attention models are introduced in Section III, together with the proposed approach for a VC system. To evaluate the performance of our approach, objective experiments were conducted, and the results are presented in Section IV. A more extensive and critical analysis is presented in Section V. Finally, Section VI summarizes and presents some conclusions of this work, as well as possible future work directions.
II. BASELINE: DEEP BIDIRECTIONAL RECURRENT NEURAL NETWORKS

A. Recurrent Neural Networks

An RNN is a type of neural network that has the capability of learning based on the content of previous data. Taking a real world example: if we consider our data to be the text content of this section, a human being would process the meaning of each word based on the context of the previous words of the sentence. RNNs model the same behaviour with internal loops, allowing information to persist in the network.

For a traditional RNN (also known as a vanilla RNN), given an input sequence x = (x_1, · · · , x_T), the hidden vector h = (h_1, · · · , h_T) and the predicted output vector y = (y_1, · · · , y_T) can be computed from t = 1 to T according to the following iterative equations:

h_t = σ(W_h · [h_{t-1}, x_t] + b_h)    (1)
y_t = W_y · [h_{t-1}, x_t] + b_y    (2)

where h_t corresponds to the RNN's hidden state, x_t to the input, y_t to the output, σ to a non-linear activation function, and W and b are weight matrices and biases, corresponding to the network's parameters.
However, vanilla RNNs have difficulties learning long-term dependencies if the information is spanned through longer periods of time. Training these networks with the traditional back propagation through time (BPTT) algorithm has proved to be extremely difficult due to the exploding and vanishing gradient problems [13]. Motivated by this, the Long Short-Term Memory (LSTM), and later the Gated Recurrent Unit (GRU), a simplification of the LSTM, were proposed.

1) Long Short-term Memory: The LSTM was originally proposed by Hochreiter and Schmidhuber in 1997 [14] and developed by many in following works. The current form of the LSTM is capable of learning long-term dependencies and works very well in a wide variety of problems.

While vanilla RNNs have a very simple cell structure made of a single neural network, LSTMs have various structures called gates (four in total). These gates interact with each other to process information in order to better handle temporal relations. Each of these gates can be viewed as a separate neural network, and when combined together, they form an LSTM cell.

The LSTM equations replace equation (1) in the RNN operation and are described as follows:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)    (3)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)    (4)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (5)
c_t = f_t ∗ c_{t-1} + i_t ∗ c̃_t    (6)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (7)
h_t = o_t ∗ tanh(c_t)    (8)

where i, f, o, c, c̃ refer to the input gate, forget gate, output gate, cell state and shadow gate, respectively. The LSTM cell receives information continuously from the previous states via its c_{t-1} and h_{t-1} inputs, just like a conveyor belt. With this information, combined with the cell's parameters, the LSTM is able to decide what to read, write and forget.
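The following NumPy sketch implements one LSTM step exactly as in equations (3)-(8); the concatenation convention and the weight shapes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following equations (3)-(8).

    params is a dict with weight matrices W_f, W_i, W_c, W_o of shape
    (hidden_dim + input_dim, hidden_dim) and bias vectors b_f, b_i, b_c, b_o."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(z @ params["W_f"] + params["b_f"])      # forget gate, eq. (3)
    i_t = sigmoid(z @ params["W_i"] + params["b_i"])      # input gate,  eq. (4)
    c_tilde = np.tanh(z @ params["W_c"] + params["b_c"])  # candidate ("shadow") state, eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde                    # new cell state, eq. (6)
    o_t = sigmoid(z @ params["W_o"] + params["b_o"])      # output gate, eq. (7)
    h_t = o_t * np.tanh(c_t)                              # new hidden state, eq. (8)
    return h_t, c_t
```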
2) Gated Recurrent Unit: The GRU was proposed by Cho et al. (2014) in [11] as a variation of the current LSTM architecture. While the LSTM has to coordinate its writes and forgets, the GRU links them explicitly into one single gate, called the update gate. In other words, instead of doing selective writes and selective forgets, the GRU takes a step further, and instead does selective overwrites, by setting the forget gate to 1 minus the input (write) gate. The fundamental equations of the GRU are described as:

z_t = σ(W_z · [h_{t-1}, x_t])    (9)
r_t = σ(W_r · [h_{t-1}, x_t])    (10)
h̃_t = tanh(W · [r_t ∗ h_{t-1}, x_t])    (11)
h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ h̃_t    (12)

where x_t is the current input, h_{t-1} is the previous hidden state, z_t is the update gate, r_t is the reset gate, h̃_t is the shadow gate, h_t is the current hidden state, and W and b are the weight matrices and bias vectors respectively.

Different studies have compared the performance of several LSTM variants, reaching the conclusion that they are similar. However, different architectures work best on different problems [15]. One of the great advantages of the GRU is the lower number of gates, making it less demanding on computation power to train.
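A minimal NumPy sketch of one GRU step following equations (9)-(12); the weight shapes are illustrative and, as in the equations above, biases are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU time step following equations (9)-(12).
    Each weight matrix has shape (hidden_dim + input_dim, hidden_dim)."""
    z_in = np.concatenate([h_prev, x_t])                          # [h_{t-1}, x_t]
    z_t = sigmoid(z_in @ W_z)                                     # update gate, eq. (9)
    r_t = sigmoid(z_in @ W_r)                                     # reset gate,  eq. (10)
    h_tilde = np.tanh(np.concatenate([r_t * h_prev, x_t]) @ W)    # shadow gate, eq. (11)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # new hidden state, eq. (12)
```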
3) Deep Bidirectional RNNs: In order to improve the modelling of long sequences, a common technique within RNNs is the use of a bidirectional network. A bidirectional RNN is no more than a group of two distinct RNNs that process the data from the two ends of the sequence. A forward RNN processes the sequence regularly from start to end, while a backward RNN inverts the input's time axis and processes it from end to start. The outputs from both networks are then joined with a merge process. For simplicity, we admit the merge process to be a sum. In addition, when dealing with complex problems better represented in a higher dimensional feature space, it is common to add more layers to improve the model's performance. By combining both the deep and bidirectional concepts applied to RNNs, we obtain the DBRNN model. In figure 1, we depict the operation of a two layer deep bidirectional RNN.

Fig. 1. A two layer deep bidirectional recurrent neural network. The variable x represents the input, h represents the hidden state and ŷ represents the network's prediction of the target y.

B. Baseline Framework

The baseline framework adopted in this paper is based on the architecture proposed in [9], with Deep Bidirectional LSTMs. However, in this work we add a few minor twists to this approach. The DBLSTM approach consists of a 6 layer DBLSTM that maps aligned Mel Generalized Cepstral (MGC) features [16] between source and target speakers. Speech parameters are extracted via STRAIGHT analysis [17], including a smooth spectrogram, fundamental frequency (F0) and an aperiodic component. MGCs are derived from the spectral envelope. The aperiodic component is defined as the ratio between the lower and upper smoothed spectral envelopes in the frequency domain. After the feature extraction process, the alignment of the features is made with DTW. Once the alignment is complete, the model is trained using a simple back propagation algorithm in order to be able to convert the MGC features. The aperiodic component is kept from the source speaker and the log F0 is converted through a popular linear conversion method, by equalizing the mean and the standard deviation of the source and target speech. An illustration of the baseline DBLSTM framework is shown in figure 2.

Fig. 2. Overview of the proposed baseline voice conversion framework.
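The log F0 conversion described above (equalizing the mean and standard deviation of source and target) reduces to a simple linear transformation, sketched below; the statistics are assumed to be computed over voiced frames of the training data, and the variable names are illustrative.

```python
import numpy as np

def voiced_stats(lf0_frames):
    """Mean and standard deviation of log F0 over voiced frames
    (unvoiced frames assumed to be stored as zeros)."""
    voiced = lf0_frames[lf0_frames > 0]
    return voiced.mean(), voiced.std()

def convert_log_f0(lf0_source, mean_src, std_src, mean_tgt, std_tgt):
    """Linear log F0 conversion: match the target speaker's mean and standard deviation."""
    return (lf0_source - mean_src) / std_src * std_tgt + mean_tgt
```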
(x1 , . . . , xTx ), and the target sequence as y = (y1 , . . . , yTy ),
the encoder is a recurrent neural network that reads each
time step xt sequentially, and summarizes it into a fixed-
length context vector c. This context vector should be able
to summarize the whole sequence information. The decoder
consists of another RNN that takes the context vector c as
input, and progressively generates each time step of the output
yt , given the current hidden state ht . However, both ht and
yt generated by the decoder at each time step are conditioned
by yt−1 and the context c. In order to train such a model,
Fig. 2. Overview of the proposed baseline voice conversion framework.
both encoder and decoder are jointly trained. An illustrative
scheme of a simple sequence to sequence model architecture
is shown in figure 4.
C. Modifications To The Baseline Framework
We propose the replacement of the LSTM cell with a GRU
cell, in order to accelerate training convergence and spare
memory computational resources. To improve generalization,
a dropout layer is added after each bidirectional LSTM layer.
The model architecture for both DBLSTM and Deep Bidirec-
tional Gated Recurrent Unit (DBGRU) is depicted in figure
3.

III. P ROPOSED : D EEP S EQUENCE T O S EQUENCE


ATTENTION M ODEL
A. Sequence To Sequence Models
The encoder decoder architecture, also known as sequence
to sequence, was initially proposed by Cho et al. in [11]. Fig. 4. An illustration of the Encoder Decoder architecture. Extracted from
The sequence to sequence model learns to encode a variable- [11].
length sequence into a fixed-length vector representation and
B. Attention Models

Attention-based decoder models are closely related to the sequence to sequence architecture since they also decode an input of length Tx into an output of length Ty. Usually, the input x = (x_1, . . . , x_{Tx}) initially goes through an encoder which outputs a sequential input representation h = (h_1, . . . , h_{Tx}) more suitable for the attention decoder to work with.

Fig. 5. An illustration of the attention decoder architecture. Extracted from [18].

As depicted in figure 5, taking as input the encoded sequence h, the attention decoder first filters the information with an attention mechanism to obtain the attended input. This attention mechanism, or attention model, is a neural network that scores each input h_j with respect to the current hidden state of the decoder s_{i-1}. The scores e_{i,j} are computed as

e_{i,j} = a(s_{i-1}, h)    (13)

where a is an alignment model parametrized as a feed-forward neural network, jointly trained with the other components of the decoder, and the state s_{i-1} is associated with a recurrent neural network. This RNN, usually a GRU or an LSTM, unrolls for Ty time steps along with the target y. In [18] a is defined as:

e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b)    (14)

where w and b are vectors, and W and V are matrices.

The a operation attributes a value to each score e_{i,j} that has no lower or upper bound. In order to get well defined weights α_{i,j}, also called the attention weights, a softmax function is computed on the scores to make them sum to 1:

α_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{Tx} exp(e_{i,k})    (15)

To filter the input, the attention weights α_{i,j} are multiplied by the input and summed over the time axis, obtaining what is called a glimpse g_i:

g_i = Σ_{j=1}^{Tx} α_{i,j} h_j    (16)

Combining information from the previous RNN state s_{i-1}, the current glimpse g_i, and the previous target value y_{i-1}, a new state of the RNN s_i is computed as

s_i = Recurrency(s_{i-1}, y_{i-1}, g_i)    (17)

In equation (13) we considered a content-based attention. However, it is possible to have a location based attention if e_{i,j} = a(s_{i-1}, α_{i-1}), or even a hybrid attention version if e_{i,j} = a(s_{i-1}, h, α_{i-1}).
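A NumPy sketch of one content-based attention step, following equations (13)-(17); the scorer shapes and the generic recurrency callback are assumptions made for illustration, not the exact implementation used in this work.

```python
import numpy as np

def attention_step(h_enc, s_prev, y_prev, w, W, V, b, recurrency):
    """One content-based attention decoder step, eqs. (13)-(17).

    h_enc: (Tx, enc_dim) encoded input sequence, s_prev: (dec_dim,) previous
    decoder state, y_prev: previous target (or predicted) frame.
    W: (dec_dim, att_dim), V: (enc_dim, att_dim), b and w: (att_dim,)."""
    # Scores e_{i,j} = w^T tanh(W s_{i-1} + V h_j + b), eqs. (13)-(14)
    e = np.tanh(s_prev @ W + h_enc @ V + b) @ w          # shape (Tx,)
    # Attention weights: softmax over the input time axis, eq. (15)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # Glimpse: attention-weighted sum of the encoded inputs, eq. (16)
    g = alpha @ h_enc                                    # shape (enc_dim,)
    # New decoder state, eq. (17); `recurrency` could wrap a GRU or LSTM step
    s = recurrency(s_prev, y_prev, g)
    return s, alpha, g
```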
C. Proposed Framework

Inspired by [18], we propose a sequence to sequence attention model to tackle the problem of spectral mapping for spectral features of different lengths. We chose a content-based attention, both for simplicity and to allow the model to extract information from other, more distant, neighbouring frames if relevant. However, since we are dealing with long sequences that have a strong temporal correlation, a hybrid attention model would also make sense.

Our proposed architecture, which we call a deep sequence to sequence attention model, is depicted in figure 6, and makes use of the ideas proposed by the baseline work [9], along with the content-based attention mechanism already introduced. This architecture is constituted by stacked GRU encoders, similar to what is done in the baseline framework, followed by a GRU attention decoder. The GRU architecture is chosen over the LSTM because of its lower computational requirements.

Fig. 6. An illustration of the proposed architecture for the deep sequence to sequence attention model.

The encoders' purpose is to process the input into a high dimensional feature space, in order to facilitate the decoding process. The attention decoder receives a full context sequence from the stacked encoders, applies its attention mechanism, and decodes the sequence of length Tx into a sequence of length Ty. This layer is of critical importance in this architecture, because it decides on which parts of the input the decoder should focus, so that an alignment between source and target is possible.
Due to the high number of time steps of our spectral features, the encoders are bidirectional to better capture temporal dependencies in both ends of the sequences. However, because of the step by step prediction limitation, the attention decoder can not be bidirectional.

D. Dealing with Prediction Errors

The attention decoder makes use of the ground truth, or target value, y, which is only accessible during the training process. If we desire to use the model to predict an utterance that is outside our training and validation set, this information is not available. In order to do a prediction, the model must be able to use the information of the output generated by itself, ŷ, and then use it recurrently step-by-step, until it generates the full sequence. If there were other layers after the decoder, the ŷ prediction should be a result of a step by step forward pass through the decoder and the remaining layers after it. This method of prediction does not allow bidirectional layers for the decoder or any other layer after the decoder, since at each instant of time of prediction, both forward and backward cells would be processing a different time step.
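The step-by-step prediction just described corresponds to the following autoregressive loop, in which the decoder consumes its own previous output; `decoder_step`, the initial states and the fixed number of steps are illustrative placeholders rather than the actual implementation.

```python
import numpy as np

def generate(h_enc, decoder_step, s_init, y_init, n_steps):
    """Autoregressive prediction: at each step the decoder receives its own
    previous prediction y_hat in place of the ground truth y_{t-1}."""
    s, y_hat = s_init, y_init
    outputs = []
    for _ in range(n_steps):
        # decoder_step attends over h_enc and returns the new state and output frame
        s, y_hat = decoder_step(s, y_hat, h_enc)
        outputs.append(y_hat)
    return np.stack(outputs)
```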
During training we experimented with two approaches. The first one relies solely on the ground truth for training and validation, and then tests by feeding back the output predicted by the model to itself until the complete sequence is generated. The second approach consists in progressively alternating the ground truth with the model predictions during the training. The latter technique is based on the idea that, although the true value of y is always available during the process of training, it might not be desirable to always provide the ground truth: when predicting with ŷ, the model may generate errors and propagate such errors throughout the whole sequence. For this reason, a way to make the model handle its own errors during prediction is proposed. The work of Bengio et al. in [19] introduces this exact same concept, which the authors call scheduled sampling, and inspired by it, we implemented it in our training. However, we are dealing with a continuous space of output features, instead of a discrete space as in the work where this technique was proposed.

Scheduled sampling consists of progressively alternating the model's y_{t-1} input between the ground truth value and the value predicted by the model itself, ŷ. The use of either one of these values is decided through a flip of a coin, with probability ε_i for the ground truth value and (1 − ε_i) for the value predicted by the model. When ε_i = 1, the model is trained only using true values of the target, and when ε_i = 0, the model is trained in the same setting as inference. An illustration of the process is depicted in figure 7.

Fig. 7. Illustration of the scheduled sampling process inspired by the work of [19].

In order to gradually introduce the changes between true and predicted values of y, the authors proposed controlling the probability ε_i in a decaying fashion, such that the model has a greater probability of using the true value of y_{t-1} at the beginning of training and then progressively decays towards ŷ_{t-1} by the end of training. The different types of decay functions proposed by the authors are:

• Linear decay: ε_i = max(ε, k − c·i), where 0 ≤ ε < 1 is the minimum amount of truth to be given to the model, and k and c provide the offset and slope of the decay, which depend on the expected speed of convergence.
• Inverse sigmoid decay: ε_i = k / (k + exp(i/k)), where k ≥ 1 depends on the expected speed of convergence.
• Exponential decay: ε_i = k^i, where k < 1 is a constant that depends on the expected speed of convergence.

In this work, we take ε_i to follow an inverse sigmoid decay, as it has the smoothest transition behaviour. Examples of the different decay functions of ε_i with different values of k are depicted in figure 8, and a code sketch of these schedules is given below.

Fig. 8. Different functions of ε_i for the curriculum learning approach. The vertical axis corresponds to the probability ε_i and the horizontal axis corresponds to the number of iterations.
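The three decay schedules and the coin flip that selects between the ground truth y_{t-1} and the model's own prediction ŷ_{t-1} can be sketched as follows; the constants are illustrative, not the values used in our experiments.

```python
import numpy as np

def eps_linear(i, k=1.0, c=1e-4, eps_min=0.0):
    """Linear decay: eps_i = max(eps, k - c*i)."""
    return max(eps_min, k - c * i)

def eps_inverse_sigmoid(i, k=1000.0):
    """Inverse sigmoid decay: eps_i = k / (k + exp(i / k)), with k >= 1."""
    return k / (k + np.exp(i / k))

def eps_exponential(i, k=0.9999):
    """Exponential decay: eps_i = k ** i, with k < 1."""
    return k ** i

def scheduled_sampling_input(y_true_prev, y_pred_prev, iteration, rng):
    """Coin flip: feed the ground truth with probability eps_i, otherwise feed
    the model's own previous prediction (inverse sigmoid decay assumed)."""
    eps_i = eps_inverse_sigmoid(iteration)
    return y_true_prev if rng.random() < eps_i else y_pred_prev
```

Here `rng` could be, for instance, `np.random.default_rng(0)`.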
IV. EXPERIMENTS

A. Experimental Setup

In our voice conversion experiments we use the CMU ARCTIC corpus [20]. We select two male speakers (BDL and RMS) and a female speaker (SLT) to be able to perform male to male and male to female conversions. The databases used have a total of 1132 utterances and are divided into three separate data sets, with 80% allocated for training, 20% for validation and 10% for testing.
The acoustic signals are 16 bit, 16 kHz mono wave files. 49-dimensional MGC features are extracted from a real spectrogram outputted by STRAIGHT, with a 2048-point Fast Fourier Transform (FFT) window size. The first coefficient, corresponding to the energy component of the Mel cepstral features, is removed before training, so that it is not modelled by the network.

Both model architectures were implemented in a custom Python framework built on top of Theano [21] for the development of this work, due to the lack of support for sequence to sequence and attention models in other deep learning libraries. In order to feed the data to our model, we resorted to a mini-batch processing approach, a common technique for training neural networks. Samples are grouped into mini-batches and fed through the model in order to be processed. To enable mini-batch processing in an RNN, it is necessary to pad each sample's time axis with zeros, up to that batch's maximum length. However, this process can introduce undesired effects in our training and lead to unnecessary data modelling from the model. To mitigate the effect of the zero padding, we use masking and bucketing techniques. Masking consists in passing each sample's length to the network in order to "mask" the padded time steps. Bucketing is a simple technique that mitigates the amount of zero padding by batching similar length samples. In addition, we normalize the features of each utterance to zero mean and unit standard deviation.
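The padding, masking, bucketing and per-utterance normalization described above can be sketched as follows; the mask convention (1 for valid frames, 0 for padding) and the bucket construction are assumptions made for illustration.

```python
import numpy as np

def normalize_utterance(x):
    """Per-utterance normalization to zero mean and unit standard deviation."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def make_buckets(utterances, batch_size=10):
    """Bucketing: group utterances of similar length so each mini-batch
    needs as little zero padding as possible."""
    ordered = sorted(utterances, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def pad_batch(batch):
    """Zero-pad each (T_i, feat_dim) sample to the batch maximum length and
    build a binary mask marking the valid (non-padded) time steps."""
    max_len = max(len(x) for x in batch)
    feat_dim = batch[0].shape[1]
    padded = np.zeros((len(batch), max_len, feat_dim), dtype=np.float32)
    mask = np.zeros((len(batch), max_len), dtype=np.float32)
    for b, x in enumerate(batch):
        padded[b, :len(x)] = x
        mask[b, :len(x)] = 1.0
    return padded, mask
```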
The training process was conducted using the ADAM [22] optimizer, and the baseline and proposed models were trained for 30 and 60 epochs respectively, with a batch size of 10. An early stopping criterion with a patience mechanism was used in order to guarantee the best model generalization. In addition, an NVidia Titan X graphics processing unit (GPU) was used to enable faster matrix multiplications.

B. Objective Evaluation

We evaluate the different architectures and compare them among each other using the Mel cepstral distortion (MCD) as an objective evaluation metric. MCD is defined as follows:

MCD[dB] = (10 / ln 10) · sqrt( 2 · Σ_{d=1}^{N} (c_d − c_d^{converted})^2 )    (18)
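Equation (18) can be computed per frame as in the sketch below; averaging over frames is an assumption about how the reported values are aggregated, and the inputs are assumed to be time-aligned with the energy coefficient already removed.

```python
import numpy as np

def mel_cepstral_distortion(c_target, c_converted):
    """Mel cepstral distortion in dB, eq. (18), averaged over aligned frames.
    Inputs: (T, N) matrices of MGC coefficients without the energy term."""
    diff = c_target - c_converted                                        # (T, N)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```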
For the baseline models that output predictions of a size different from the target, dynamic time warping between the converted and target features is used right before the evaluation. Performing this alignment is required since both matrices of the converted and target features need to have a similar number of time steps. The proposed sequence to sequence attention model, however, outputs converted utterances with the same length as the target, which eliminates the need for this step.
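For reference, the alignment used before evaluation can be sketched as a plain dynamic time warping over frame-wise Euclidean distances; this is a generic illustration, not the exact DTW configuration used in the experiments.

```python
import numpy as np

def dtw_path(x, y):
    """Dynamic time warping between feature matrices x (Tx, D) and y (Ty, D)
    using Euclidean frame distances; returns the optimal warping path as a
    list of (i, j) index pairs."""
    Tx, Ty = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)   # (Tx, Ty)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j - 1],   # match
                                                  cost[i - 1, j],       # step in x
                                                  cost[i, j - 1])       # step in y
    # Backtrack the minimum-cost path from (Tx, Ty) to (1, 1)
    i, j, path = Tx, Ty, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The aligned frame pairs returned by the path can then be fed to the MCD computation above.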
It is also relevant to point out that, since the energy component is directly extracted from either the source or the target features, this component should be removed from the MGC before evaluating, otherwise it will introduce undesired perturbations.

1) Baseline Approach: DBRNN: The first experiment to establish the baseline of our work is to test our implementation of the work proposed in [9]. In that paper, the authors propose the use of 6 layers of bidirectional LSTMs with hidden layer dimensions of [49, 128, 256, 256, 128, 49]. This configuration intends to progressively create a higher dimensional feature space from layer to layer, in order to try to better represent the speaker dependent information. Nevertheless, with neural networks, the choice of hidden sizes is most often an ad-hoc process, and there is no report of tests with different configurations in the paper.

We tested the model with the originally proposed configuration to perform male to male and male to female conversions. Next, an experiment with a wider hidden size configuration of [128, 256, 256, 256, 256, 128] was performed, to evaluate if the model benefits from larger hidden sizes.

TABLE I
MEL CEPSTRAL DISTORTION VALUES IN dB FOR THE BASELINE DEEP BIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK (DBLSTM) MODEL FOR MALE-MALE AND MALE-FEMALE CONVERSIONS.

                                            MCD (dB)
Model          Source     Target     Train   Validation   Test
DBLSTM         bdl (♂)    rms (♂)    5.82    6.58         6.54
DBLSTM Wide    bdl (♂)    rms (♂)    5.72    6.57         6.51
DBLSTM         bdl (♂)    slt (♀)    6.14    6.41         6.47

From table I, it is possible to observe that we reach values of distortion relatively close to the ones presented in [9], which reported an MCD of around 5.5 dB, despite being slightly higher. This can be attributed to a variety of factors. The results from [9] are only relative to a male to female conversion with speakers AWB (male) and SLT (female). In this work, due to an utterance incompatibility related to the different versions of the CMU ARCTIC database, it was not possible to reproduce the exact same male to female conversion. Moreover, our models were trained for about 30 epochs, with a batch size of 10, until the early stopping condition was reached, which corresponds to about 10 hours. The baseline paper states 48 hours of training, which should correspond to more than 30 epochs, although the GPU the authors used has less computation power than the Titan X, and there is no report of the batch size used. It is also relevant to point out that the original baseline results are not stated to be computed on a test set, regarding the objective measure results presented. Nevertheless, it is possible that a better convergence in our implementation, without losing generalization, can be reached if the maximum patience of the batch update is increased.

With the results obtained from our experiments, it is possible to observe that, on average, the DBLSTM model performs better on male to female conversion. This fact might be attributed to the larger gap in spectrogram differences between male and female individuals, which facilitates the learning process, as opposed to the less pronounced changes in timbre in the case of a male to male conversion.

Regarding the hidden size differences, it is observed that the model benefits from the larger hidden size configuration in the male to male conversion.
Despite the differences in MCD from the original configuration being minimal, this can translate into fewer artifacts and slightly better audio quality in informal listening tests. Experimentation with other, wider configurations could be done with the architecture, but that has to be left for future work.

2) Baseline Approach: LSTM vs GRU: LSTMs have shown very good results on a wide variety of problems, but the recently proposed GRUs are stated to have similar results to LSTMs. In addition, due to the GRU's more compressed architecture, it demands less computational power and has slightly faster convergence times. In this set of experiments, we intended to compare the performance of the DBLSTM and the deep bidirectional gated recurrent unit (DBGRU) models.

TABLE II
MEL CEPSTRAL DISTORTION VALUES COMPARISON BETWEEN THE USE OF A GRU CELL VERSUS THE USE OF AN LSTM CELL IN THE DBRNN MODEL.

                                        MCD (dB)
Model      Source     Target     Train   Validation   Test
DBLSTM     bdl (♂)    rms (♂)    5.82    6.58         6.54
DBGRU      bdl (♂)    rms (♂)    6.09    6.82         6.77

From table II it is possible to observe that the LSTM performed slightly better than the GRU. Nevertheless, the GRU may be ideal in a scenario with less computational power or less GPU memory, in order to obtain a performance similar to what is obtained with the usage of an LSTM, but with a lighter setup.

3) Baseline Approach: Masking vs No Masking: In the baseline architecture, the authors did not report the use of any masking techniques for the training of their model. Since the DBLSTM model processes batches of padded sequences, we argue that masking has a relevant role in improving the model's performance and avoiding modelling unnecessary information. We experimented with two different scenarios, one in which we mask the training and validation padded sequences, and one in which the sequences are padded but not masked. Both scenarios were tested using a DBLSTM model with the original baseline hidden size configuration.

TABLE III
MEL CEPSTRAL DISTORTION VALUES COMPARISON BETWEEN THE USE OF MASKING WITH THE DBLSTM MODEL.

                                                   MCD (dB)
Model                   Source     Target     Train   Validation   Test
DBLSTM - masking        bdl (♂)    rms (♂)    5.82    6.58         6.54
DBLSTM - no masking     bdl (♂)    rms (♂)    5.85    6.61         6.55

Observing the results shown in table III, it is clear that masking has an impact on the model's performance, since it has shown lower Mel cepstral distortion values throughout all the sets it was tested on. Nevertheless, it is relevant to point out that the bucketing technique was used for both cases, and therefore sequences with similar lengths were batched together. Discarding this technique would emphasise the performance differences from the usage of masking.

4) Sequence To Sequence Attention Model: To test our proposed sequence to sequence attention model, we experimented with three different scenarios, and compare them with the best model obtained in the baseline experiments. All the scenarios perform male to male conversions for the same source and target speakers.

The first scenario allows the model prediction process to have the same information it has during training, i.e. we allow the model to access the y_{t-1} information from the target. The purpose of this experiment is to evaluate the model's behaviour during training and how well it is learning from the data.

In the second scenario the model is not allowed to access y_{t-1} from the target, and has to output its own predictions ŷ_{t-1} and use them iteratively to generate the whole sequence. The goal of this experiment is to evaluate the error propagation of the model throughout the whole sequence, due to the prediction errors.

Finally, in the third and last scenario, the model from the first and second scenarios is retrained by progressively introducing the model's predictions instead of the target values of y_t, using the scheduled sampling technique. With this final scenario, the model's capacity for dealing with its own errors may be evaluated.

TABLE IV
MEL CEPSTRAL DISTORTION VALUES COMPARISON FOR MALE TO MALE CONVERSION USING THE PROPOSED SEQUENCE TO SEQUENCE ATTENTION MODEL.

                                                       MCD (dB)
Model                      Source     Target     Train    Validation   Test
DBLSTM Wide                bdl (♂)    rms (♂)    5.72     6.57         6.51
Seq2Seq Attention [S1]     bdl (♂)    rms (♂)    3.70     3.84         3.92
Seq2Seq Attention [S2]     bdl (♂)    rms (♂)    16.48    17.19        16.29
Seq2Seq Attention [S3]     bdl (♂)    rms (♂)    14.99    16.11        16.25

The results from the experiments are presented in table IV. The sequence to sequence attention models have 6 layers of depth and were trained with a [128, 256, 256, 256, 256, 128] hidden size configuration. Training these models took about 60 epochs and a total of 50 hours, using a batch size of 10 sentences on an NVidia Titan X GPU.

In the first scenario (S1), it is observed that the model outperforms the baseline approach, reaching MCD values that are about half the distortions of the DBLSTM model. Nonetheless, these results can only be used as an indicator of how well the model has learned the data, which has shown to be promising, although in a real prediction situation it is not possible to access the true values of y.

In the second scenario (S2), it is clearly noticeable that the model struggles and does not handle its own errors well. The converted speech can barely be identified as speech, as it sounds muffled. This behaviour can be associated with the model relying too much on a correct input coming in from y_{t-1}, which can immediately lead to errors in the first few time steps that propagate throughout the remainder of the sequence.
It is important to keep in mind that we are dealing with a number of time steps that can vary between 2000 and 8000, depending on the length of the sentences. Compared to the initial applications of attention models for machine translation, which dealt with a few dozen time steps, this is a considerable gap.

In the third and final scenario (S3), the model benefits from the introduction of its own predictions during training with the scheduled sampling technique, and the distortion is reduced by about 1 dB. However, the improvements are not substantial, especially if compared to the initial performance of the system in the first scenario. The final converted audio from these experiments still sounds like babbling of some sort, and the language content is still not perceptible.

Fig. 9. Spectrogram of a converted utterance from a sequence to sequence attention model trained with scenario S3.

V. DISCUSSION

During the first half of these experiments we successfully reproduced the baseline model and proposed a few minor modifications that can be useful in scenarios where there are hardware or time limitations, with the use of the GRU, and improved the model's performance with a different hidden size configuration. Nevertheless, it is important to point out that, despite the fact that these voice conversion systems based on deep learning have been demonstrating great success regarding performance, they have a few disadvantages when compared to other, more conventional systems. To start with, we are still dealing with a one-to-one type of voice conversion, meaning that we would need to train a different model for each pair of speakers that we desire to convert. Next, there is the training time, which is far longer when compared to some other VC methods, and finally, there is the considerable amount of parallel data these systems require in order to be trained.

Despite not solving most of the problems exposed, the sequence to sequence attention model proposed was motivated by the problem of the feature alignment requirement in the baseline approach, which introduces errors in the data that might lead to a degradation in the model's performance. In addition, there is no report of the usage of these types of models to solve problems in a continuous space, which increased the motivation for our investigation.

With the deep sequence to sequence attention model proposed, we verified from the experiments that the model is capable of learning from the data with results that surpass the performance of the baseline approach. However, these results are only valid for a situation where the model can access some information from the target, an impossible situation in a real world application. In a situation in which the target information is not available, the model will necessarily generate the output sequence one step at a time, and use its own prediction as target information. If the first few iterations of the generation process include an error factor, the model will be led by this error factor and augment it, since each following time step will have its own error factor.

This error propagation is mitigated in a discrete problem setting by maximizing the likelihood of each candidate for the next output y_t and choosing accordingly. It is even possible to perform a beam search for the N best output candidates and allow the model to make a more informed decision for its output. However, in a continuous feature space, as the search space is infinitely large, it becomes more difficult to use this technique.

Our architecture, as currently presented, is not capable of using its own predictions to generate a real valued output prediction. This can be caused by a variety of factors in the architecture. To start with, the fact that the attention is not conditioned on location, in order to be smoothed and more continuous in time, may be affecting performance. Changing from a pure content based attention to a hybrid attention mechanism can help improve results. Another factor is the robustness of the part of the architecture that deals with the input coming from the target. Introducing a more robust neural network able to handle both values from the ground truth and the error prone predictions may also contribute to the model's improvement. An example of an utterance predicted by our model is presented in figure 9. It is possible to observe a repeating pattern, which is an indicator of the model's poor performance.

There are a few more experiments and techniques that could be tried with our model in order to solve this problem. However, due to the limited time a Master thesis imposes, this research has to be left as future work. With this in mind, our experiments do not provide enough evidence to discard this model for continuous inputs just yet, and rather provide extra motivation to explore possible future solutions to the problem of regression for sequences with different lengths and real valued data.

VI. CONCLUSIONS

A. Contributions

The objective of this thesis was to contribute to the development of a voice conversion system with deep learning techniques that eliminates the requirement for aligning the source and target features.

We started by exploring the state of the art, selecting one of the most promising techniques related to deep learning, and implemented it. The baseline technique consisted of a stack of bidirectional LSTM recurrent neural networks that were intended to map the aligned input spectral features into a target.

The fact that the implementation of the baseline approach relied on the performance of the dynamic time warping algorithm provided the motivation to explore alternatives that could handle an alignment within the model itself.
By taking inspiration from the current state of the art in machine translation, which resorts to the encoder decoder and attention architectures, we explored the same model concepts and applied them to our spectral mapping problem. However, there were some significant conceptual differences between the machine translation and spectral mapping problems, such as the real valued outputs and the considerably higher number of time steps.

Our architecture relies on a stack of GRU encoders to encode the sequence into a higher dimensional representation, also known as a context. The context is decoded step by step by a decoder with an attention mechanism, so that the model can focus on different points of interest in the input. While doing this, the model is at the same time aligning both input and output sequences.

All the experiments with both implemented models were performed on the CMU ARCTIC corpus and the results were evaluated using a Mel cepstral distortion objective measure. It was observed that our implementation of the baseline approach achieved very similar results to the ones reported in the original paper. However, no subjective measure was used for comparison.

The experiments with the sequence to sequence attention model revealed promising results during training, when the model was allowed to peek at the target information. Nonetheless, the experiments in a full prediction environment, where target information was not accessible, turned out to have a poor performance. These results were attributed to the difficulties in error handling during the prediction stage, together with the considerable number of time steps. Despite this turn of events, there is still not enough evidence to disprove the success of a sequence to sequence model with an attention mechanism for dealing with regression problems in a continuous feature space, as there are more experiments that should be conducted.

While implementing the baseline and proposed models, we developed a Python framework based on Theano to support recurrent neural networks, sequence to sequence models and attention models. This framework will be made available to an academic environment to allow anyone to easily apply these types of models to their own problems.

Ultimately, the two main contributions of this paper are the study of the behaviour of attention models with real valued inputs and outputs, on which there were still no reports available, and the development of the sequence to sequence Theano framework.

B. Future Work

There are a few possible future work directions this research can take. Taking into account the further development of the model concept we proposed, we suggest a few techniques that can be tried out to help mitigate the error propagation problem:

• Hybrid attention: the attention mechanism could benefit from smoothing, or from a hybrid attention mechanism, in order to prevent the attention weights from dispersing over further neighbouring frames, which can cause problems.
• Error handling network: to learn from the errors of the model's predictions and to prevent the whole model from relying on the previous target input, we suggest the introduction of a separate neural network just to deal with the prediction errors and minimize the error propagation throughout the sequence.
• Output Conditioning: in order to narrow down the output decisions, we suggest adding a text representation to condition the output. Work from [23] has reportedly achieved successful results decoding a real-valued sequence with output conditioning, by using LSTMs. The same concept could be applied to voice conversion.
• Discrete Features: another possibility would be to quantize the output features in order to be able to apply the same decision techniques during the prediction phase, such as beam search. Techniques such as vector quantization could be applied to achieve a discrete feature space. However, this approach might introduce other limitations, due to the way the quantization is done. In comparison with the error introduced by the DTW, the quantization may do more harm than good.
• Alignment Information: since the model's behaviour hints that the information provided is not enough for it to deal with the predictions, it would be interesting to explore the possibility of providing just enough information about the start and end of each phoneme. This way, the model could improve its ability to decide whether to trust its previous prediction or not. This information wouldn't significantly affect the model's performance, since it can easily be compressed.

Taking our focus away from the sequence to sequence models, there are a few interesting new developments that can be relevant for the future of voice conversion systems with deep learning. Recently, WaveNet [24] was proposed by Google's DeepMind, presenting a revolutionary state of the art technique for text-to-speech synthesis that outperforms every other current technique. It would be interesting to explore WaveNet's architecture and develop a voice conversion system on top of it. Moreover, a many-to-one voice conversion framework using DBLSTMs was proposed in [25], using Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system and without the need for parallel data, which might be worth looking into as well.

The developed deep learning framework is currently under development in order to make it more robust to user errors, and to improve overall code quality and user experience. The framework should be publicly available via GitHub soon.

APPENDIX A
L2F DEEP LEARNING: A DEEP LEARNING FRAMEWORK

In order to address the deep learning stage of this thesis, several frameworks were considered, for example, Keras [26] and Blocks [27], among others. However, the flexibility we encountered in these frameworks (at least at the time of this thesis' realization) did not meet our requirements. Therefore, we developed a small framework for personal usage, although it is currently being developed into a cleaner version for general usage.
The framework was built on top of the well known mathematical framework Theano [21], in this way supporting all the mathematical operations required by the models and layers developed. Developing right on top of Theano allows greater flexibility versus the other frameworks considered (some of which are also built on top of this mathematical framework), although this also implied that some features already present in those frameworks had to be implemented. Some of these features, such as optimizers, were borrowed and put in our project with reference and credit to where they were developed.

The proposed framework is structured into four main modules: layers, models, optimization, and utils. The layers module contains the layer class template, which we call GenericLayer, and implemented layers that are built into the framework and that should be used through a model. The models module contains the implemented models we developed for this framework, which can be trained by the user with a few lines of code. The optimization module includes everything that is related to model optimization, such as optimizer algorithms (e.g. ADAM, RMSProp, SGD), batch update training, and regularization techniques like dropout. Finally, the utils module deals with everything else that is not included in the remaining modules, such as decay functions for curriculum learning training, data handler templates and Theano variable initializations.

In our proposed framework, the user is able to implement their own layers, although there are a few implemented built-in layers:

• Attention mechanism
• Fully Connected Layer (Dense)
• Time Distributed Fully Connected Layer (Time Distributed Dense)
• GRU
• Bidirectional GRU
• LSTM
• Bidirectional LSTM
• GRU Decoder
• LSTM Decoder
• GRU Attention Decoder

In addition, the model classes used in this work are also available for immediate use, such as the DBLSTM, DBGRU and deep sequence to sequence with attention.

REFERENCES

[1] Z. Wu, "Spectral mapping for voice conversion," Ph.D. dissertation, Nanyang Technological University, March 2015.
[2] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98, Seattle, Washington, USA, May 12-15, 1998, 1998, pp. 285–288.
[3] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech & Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[4] H. Hwang, Y. Tsao, H. Wang, Y. Wang, and S. Chen, "Incorporating global variance in the training phase of GMM-based voice conversion," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2013, Kaohsiung, Taiwan, October 29 - November 1, 2013, 2013, pp. 1–6.
[5] H. T. Hwang, Y. Tsao, H. M. Wang, Y. R. Wang, and S. H. Chen, "A study of mutual information for GMM-based spectral conversion," in INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, September 9-13, 2012, 2012, pp. 78–81.
[6] E. Helander, H. Silén, T. Virtanen, and M. Gabbouj, "Voice conversion using dynamic kernel partial least squares regression," IEEE Trans. Audio, Speech & Language Processing, vol. 20, no. 3, pp. 806–817, 2012.
[7] Z. Wu, E. Chng, and H. Li, "Exemplar-based voice conversion using joint nonnegative matrix factorization," Multimedia Tools Appl., vol. 74, no. 22, pp. 9943–9958, 2015.
[8] T. Nakashika, T. Takiguchi, and Y. Ariki, "Voice conversion using speaker-dependent conditional restricted Boltzmann machine," EURASIP J. Audio, Speech and Music Processing, vol. 2015, p. 8, 2015.
[9] L. Sun, S. Kang, K. Li, and H. M. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, 2015, pp. 4869–4873.
[10] L. R. Rabiner and B. Juang, Fundamentals of Speech Recognition, ser. Prentice Hall Signal Processing Series. Prentice Hall, 1993.
[11] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, 2014, pp. 1724–1734.
[12] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[13] Y. Bengio, P. Y. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[15] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014.
[16] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis - a unified approach to speech spectral estimation," in The 3rd International Conference on Spoken Language Processing, ICSLP 1994, Yokohama, Japan, September 18-22, 1994, 1994.
[17] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[18] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 577–585.
[19] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," CoRR, vol. abs/1506.03099, 2015.
[20] J. Kominek and A. W. Black, "The CMU ARCTIC speech databases," in Fifth ISCA ITRW on Speech Synthesis, Pittsburgh, PA, USA, June 14-16, 2004, 2004, pp. 223–224.
[21] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[23] A. Graves, "Generating sequences with recurrent neural networks," CoRR, vol. abs/1308.0850, 2013.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[25] L. Sun, K. Li, H. Wang, S. Kang, and H. M. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in IEEE International Conference on Multimedia and Expo, ICME 2016, Seattle, WA, USA, July 11-15, 2016, 2016, pp. 1–6.
[26] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[27] B. van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, and Y. Bengio, "Blocks and fuel: Frameworks for deep learning," CoRR, vol. abs/1506.00619, 2015. [Online]. Available: http://arxiv.org/abs/1506.00619
