Voice Conversion With Deep Learning: Miguel Varela Ramos, Instituto Superior T Ecnico, Universidade de Lisboa
Voice Conversion With Deep Learning: Miguel Varela Ramos, Instituto Superior T Ecnico, Universidade de Lisboa
Abstract—Voice conversion techniques aim to modify a sub- Many techniques have been developed to tackle the VC
ject’s voice characteristics in order to sound like someone else. In problem. One of the most popular approaches is the Joint
this paper, we mainly focus on the spectral mapping component Density Gaussian Mixture Model (JD-GMM) based system
of voice conversion. Most existing deep learning based voice
conversion systems rely on a previous frame alignment step using, [2], [3], which models the joint density between data of the
for instance, dynamic time warping before the conversion of the source and target speakers using a GMM and finds local
speaker’s spectra. We propose a sequence to sequence model with linear transformations for each Gaussian used. Depending on
attention which completely avoids this explicit alignment, map- the criteria used to optimize the conversion function, the
ping spectral features with different lengths. In order to establish implementation of JD-GMM will vary. The most popular
a baseline, we implemented a deep bidirectional long short-term
memory recurrent neural network based voice conversion system, criteria are the Mean Square Error (MSE) [2] and Maximum
which is considered to achieve some of the current state of the Likelihood (ML) trajectory generation [3], which has been
art results and resorts to the feature pre-alignment step. The reported to produce better results. Nevertheless, the JD-GMM
experiments conducted revealed that our model is capable of based VC systems have two main problems: overfitting the
achieving state of the art results if allowed to have a peek into the data and over-smoothing of the converted spectra. In order to
target features, but struggles with prediction if prevented from
accessing this information. Both models are evaluated using an mitigate these problems, techniques such as Global Variance
objective metric and the poor performance during prediction is (GV) [4] and a mutual information criterion [5] were proposed.
discussed, as well as some possible solutions that can be explored. Other techniques have reported some degree of success
in performing voice conversion, and even outperforming the
Index Terms—voice conversion, deep learning, attention mod- JD-GMM approaches. Among these techniques, some of the
els, sequence to sequence, recurrent neural networks, speech most relevant are the Dynamic Kernel Partial Least Squares
processing Regression (DKPLS) [6], Exemplar based approaches [7] and
the deep learning approaches with neural networks [8], [9].
With the deep learning approaches, a more efficient way of
I. I NTRODUCTION
effectively modelling temporal dependencies in the spectral
critical analysis is presented in Section V. Finally, Section VI where i, f , o, c, c̃ refer to the input gate, forget gate, output
summarizes and presents some conclusion of this work, as gate, cell state and shadow gate, respectively. The LSTM cell
well as possible future work directions. receives information continuously from the previous states via
its ct−1 and ht−1 inputs, just like a conveyor belt. With this
II. BASELINE : D EEP B IDIRECTIONAL R ECURRENT information, combined with the cell’s parameters, the LSTM
N EURAL N ETWORKS is able to decide what to read, write and forget.
2) Gated Recurrent Unit: The GRU was proposed by Cho
A. Recurrent Neural Networks et al. (2014) in [11] as a variation of the current LSTM
An RNN is a type of neural network that has the capability architecture. While the LSTM has to coordinate its writes
of learning based on the content of previous data. Taking a real and forgets, the GRU links them explicitly into one single
world example: if we consider our data being the text content gate, called the update gate. In other words, instead of doing
of this section, a human being would process the meaning of selective writes and selective forgets, the GRU takes a step
each word based on the context of the previous words of the further, and instead does selective overwrites, by setting the
sentence. RNNs model the same behaviour with internal loops, forget gate to 1 minus the input (write) gate. The fundamental
allowing information to persist on the network. equations of the GRU are described as:
For a traditional RNN (also known as vanilla RNN), given
zt = σ(Wz · [ht−1 , xt ]) (9)
an input sequence x = (x1 , · · · , xT ), the hidden vector
h = (h1 , · · · , hT ) and the predicted output vector y = rt = σ(Wr · [ht−1 , xt ]) (10)
(y1 , · · · , yT ) can be computed from t = 1 to T according h˜t = tanh(W · [rt ∗ ht−1 , xt ]) (11)
to the following iterative equations: ht = (1 − zt ) ∗ ht−1 + zt ∗ h˜t (12)
ht = σ(Wh · [ht−1 , xt ] + bh ) (1) where xt is the current input, ht−1 is the previous hidden state,
yt = Wy · [ht−1 , xt ] + by (2) zt is the update gate, rt is the reset gate, h˜t is the shadow gate,
ht is the current hidden state, and W and b are the weight
where ht corresponds to the RNN’s hidden state, xt to the matrices and bias vectors respectively.
input, yt to the output, σ to a non-linear activation function Different studies have compared the performance of several
and W and b are weight matrices and biases, corresponding LSTM variants, reaching the conclusion that they are similar.
to the network’s parameters. However, different architectures work best on different prob-
However, vanilla RNNs have difficulties learning long-term lems [15]. One of the great advantages of the GRU is the lower
dependencies if the information is spanned through longer number of gates, making it less demanding on computation
periods of time. Training these networks with the traditional power to train.
back propagation through time (BPTT) algorithm has been 3) Deep Bidirectional RNNs: In order to improve the
proved to be extremely difficult due to the exploding and modelling of long sequences, a common technique within
vanishing gradient problems [13]. Motivated by this, the Long RNNs is the use of a bidirectional network. A bidirectional
Short-Term Memory (LSTM), and later the Gated Recurrent RNN is no more than a group of two distinct RNNs networks
Unit (GRU), a simplification of the LSTM, were proposed. that process the data from the two ends of the sequence. A
1) Long Short-term Memory: The LSTM was originally forward RNN processes the sequence regularly from start to
proposed by Hochreiter and Schmidhuber in 1997 [14] and end, while a backward RNN inverts the input’s time axis
developed by many in following works. The current form of and processes it from end to start. The outputs from both
the LSTM is capable of learning long-term dependencies and networks are then joined with a merge process. For simplicity,
works very well in a wide variety of problems. we admit the merge process to be a sum. In addition, when
While Vanilla RNNs have a very simple cell structure made dealing with complex problems better represented in a higher
of a single neural network, LSTMs have various structures dimensional feature space, it is common to add more layers to
called gates (four in total). These gates interact with each improve the model’s performance. By combining both deep
other to process information in order to better handle temporal and bidirectional concepts applied to RNNs, we obtain the
relations. Each of these gates can be viewed as a separate DBRNN model. In figure 1, we depict the operation of a two
neural network, and when combined together, they form an layer deep bidirectional RNN.
LSTM cell.
The LSTM equations replace equation 1 in the RNN oper-
ation and are described as following: B. Baseline Framework
The baseline framework adopted in this paper is based
ft = σ(Wf · [ht−1 , xt ] + bf ) (3) on the architecture proposed in [9], with Deep Bidirectional
it = σ(Wi · [ht−1 , xt ] + bi ) (4) LSTMs. However, in this work we add a few minor twists
c˜t = tanh(Wc · [ht−1 , xt ] + bc ) (5) to this approach. The DBLSTM approach consists in a 6
layer DBLSTM to map aligned Mel Generalized Cepstral
ct = ft ∗ ct−1 + it ∗ c˜t (6)
(MGC) features [16] between source and target speakers.
ot = σ(Wo · [ht−1 , xt ] + bo ) (7) Speech parameters are extracted via STRAIGHT analysis [17],
ht = ot ∗ tanh(ct ) (8) including a smooth spectrogram, fundamental frequency (F 0)
3
Fig. 1. A two layer deep bidirectional recurrent neural network. The variable
x represents the input, h represents the hidden state and ŷ represents the
network’s prediction of the target y.
16 kHz mono wave files. 49-dimensional MGC features are 1) Baseline Approach: DBRNN: The first experiment to
extracted from a real spectrogram outputted from STRAIGHT, establish the baseline of our work is to test our implementation
with a 2048 Fast Fourier Transform (FFT) window size. The of the work proposed in [9]. In this paper, the authors propose
first coefficient corresponding to the energy component of the the use of 6 layers of bidirectional LSTMs with hidden layer
Mel Cepstral features is removed before training, so that is dimensions of [49, 128, 256, 256, 128, 49]. This configuration
not modelled by the network. intends to progressively create a higher dimensional feature
Both model architectures were implemented in a custom space from layer to layer, in order to try to better represent
Python framework built on top of Theano [21] for the devel- the speaker dependent information. Nevertheless, with neural
opment of this work, due to the lack of support of sequence to networks, the choice of hidden sizes is most often an ad-
sequence and attention models in other deep learning libraries. hod process, and there is no report of tests with different
In order to feed the data to our model, we resorted to a mini- configurations in the paper.
batch processing approach, a common technique for training We tested the model with the originally proposed configura-
neural networks. Samples are grouped into mini-batches and tion to perform male to male and male to female conversions.
fed through the model in order to be processed. To enable Next, an experiment with a wider hidden size configuration of
mini-batch processing in an RNN, it is necessary to pad each [128, 256, 256, 256, 256, 128] was performed, to evaluate if
sample’s time axis with zeros, up to that batch’s maximum the model benefits from larger hidden sizes.
length. However, this process can introduce undesired effects
in our training and lead to unnecessary data modelling from TABLE I
the model. To mitigate the effect of the zero padding, we M EL CEPSTRAL DISTORTION VALUES IN D B FOR THE BASELINE DEEP
BIDIRECTIONAL LONG SHORT- TERM MEMORY RECURRENT NEURAL
use masking and bucketing techniques. Masking consists in NETWORK (DBLSTM) MODEL FOR MALE - MALE AND MALE - FEMALE
passing each sample’s length to the network in order to ‘mask‘ CONVERSIONS .
the padded time steps. Bucketing is a simple technique that
mitigates the amount of zero padding by batching similar MCD (dB)
length samples. In addition, we normalize the features of each Model Source Target Train Validation Test
utterance to zero mean and unity standard deviation. DBLSTM bdl (♂) rms (♂) 5.82 6.58 6.54
The training process was conducted using the ADAM [22] DBLSTM Wide bdl (♂) rms (♂) 5.72 6.57 6.51
DBLSTM bdl (♂) slt (♀) 6.14 6.41 6.47
optimizer, and the baseline and proposed models were trained
for 30 and 60 epochs respectively, with a batch size of 10.
An early stopping criterion with a patience mechanism was From table I, it is possible to observe that we reach values of
used in order to guarantee the best model generalization. In distortion relatively close to the ones presented in [9], which
addition, an NVidia Titan X graphics processing unit (GPU) reported MCD of around 5.5 dB, despite being slightly higher.
was used to enable faster matrix multiplications. This can be attributed to a variety of factors. The results from
[9] are only relative to a male to female conversion with
speakers AWB (male) and SLT (female). In this work, due to
an utterance incompatibility related to the different versions of
B. Objective Evaluation
the CMU ARCTIC database, it was not possible to reproduce
We evaluate the different architectures and compare them the exact same male to female conversion. Moreover, our
among each other using the Mel cepstral distortion (MCD) as models were trained for about 30 epochs, with a batch size
an objective evaluation metric. MCD is defined as follows: of 10, until the early stopping condition was reached, which
v corresponds to about 10 hours. The baseline paper approach
u N states 48 hours of training, which should correspond to more
10 u X
M CD[dB] = t2 (cd − cconverted
d )2 (18) that 30 epochs, although the GPU the authors used has less
ln 10 computation power than the Titan X, and there are no reports
d=1
of the batch size used. It is also relevant to point out that
For the baseline models that output predictions of size the original baseline results are not stated to be tested in a
different than the target, dynamic time warping between the test set, regarding the objective measure results presented.
converted and target features is used right before the evalua- Nevertheless, it is possible that a better convergence in our
tion. Performing this alignment is required since both matrices implementation without losing generalization can be reached
of the converted and target features need to have a similar if the maximum patience of the batch update is increased.
number of time steps. The proposed sequence to sequence With the results obtained from our experiments, it is possi-
attention model, however, outputs converted utterances with ble to observe that on average, the DBLSTM model performs
the same length as the target, which eliminates the need for better on male to female conversion. This fact might be
this step. attributed to the larger gap in spectrogram differences between
It is also relevant to point out that, since the energy male and female individuals that facilitate the learning process,
component is directly extracted from either the source or the opposed to less pronounced changes in timbre as in the case
target features, this component should be removed from the of a male to male conversion.
MGC before evaluating, otherwise it will introduce undesired Regarding the hidden sizes differences, it is observed that
perturbations. the model benefits from the larger hidden size configuration
7
in the male to male conversion. Despite the differences in batched together. Discarding this technique would emphasise
the MCD from the original configuration being minimal, this the performance differences from the usage of masking.
can translate in less artifacts and slightly better audio quality 4) Sequence To Sequence Attention Model: To test our pro-
in informal listening tests. Experimentation with other wider posed sequence to sequence attention model, we experimented
configurations could be done with the architecture, but that with three different scenarios, and compare them with the best
has to be left for future work. model obtained in the baseline experiments. All the scenarios
2) Baseline Approach: LSTM vs GRU: LSTMs have shown perform male to male conversions for the same source and
very good results on a wide variety of problems, but the target speakers.
recently proposed GRUs are stated to have similar results The first scenario allows the model prediction process to
to LSTMs. In addition, due to the GRU more compressed have the same information it has during training, i.e. we allow
architecture, it demands less usage of computational power the model to access yt−1 information from the target. The
and has slightly faster convergence times. In this set of ex- purpose of this experiment is to evaluate the model’s behaviour
periments, we intended to compare the performances between during training and evaluate how well it is learning from the
the DBLSTM and the deep bidirectional gated recurrent unit data.
(DBGRU) models. In the second scenario the model is not allowed to access
yt−1 from the target, and has to output its own predictions ŷt−1
TABLE II and use them iteratively to generate the whole sequence. The
M EL CEPSTRAL DISTORTION VALUES COMPARISON BETWEEN THE USE OF goal of this experiment is to evaluate the error propagation
A GRU CELL VERSUS THE USE OF AN LSTM CELL IN THE DBRNN
MODEL . of the model throughout the whole sequence, due to the
prediction errors.
MCD (dB) Finally, in the third and last scenario, the model from
Model Source Target Train Validation Test the first and second scenarios is retrained by progressively
DBLSTM bdl (♂) rms (♂) 5.82 6.58 6.54 introducing the model’s predictions instead of the target values
DBGRU bdl (♂) rms (♂) 6.09 6.82 6.77 of yt , using a scheduled sampling technique. With this final
scenario, the model’s capacity of dealing with its own errors
From table II it is possible to observe that the LSTM may be evaluated.
performed slightly better than the GRU. Nevertheless, the
GRU may be ideal in a scenario with less computational power TABLE IV
M EL CEPSTRAL DISTORTION VALUES COMPARISON MALE TO MALE
or less GPU memory, in order to obtain a similar performance CONVERSION USING THE PROPOSED SEQUENCE TO SEQUENCE ATTENTION
to what is obtained with the usage of an LSTM, but with a MODEL .
lighter setup.
3) Baseline Approach: Masking vs No Masking: In the MCD (dB)
baseline architectures, the authors did not reveal the use of Model Source Target Train Validation Test
any masking techniques for the training of their model. Since DBLSTM Wide bdl (♂) rms (♂) 5.72 6.57 6.51
the DBLSTM model processes batches of padded sequences, Seq2Seq Attention [S1] bdl (♂) rms (♂) 3.70 3.84 3.92
Seq2Seq Attention [S2] bdl (♂) rms (♂) 16.48 17.19 16.29
we argue that masking has a relevant role on improving Seq2Seq Attention [S3] bdl (♂) rms (♂) 14.99 16.11 16.25
the model’s performance and avoiding modelling unnecessary
information. We experimented with two different scenarios,
The results from the experiments are presented in table IV.
one in which we mask the training and validation padded
The sequence to sequence attention models have 6-layer of
sequences, and one in which the sequences are padded but not
depth and were trained with a [128, 256, 256, 256, 256, 128]
masked. Both scenarios were tested using a DBLSTM model
hidden size configuration. Training these models took about
with the original baseline hidden size configuration.
60 epochs and a total of 50 hours, using a batch size of 10
sentences on an NVidia Titan X GPU.
TABLE III
M EL CEPSTRAL DISTORTION VALUES COMPARISON BETWEEN THE USE OF In the first scenario (S1), it is observed that the model
MASKING WITH THE DBLSTM MODEL . outperforms the baseline approach, reaching MCD values that
are about half the distortions from the DBLSTM model.
MCD (dB) Nonetheless, these results can only be used as an indicator
Model Source Target Train Validation Test of how well the model has learned the data, which has shown
DBLSTM - masking bdl (♂) rms (♂) 5.82 6.58 6.54 to be promising, although in a real prediction situation, it is
DBLSTM - no masking bdl (♂) rms (♂) 5.85 6.61 6.55 not possible to access the true values of y.
In the second scenario (S2), it is clearly noticeable that
Observing the results shown in table III, it is clear that the model struggles and does not handle well its own errors.
masking has an impact in the model’s performance, since it The converted speech can barely be identified as speech, as
has shown lower Mel cepstral distortion values throughout it sounds muffled. This behaviour can be associated with the
all the sets it was tested on. Nevertheless, it is relevant to model relying too much on a correct input coming in from
point out that the bucketing technique was used for both yt−1 , which can immediately lead to errors in the first few time
cases, and therefore sequences with similar lengths were steps and propagate throughout the remainder of the sequence.
8
translation, which resorts to the encoder decoder and attention to rely on the previous target input, we suggest the
architectures, we explored the same model concepts and ap- introduction of a separate neural network just to deal with
plied it to our spectral mapping problem. However, there were the prediction errors and minimize the error propagation
some significant conceptual differences between the machine throughout the sequence.
translation and spectral mapping problems, such as the real • Output Conditioning: in order to narrow down the
valued outputs and the considerably higher number of time output decisions, we suggest to add a text representation
steps. to condition the output. Work from [23] has reportedly
Our architecture relies on a stack of GRU encoders to achieved successful results decoding a real-valued se-
encode the sequence into a higher dimensional representation, quence with output conditioning, by using LSTMs. The
also known as a context. The context is decoded step by step same concept could be applied to voice conversion.
by a decoder with an attention mechanism so that the model • Discrete Features: another possibility would be to quan-
can focus on different interest points from the input. While tize the output features to be able to apply the same deci-
doing this, the model is at the same time aligning both input sion techniques during the prediction phase, such as beam
and output sequences. search. Techniques such as vector quantization could be
All the experiments from both the implemented models applied to achieve a discrete feature space. However, this
were performed on the CMU ARCTIC corpus and the results approach might be introduce other limitations, due to the
were evaluated using a Mel Cepstral Distortion objective way the quantization is done. In comparison with the
measure. It was observed that our implementation of the error introduced by the DTW, the quantization may do
baseline approach achieved very similar results to the ones more harm than good.
reported in the original paper. However, no subjective measure • Alignment Information: since the model’s behaviour
was performed for comparison. hints that the information provided is not being enough
The experiments with the sequence to sequence attention for it to deal with the predictions, it would be interest-
model revealed promising results during training, when al- ing to explore the possibility of providing just enough
lowed to peek at the target information. Nonetheless, the information about the start and end of each phoneme.
experiments in a full prediction environment where target This way, the model could improve its learning ability
information was not accessible, turned out to have a poor in deciding whether to trust its previous prediction or
performance. These results were attributed to the difficulties not. This information wouldn’t affect significantly the
in error handling during the prediction stage, together with the model’s performance, since this information can easily
considerable number of time steps. Despite this turn of events, be compressed.
there is still not enough evidence to disprove the success of a Taking our focus away from the sequence to sequence
sequence to sequence model with an attention mechanism to models, there are a few interesting new developments that
deal with regression problem in a continuous feature space, as can be interesting for the future of voice conversion systems
there are more experiments that should be conducted. with deep learning. Recently, WaveNet [24] was proposed
While implementing the baseline and proposed models, we by Google’s Deepmind, presenting a revolutionary state of
developed a Python framework based on Theano to support the art technique for text-to-speech synthesis that outperforms
recurrent neural networks, sequence to sequence models and every other current techniques. It would be interesting to
attention models. This framework will be available for an explore WaveNet’s architecture and develop a voice conversion
academic environment to allow anyone to easily apply these system on top of it. Moreover, a many-to-one voice conversion
type of models to their own problems. framework using DBLSTMs was proposed in [25], using
Ultimately, the two main contributions of this paper are Phonetic PosteriorGrams (PPGs) obtained from a speaker-
the study of the behaviour of attention models with real independent automatic speech recognition (SI-ASR) system
valued inputs and outputs, on which there were still no reports and without the need for parallel data, which might be worth
available, and the development of the sequence to sequence looking into as well.
Theano framework. The developed deep learning framework is currently under
development in order to make it more robust to user errors,
improve overall code quality and user experience. The frame-
B. Future Work
work should be publicly available via github soon.
There are a few possible future work directions this research
can take. Taking into account the further development of the A PPENDIX A
model concept we proposed, we suggest a few techniques L2F D EEP L EARNING : A D EEP L EARNING F RAMEWORK
that can be tried out to help to mitigate the error propagation In order to address the deep learning stage of this thesis,
problem: several frameworks were considered, for example, Keras [26]
• Hybrid attention: the attention mechanism could benefit and Blocks [27], among others. However, the flexibility we
from a smoothing, or a hybrid attention mechanism, in encountered in these frameworks (at least at the time of this
order to block the attention weights to disperse through thesis realization) did not met our requirements. Therefore, we
further neighbouring frames which can cause problems. developed a small framework for personal usage, although it
• Error handling network: to learn from the error of is currently being developed into a cleaner version for general
the model predicted and to prevent the whole model usage.
10
The framework was built on top of the well known math- [5] H. T. Hwang, Y. Tsao, H. M. Wang, Y. R. Wang, and S. H. Chen,
ematical framework Theano [21], this way supporting all the “A study of mutual information for gmm-based spectral conversion,”
in INTERSPEECH 2012, 13th Annual Conference of the International
sound mathematical operations required by the models and Speech Communication Association, Portland, Oregon, USA, September
layers developed. Developing right on top of Theano allows 9-13, 2012, 2012, pp. 78–81.
a greater flexibility versus the other frameworks considered [6] E. Helander, H. Silén, T. Virtanen, and M. Gabbouj, “Voice conversion
using dynamic kernel partial least squares regression,” IEEE Trans.
(some are also built on top of this mathematical framework), Audio, Speech & Language Processing, vol. 20, no. 3, pp. 806–817,
although this also implied that some features already on those 2012.
frameworks had to be implemented. Some of these features, [7] Z. Wu, E. Chng, and H. Li, “Exemplar-based voice conversion using
joint nonnegative matrix factorization,” Multimedia Tools Appl., vol. 74,
such as, optimizers, were borrowed and put in our project with no. 22, pp. 9943–9958, 2015.
the reference and credit from where they were developed. [8] T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion using
The proposed framework is structured into four main mod- speaker-dependent conditional restricted boltzmann machine,” EURASIP
J. Audio, Speech and Music Processing, vol. 2015, p. 8, 2015.
ules: layers, models, optimization, and utils. The [9] L. Sun, S. Kang, K. Li, and H. M. Meng, “Voice conversion using deep
layers module contains the layer class template, which we bidirectional long short-term memory based recurrent neural networks,”
call GenericLayer, and implemented layers that are built in in 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April
into the framework and that should be used through a model. 19-24, 2015, 2015, pp. 4869–4873.
The models module contains the implemented models we [10] L. R. Rabiner and B. Juang, Fundamentals of speech recognition, ser.
developed for this framework, and which can be trained by the Prentice Hall signal processing series. Prentice Hall, 1993.
[11] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares,
user with a few lines of code. The optimization module H. Schwenk, and Y. Bengio, “Learning phrase representations using
content includes everything that is related to model optimiza- RNN encoder-decoder for statistical machine translation,” in Proceed-
tion, such as optimizer algorithms (e.g. ADAM, RMSProp, ings of the 2014 Conference on Empirical Methods in Natural Language
Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting
SGD), batch update training, and regularization techniques like of SIGDAT, a Special Interest Group of the ACL, 2014, pp. 1724–1734.
dropout. Finally, the utils module deals with everything [12] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
else that is not included in the remaining modules, such as jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
[13] Y. Bengio, P. Y. Simard, and P. Frasconi, “Learning long-term dependen-
decay functions for a curriculum learning train, data handlers cies with gradient descent is difficult,” IEEE Trans. Neural Networks,
templates and Theano variables initializations. vol. 5, no. 2, pp. 157–166, 1994.
[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
In our proposed framework, the user is able to implement Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
its own layer, although there are a few implemented built-in [15] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation
layers: of gated recurrent neural networks on sequence modeling,” CoRR, vol.
abs/1412.3555, 2014.
• Attention mechanism [16] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, “Mel-generalized
• Fully Connected Layer (Dense) cepstral analysis - a unified approach to speech spectral estimation,”
in The 3rd International Conference on Spoken Language Processing,
• Time Distributed Fully Connected Layer (Time Dis- ICSLP 1994, Yokohama, Japan, September 18-22, 1994, 1994.
tributed Dense) [17] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveign, “Restructuring
• GRU speech representations using a pitch-adaptive timefrequency smoothing
and an instantaneous-frequency-based {F0} extraction: Possible role of
• Bidirectional GRU a repetitive structure in sounds1,” Speech Communication, vol. 27, no.
• LSTM 34, pp. 187 – 207, 1999.
• Bidirectional LSTM [18] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
“Attention-based models for speech recognition,” in Advances in Neural
• GRU Decoder Information Processing Systems 28: Annual Conference on Neural
• LSTM Decoder Information Processing Systems 2015, December 7-12, 2015, Montreal,
• GRU Attention Decoder Quebec, Canada, 2015, pp. 577–585.
[19] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling
In addition, the model classes used in this work are also for sequence prediction with recurrent neural networks,” CoRR, vol.
abs/1506.03099, 2015.
available for immediate use, such as the DBLSTM, DBGRU [20] J. Kominek and A. W. Black, “The CMU arctic speech databases,” in
and deep sequence to sequence with attention. Fifth ISCA ITRW on Speech Synthesis, Pittsburgh, PA, USA, June 14-16,
2004, 2004, pp. 223–224.
[21] Theano Development Team, “Theano: A Python framework for
fast computation of mathematical expressions,” arXiv e-prints, vol.
R EFERENCES abs/1605.02688, May 2016.
[22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
[1] Z. Wu, “Spectral mapping for voice conversion,” Ph.D. dissertation, CoRR, vol. abs/1412.6980, 2014.
Nanyang Technological University, 3 2015. [23] A. Graves, “Generating sequences with recurrent neural networks,”
[2] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech CoRR, vol. abs/1308.0850, 2013.
synthesis,” in Proceedings of the 1998 IEEE International Conference [24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
on Acoustics, Speech and Signal Processing, ICASSP ’98, Seattle, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet:
Washington, USA, May 12-15, 1998, 1998, pp. 285–288. A generative model for raw audio,” CoRR, vol. abs/1609.03499v2, 2016.
[3] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on [25] L. Sun, K. Li, H. Wang, S. Kang, and H. M. Meng, “Phonetic
maximum-likelihood estimation of spectral parameter trajectory,” IEEE posteriorgrams for many-to-one voice conversion without parallel data
Trans. Audio, Speech & Language Processing, vol. 15, no. 8, pp. 2222– training,” in IEEE International Conference on Multimedia and Expo,
2235, 2007. ICME 2016, Seattle, WA, USA, July 11-15, 2016, 2016, pp. 1–6.
[4] H. Hwang, Y. Tsao, H. Wang, Y. Wang, and S. Chen, “Incorporating [26] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.
global variance in the training phase of gmm-based voice conversion,” [27] B. van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-
in Asia-Pacific Signal and Information Processing Association Annual Farley, J. Chorowski, and Y. Bengio, “Blocks and fuel: Frameworks for
Summit and Conference, APSIPA 2013, Kaohsiung, Taiwan, October 29 deep learning,” CoRR, vol. abs/1506.00619, 2015. [Online]. Available:
- November 1, 2013, 2013, pp. 1–6. http://arxiv.org/abs/1506.00619