
Cotatron: Transcription-Guided Speech Encoder
for Any-to-Many Voice Conversion without Parallel Data

Seung-won Park 1,2, Doo-young Kim 1,2, Myun-chul Joe 2
1 Seoul National University, 2 MINDsLab Inc.
{swpark, dykim, mcjoe}@mindslab.ai

arXiv:2005.03295v2 [eess.AS] 14 Aug 2020

Abstract

We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be trained with conventional TTS datasets. We train a voice conversion system to reconstruct speech with Cotatron features, which is similar to the previous methods based on the Phonetic Posteriorgram (PPG). By training and evaluating our system with 108 speakers from the VCTK dataset, we outperform the previous method in terms of both naturalness and speaker similarity. Our system can also convert speech from speakers that are unseen during training, and can utilize ASR to automate the transcription with minimal reduction in performance. Audio samples are available at https://mindslab-ai.github.io/cotatron, and the code with a pre-trained model will be made available soon.

Index Terms: voice conversion, speech synthesis, speech representation, disentangled representation
1. Introduction

Recent advances in voice conversion (VC) have shown potential for a wide variety of applications, such as enhancement of impaired speech or entertainment purposes. To switch the source speech's speaker identity to that of the target speaker, the system should be able to encode speaker-independent (e.g., linguistic) features from the given speech, and then pair them with a speaker representation to reconstruct the speech. The Phonetic Posteriorgram (PPG) [1], a speaker-independent feature extracted with speaker-independent ASR, has been widely used for non-parallel voice conversion [2, 3, 4, 5]. However, PPG-based methods usually required additional acoustic features from audio analysis, which may indicate that the PPG itself is insufficient to encode the rich linguistic features of human speech.

One way to encode speaker-independent features without discarding essential factors of the speech is to train a speech encoder with some restrictions. For example, Qian et al. [6] showed that an autoencoder with a carefully tuned bottleneck can effectively encode speaker-independent features without losing content information. Other prior works on restricting the speech encoder include propagating a reversed gradient from a speaker classifier [7], applying instance normalization [8], quantizing the representation [9, 10, 11], and training a conditioned flow-based generative model [12]. However, these methods required the model to discover linguistic representations by itself, even though transcriptions were available within the dataset.

Initial attempts at incorporating text supervision into voice conversion systems trained an auxiliary ASR decoder [13, 14, 15] or shared model weights with TTS [16]. Unfortunately, these methods only dealt with a limited number of speakers or required huge amounts of data for each speaker, making their effectiveness in real-world applications questionable.

In this paper, we propose Cotatron, a transcription-guided speech encoder based on a pre-trained multispeaker TTS model [17, 18]. Cotatron encodes an arbitrary speaker's speech into speaker-independent linguistic features, which are fed to a decoder for non-parallel, any-to-many voice conversion. Our Cotatron-based voice conversion system outperforms the previous state-of-the-art method, Blow [12], in terms of both naturalness and speaker similarity scores in a user study, when trained and evaluated with 108 speakers from the VCTK dataset [19].

Figure 1: Cotatron architecture. The alignment between the mel spectrogram and its transcription is obtained via a pre-trained multispeaker TTS (Tacotron2) and then combined with the text encoding to extract speaker-independent linguistic features. Spk. denotes the speaker encoder.

2. Approach

2.1. Speaker-independent linguistic features from TTS

Cotatron is guided by a transcription to extract speaker-independent linguistic features from the speech. Cotatron's basic architecture is identical to multispeaker Tacotron2 [17, 18]; it jointly learns to align and predict the next mel frame from the text encoding, the previous mel frames, and the speaker representation:

    M̂_1:i, A_i = Decoder_tts(Encoder_text(T), M_0:i−1, z^id),    (1)

where T, M, A, and z^id correspond to the text, log mel spectrogram, alignment, and speaker representation, respectively.

After training, a simple yet effective trick is applied. An alignment A between the speech and the transcription is obtained by feeding all frames of the mel spectrogram into Cotatron with teacher forcing applied. Then, the speaker-independent linguistic features of the speech are obtained from a matrix multiplication of the alignment and the text encoding, as in Fig. 1:

    L = matmul(A, Encoder_text(T)).    (2)

By definition, the text encoding contains no speaker information. Besides, the text-audio alignment A is a set of scalar coefficients for a weighted summation over the encoder timesteps of the text encoding. Hence, we may argue that the Cotatron features L do not explicitly contain the source speaker's information. We show the degree of speaker disentanglement in Sec. 4.3.

Cotatron features are naturally adequate for synthesizing speech from a large number of speakers; the features can be interpreted as context vectors for Tacotron2's attention mechanism, which are already optimized for multispeaker speech synthesis. We further extend the coverage of source speakers to arbitrary speakers by replacing the embedding table with an encoder for the speaker representation z^id. The speaker encoder is composed of 6 layers of 2D CNN, following the reference encoder architecture from Skerry-Ryan et al. [20]. Each layer has a 3 × 3 kernel and 2 × 2 stride with 32, 32, 64, 64, 128, 128 channels. The CNN output is flattened and passed through a 256-unit GRU to obtain the fixed-length speaker representation from its final state.
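To make Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of the feature-extraction trick. The `text_encoder` and `tts_decoder` arguments stand in for the text encoder and autoregressive decoder of a pre-trained multispeaker Tacotron2; their interfaces are illustrative assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def extract_cotatron_features(text_encoder, tts_decoder, tokens, mel, z_id):
    """tokens: (B, N) token ids, mel: (B, T, 80) log mel frames, z_id: (B, D) speaker rep."""
    text_enc = text_encoder(tokens)                       # (B, N, C) text encoding
    # Teacher forcing: the decoder input at step i is the ground-truth frame i-1.
    mel_in = torch.cat([torch.zeros_like(mel[:, :1]), mel[:, :-1]], dim=1)
    # A teacher-forced pass yields one attention weight vector per decoder step,
    # i.e. the text-audio alignment A with shape (B, T, N), as in Eq. (1).
    _, alignment = tts_decoder(text_enc, mel_in, z_id)
    # Eq. (2): weighted sum over encoder timesteps gives the speaker-independent L.
    return torch.bmm(alignment, text_enc)                 # (B, T, C)
```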
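The reference-encoder-style speaker encoder described above can be sketched as follows. The layer count, kernel, stride, and channel widths follow the text; padding, normalization, and activation choices are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Sketch of the 6-layer 2D CNN + GRU speaker encoder (assumptions noted inline)."""
    def __init__(self, n_mels=80, gru_units=256):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3,
                          stride=2, padding=1),           # 3x3 kernel, 2x2 stride
                nn.BatchNorm2d(channels[i + 1]),           # assumption: BN + ReLU per layer
                nn.ReLU())
            for i in range(6)])
        freq = n_mels
        for _ in range(6):                                 # frequency axis after six stride-2 convs
            freq = (freq - 1) // 2 + 1                     # 80 -> 40 -> 20 -> 10 -> 5 -> 3 -> 2
        self.gru = nn.GRU(channels[-1] * freq, gru_units, batch_first=True)

    def forward(self, mel):                                # mel: (B, T, n_mels)
        x = self.convs(mel.unsqueeze(1))                   # (B, 128, T', F')
        x = x.permute(0, 2, 1, 3).flatten(2)               # (B, T', 128 * F')
        _, h = self.gru(x)                                 # fixed-length final GRU state
        return h.squeeze(0)                                # (B, gru_units) -> z_id
```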
Figure 2: Network architectures. (a) Voice conversion system with Cotatron; (b) residual encoder; (c) VC decoder. s, c, k denote the stride, number of channels, and kernel size of a convolution layer, respectively. The speaker representation y^id conditions the VC decoder via a conditional batch normalization layer within the residual blocks. Refer to Binkowski et al. [22] for the detailed architecture of the GBlock.

2.2. Voice conversion

2.2.1. Residual Encoder

Consider a decoder that reconstructs speech from Cotatron features. Even when the rhythm of the transcription is given via the Cotatron features, other components of the speech may vary. For example, the intonation may vary within speech of the same text with identical rhythm. It is therefore insufficient for the decoder to use only the Cotatron features and the speaker representation. To fill this information gap, we design an encoder that provides the decoder with a residual feature R.

The residual encoder (Fig. 2b) is built with 6 layers of 2D CNN like the speaker encoder, but strides are not applied across time, so that the temporal dimension of the mel spectrogram is preserved. Each layer has a 3 × 3 kernel with 2 × 1 stride and 32, 32, 64, 64, 128, 128 channels. If the dimension of the residual features is too wide, the residual encoder may learn to cheat by encoding information tied to the individual speaker, e.g., absolute pitch. We find that a single-channel output helps to prevent the residual features from containing characteristics of the individual speaker, and is enough to represent the residual information of the speech; this approach was also used by Lian et al. [4]. After projecting to a single channel, instance normalization [21] is applied to prevent the residual representation from containing speaker-dependent information. Finally, the values are smoothed after a tanh activation by applying convolution with a Hann window of size 21.

2.2.2. VC Decoder

The decoder for voice conversion (Fig. 2c) is trained to reconstruct the mel spectrogram from a given pair of inputs: the Cotatron features L and the residual feature R are concatenated channel-wise, and then conditioned on a 256-dimensional speaker embedding y^id retrieved from a lookup table as

    M_s→∗ = Decoder_vc(concat(L_s, R_s), y_∗^id).    (3)

The asterisk symbol can be either s or t, representing the source or target of the voice conversion. Thus, M_s→s denotes reconstruction, and M_s→t denotes voice conversion from s to t.

Following the model architecture of GAN-TTS [22], the VC decoder is constructed with a stack of four GBlocks without upsampling. The GBlocks have 512, 384, 256, and 192 channels, respectively. For speaker conditioning, the embedding of the target speaker y^id is injected via a conditional batch normalization layer [23] within the GBlocks, after an affine transformation. We empirically observed that concatenation of the speaker embedding leads to worse results. Neither hyper-conditioning [24] nor weight demodulation [25] helped. There may be room for improvement in the design of the decoder architecture, but we leave this as future work since it is beyond the scope of this paper.

Note that the VC decoder is only trained to reconstruct the mel spectrogram from representations of the identical speaker. Though it is possible to directly train the conversion in an adversarial manner, we show the effectiveness of Cotatron on voice conversion using only a reconstruction loss.
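A minimal PyTorch sketch of the residual encoder described in Sec. 2.2.1 follows. The stride pattern, single-channel projection, instance normalization, tanh, and Hann smoothing come from the text; padding and the use of BatchNorm/ReLU inside the convolutional stack are assumptions.

```python
import torch
import torch.nn as nn

class ResidualEncoder(nn.Module):
    """Sketch of the residual encoder (Fig. 2b); details not given in the text are assumed."""
    def __init__(self, n_mels=80, hann_size=21):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=3,
                          stride=(2, 1), padding=1),      # downsample frequency, keep time
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU())
            for i in range(6)])
        freq = n_mels
        for _ in range(6):
            freq = (freq - 1) // 2 + 1                     # 80 -> 2 after six stride-2 steps
        self.proj = nn.Linear(channels[-1] * freq, 1)      # project down to a single channel
        self.norm = nn.InstanceNorm1d(1)                   # strip speaker-dependent statistics
        hann = torch.hann_window(hann_size)
        self.register_buffer("hann", (hann / hann.sum()).view(1, 1, -1))

    def forward(self, mel):                                # mel: (B, T, n_mels)
        x = self.convs(mel.unsqueeze(1).transpose(2, 3))   # (B, 128, F', T)
        x = x.permute(0, 3, 1, 2).flatten(2)               # (B, T, 128 * F')
        r = torch.tanh(self.norm(self.proj(x).transpose(1, 2)))  # (B, 1, T)
        # Temporal smoothing with a fixed Hann window of size 21.
        return nn.functional.conv1d(r, self.hann, padding=self.hann.shape[-1] // 2)
```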
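The speaker-conditional batch normalization used inside the GBlocks (Sec. 2.2.2) can be sketched as below: the target speaker embedding passes through an affine transformation that predicts the scale and shift of a parameter-free batch normalization. The initialization and layer sizes are illustrative assumptions; see Binkowski et al. [22] and Dumoulin et al. [23] for the original formulations.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """Batch norm whose gamma/beta are affine functions of the speaker embedding."""
    def __init__(self, num_features, embedding_dim=256):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.to_gamma = nn.Linear(embedding_dim, num_features)
        self.to_beta = nn.Linear(embedding_dim, num_features)
        nn.init.zeros_(self.to_gamma.weight)               # start near identity: gamma ~ 1
        nn.init.ones_(self.to_gamma.bias)
        nn.init.zeros_(self.to_beta.weight)                # and beta ~ 0
        nn.init.zeros_(self.to_beta.bias)

    def forward(self, x, y_id):                            # x: (B, C, T), y_id: (B, embedding_dim)
        gamma = self.to_gamma(y_id).unsqueeze(-1)          # (B, C, 1)
        beta = self.to_beta(y_id).unsqueeze(-1)
        return gamma * self.bn(x) + beta
```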
3. Experimental setup

3.1. Dataset

Our voice conversion system is trained and evaluated with the VCTK dataset [19], which consists of 46 hours of English speech from 108 speakers. Similar to Blow [12], we split the data into train, validation, and test splits by randomly selecting 80%, 10%, and 10% of the data, respectively. To prevent overlap of transcriptions between data splits, the data are split with respect to their transcriptions, not the number of files.

To stabilize the training of the multispeaker TTS, we incorporate a subset of LibriTTS [26], a dataset specialized for training TTS systems. Speakers with more than 5 minutes of speech are chosen from the LibriTTS train-clean-100 subset.

Audio clips longer than 10 seconds are not used for training, to allow efficient batching. The audio is resampled to 22.05 kHz and then normalized without silence removal. The statistics of the datasets are shown in Table 1.

Table 1: Dataset statistics. For the LibriTTS train-clean-100 split, speakers with less than 5 minutes of speech are removed.

Dataset                           # speakers   Length (h)
VCTK [19] train / val / test      108          34.6 / 4.5 / 4.2
LibriTTS [26] train-clean-100     123          23.4
LibriTTS [26] dev-clean           40           9.0
LibriTTS [26] test-clean          39           8.6

3.2. Training

3.2.1. Cotatron

Cotatron is first trained with the aforementioned subset of LibriTTS, which is based on the train-clean-100 split. Then, the model is transferred to learn from both the LibriTTS and VCTK train splits. To enhance the stability of text-audio alignment learning, the autoregressive decoder is teacher-forced with a rate of 0.5, i.e., the input mel frame is randomly selected from either the ground-truth frame or the previously generated frame. Furthermore, we find it helpful to train an extra MLP with dropout for speaker classification on top of z^id from the speaker encoder, using a cross-entropy loss L_id. Overall, Cotatron is trained with the sum of the mel spectrogram reconstruction loss and the speaker classification loss:

    L_cotatron = ‖M̂_s,pre − M_s‖₂² + ‖M̂_s,post − M_s‖₂² + L_id,    (4)

where M̂_s,pre and M̂_s,post denote the output before and after Cotatron's post-net [17], respectively.

Throughout the training process, the Adam optimizer [27] is used with batch size 64. The initial learning rate of 3 × 10^-4 is used for the first 25k steps and then exponentially decayed to 1.5 × 10^-5 over the next 25k steps. After the model converges with LibriTTS, we add VCTK and reuse the learning rate decay scheme. A weight decay of 1 × 10^-6 is used for the Adam optimizer, and the gradient is clipped to 1.0 to prevent gradient explosion.

3.2.2. Mel-Spectrogram Reconstruction

After the training of Cotatron, the remaining components of the voice conversion system are trained on top of the Cotatron features. The residual encoder and the VC decoder are jointly trained with a mel spectrogram reconstruction loss:

    L_vc = ‖M_s→s − M_s‖₂².    (5)

During the reconstruction training phase, Cotatron is set to evaluation mode; all dropout layers are turned off, and the autoregressive decoder is always teacher-forced to provide consistent features for the VC decoder. The Adam optimizer with a constant learning rate of 3 × 10^-4 is used with weight decay 1 × 10^-6 and batch size 128. Gradient clipping is not used here.

3.3. Conversion

To convert one voice to another, we first extract the speaker-independent features, L_s and R_s, from the source speech with Cotatron and the residual encoder, respectively. Then, the embedding of the target speaker y_t^id is retrieved from the lookup table. Finally, the pair of speaker-independent features and the target speaker embedding is used to produce a converted mel spectrogram, M_s→t. The resulting mel spectrogram is then inverted into raw audio using MelGAN [28], which is trained with the LibriTTS train split and then fine-tuned with the entire VCTK dataset.

3.4. Implementation details

For robust alignment stability against length variation, we apply the Dynamic Convolution Attention (DCA) mechanism [29]. The speaker representation is extracted from the ground-truth mel spectrogram with the speaker encoder, and then repeatedly concatenated with the text encoder output to feed the autoregressive decoder of Cotatron. For both Cotatron and the voice conversion system, the training data is augmented with representation mixing [30], i.e., graphemes are randomly replaced with phonemes if the word is available in CMUdict [31]. Both decoders produce an 80-bin log mel spectrogram, which is computed from 22.05 kHz raw audio using an STFT with window size 1024, hop size 256, a Hann window, and a mel filterbank spanning from 70 Hz to 8000 Hz. The voice conversion systems are implemented with PyTorch [32] and trained for 10 days on two NVIDIA V100 (32 GB) GPUs using data parallelism.

3.5. Evaluation metrics

We validate the effectiveness of our method with both subjective and objective metrics, using 100 and 10,000 audio samples per measurement, respectively.

Mean Opinion Score (MOS). To assess the naturalness of converted speech, we measure the mean opinion score (MOS) on a 5-point scale at Amazon Mechanical Turk (MTurk). A total of 100 audio samples are generated for each case with a random pair of source speech and target speaker, covering all possible gender combinations. The audio samples from our method and the natural speech are downsampled to 16 kHz to match the results from Blow [12]. Each sample is assigned to 5 human listeners, and the highest and lowest scores are discarded.

Degradation Mean Opinion Score (DMOS). Another user study is done to assess speaker similarity between the converted speech and the target speaker's original recording. The degradation mean opinion score (DMOS) on a 5-point scale is measured at MTurk with the same settings as the MOS experiment.

Speaker Classification Accuracy (SCA). Our system should be able to fool a speaker classifier as if the converted speech were spoken by the target speaker. The speaker classifier is an MFCC-based single-layer classifier, identical to the one used with Blow [12] for a fair comparison. The classifier is trained with the 108 speakers from the VCTK train split and achieves 99.4% top-1 accuracy on the test split. The MFCC is directly calculated from the log mel spectrogram if possible.

Voicing Decision Error (VDE). As a proxy metric for content consistency between source and converted speech, we measure the rate at which their voicing decisions differ, adapting a metric from end-to-end prosody transfer for speech synthesis [20]. The voicing decision is obtained via rVAD [33] with the VAD threshold set to 0.7.
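As a concrete illustration of the conversion procedure in Sec. 3.3, the following sketch wires the pieces together. The names `cotatron`, `residual_encoder`, `vc_decoder`, `speaker_table`, and `melgan` are placeholders for pre-trained modules with assumed interfaces, not the released code.

```python
import torch

@torch.no_grad()
def convert_voice(cotatron, residual_encoder, vc_decoder, speaker_table, melgan,
                  source_tokens, source_mel, target_speaker_id):
    # Assumes L_s and R_s come out as time-aligned, channel-last tensors.
    L_s = cotatron(source_tokens, source_mel)        # speaker-independent linguistic features
    R_s = residual_encoder(source_mel)               # residual (e.g., intonation) track
    y_t = speaker_table(target_speaker_id)           # target speaker embedding from the lookup table
    features = torch.cat([L_s, R_s], dim=-1)         # channel-wise concatenation, as in Eq. (3)
    mel_converted = vc_decoder(features, y_t)        # M_s->t
    return melgan(mel_converted)                     # invert the mel spectrogram to raw audio
```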
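The mel analysis described in Sec. 3.4 (80-bin log mel, 22.05 kHz audio, window 1024, hop 256, Hann window, 70-8000 Hz filterbank) can be reproduced, for example, with torchaudio; the choice of library and the log floor are our assumptions, not the paper's.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, win_length=1024, hop_length=256,
    f_min=70.0, f_max=8000.0, n_mels=80,
    window_fn=torch.hann_window, power=1.0)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) at 22.05 kHz -> (80, num_frames) log mel spectrogram."""
    mel = mel_transform(waveform).clamp(min=1e-5)    # floor to avoid log(0); value is an assumption
    return torch.log(mel).squeeze(0)
```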
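A small sketch of the VDE computation: given per-frame voicing decisions for the source and converted utterances (assumed to come from rVAD [33]), the metric is simply the fraction of frames on which they disagree.

```python
import numpy as np

def voicing_decision_error(src_voiced: np.ndarray, cvt_voiced: np.ndarray) -> float:
    """Fraction of frames where the voicing decisions of source and converted speech differ."""
    n = min(len(src_voiced), len(cvt_voiced))        # guard against off-by-one frame counts
    return float(np.mean(src_voiced[:n] != cvt_voiced[:n]))
```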
4. Results

4.1. Many-to-many conversion

We compare our system with Blow [12], which is the only work to date on many-to-many voice conversion with all speakers of VCTK. As presented in Table 2, our system shows significantly better results on both MOS and DMOS than Blow, even when only the Cotatron features are used without the residual features. Incorporating the residual encoder into our system further enhances the MOS. It should, however, be noted that the objective results on speaker similarity (SCA) contradict the subjective results (DMOS). Future work should therefore revisit and establish objective speaker similarity metrics for voice conversion systems.

Table 2: Results of many-to-many voice conversion.

Approach                        MOS           DMOS          SCA
Source as target                4.28 ± 0.11   1.71 ± 0.22   0.9%
Target as target                4.28 ± 0.11   4.78 ± 0.08   99.4%
Blow                            2.41 ± 0.14   1.95 ± 0.16   86.8%
Cotatron (ours), w/o residual   3.18 ± 0.14   4.06 ± 0.17   73.3%
Cotatron (ours), full model     3.41 ± 0.14   3.89 ± 0.18   78.5%

4.2. Any-to-many conversion and the use of ASR

Considering the technical demands of real-world applications, we further explore the generalization power of our voice conversion system. First, we consider the any-to-many setting, i.e., converting arbitrary speakers' speech to that of speakers seen during training. Next, we inspect the reliability of using ASR transcription, which enables a fully automatic pipeline without manual transcription. For the any-to-many conversion experiment, we randomly sample speech from the LibriTTS test-clean split and convert it into speakers of VCTK. For ASR, wav2letter++ [34, 35] is used.

In Table 3, we present the MOS, SCA, and VDE for all possible cases of input. First, all of the MOS results are much better than the previous method in Table 2, though the scores from the any-to-many setting are slightly lower than those of the many-to-many setting. Next, the differences in SCA across the cases are negligible, and the values of VDE are minimal when considering the accuracy of the VAD module. These results suggest that the conversion quality is largely unaffected by using (1) source speech from speakers that are unseen during training, and/or (2) automated transcription from ASR. Besides, it is surprising to observe that the word errors of the automated transcription do not damage the performance; this would seem to suggest that most of the transcription errors originate from homophones, e.g., site is often wrongly transcribed as sight.

Table 3: Results of any-to-many conversion and of using ASR transcription. The values are expected to be similar across the rows.

Input / Transcription                    MOS           SCA     VDE
VCTK test → VCTK test (many-to-many)
  1-a. ground truth                      3.41 ± 0.14   78.5%   2.98%
  1-b. ASR (WER 12.6%)                   3.44 ± 0.12   77.8%   3.03%
LibriTTS test-clean → VCTK test (any-to-many)
  2-a. ground truth                      2.84 ± 0.14   73.6%   11.9%
  2-b. ASR (WER 7.0%)                    2.83 ± 0.15   71.7%   11.7%

4.3. Degree of disentanglement

To quantify the degree of speaker disentanglement of the features from Cotatron and the residual encoder, we additionally train a neural network for classifying the speakers of the VCTK dataset from a given set of features. In the case of ideal speaker disentanglement, the SCA will be close to that of random guessing: 0.9%. Each classification network is built with 4 layers of 1D CNN with batch normalization, followed by a temporal max-pooling layer and an MLP with dropout.

As shown in Table 4, the SCA from the Cotatron features, with or without the residual features, is significantly lower than that from the source mel spectrogram. These results indicate that our method effectively disentangles the speaker's identity from the speech, though it is worth noting that the network was still slightly able to guess the speaker using only the Cotatron features.

Table 4: Degree of speaker disentanglement.

Input Feature   Random   Ls      (Ls, Rs)   Ms
SCA             0.9%     35.2%   54.0%      97.9%
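For completeness, here is a sketch of the probing classifier described in Sec. 4.3; the kernel size, channel width, and dropout rate are assumptions, since the text only specifies the overall structure (4 x Conv1d + BN, temporal max pooling, MLP with dropout).

```python
import torch
import torch.nn as nn

class SpeakerProbe(nn.Module):
    """Probe that predicts the speaker from a feature sequence (B, in_channels, T)."""
    def __init__(self, in_channels, n_speakers=108, hidden=256):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):                                 # 4 x (Conv1d + BatchNorm + ReLU)
            layers += [nn.Conv1d(c, hidden, kernel_size=3, padding=1),
                       nn.BatchNorm1d(hidden), nn.ReLU()]
            c = hidden
        self.convs = nn.Sequential(*layers)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Dropout(0.5), nn.Linear(hidden, n_speakers))

    def forward(self, feats):
        h = self.convs(feats).max(dim=-1).values           # temporal max pooling
        return self.mlp(h)                                 # (B, n_speakers) logits
```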
5. Discussion

In this paper, we proposed Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation, which is based on the multispeaker Tacotron2 architecture. Our Cotatron-based voice conversion system reaches state-of-the-art performance in terms of both naturalness and speaker similarity on conversion across 108 speakers from the VCTK dataset, and shows promising results on conversion from arbitrary speakers. Even when automated transcription containing errors is fed to the system, the performance remains the same.

To the best of our knowledge, Cotatron is the first model to encode a speaker-independent linguistic representation by explicitly aligning the transcription with the given speech. This could open a new path towards multi-modal approaches for speech processing tasks where usually only the speech modality has been used. For example, one may consider training a transcription-guided speech enhancement system based on Cotatron features. Furthermore, traditional speech features that have been utilized for lip motion synthesis could possibly be replaced with Cotatron features to incorporate the transcription for better quality.

Still, there is plenty of room for improvement in the voice conversion system with Cotatron. Despite our careful design choices, the residual encoder seems to provide speech features that are entangled with speaker identity, which may harm the conversion quality or even cause mispronunciation issues. Besides, the method for conditioning the target speaker's representation could be changed; e.g., utilizing a pre-trained speaker verification network as a speaker encoder may enable any-to-any conversion with our system.

6. Acknowledgments

The authors would like to thank Gaku Kotani from the University of Tokyo, June Young Yi and Junhyeok Lee from MINDsLab Inc., and other reviewers who elected to remain anonymous for providing beneficial feedback on the initial version of this paper.

7. References

[1] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, "Phonetic posteriorgrams for many-to-one voice conversion without parallel data training," in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1-6.
[2] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, "Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5274-5278.
[3] C.-c. Yeh, P.-c. Hsu, J.-c. Chou, H.-y. Lee, and L.-s. Lee, "Rhythm-flexible voice conversion without parallel data using cycle-GAN over phoneme posteriorgram sequences," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 274-281.
[4] Z. Lian, J. Tao, Z. Wen, B. Liu, Y. Zheng, and R. Zhong, "Towards fine-grained prosody control for voice conversion," arXiv preprint arXiv:1910.11269, 2019.
[5] H. Lu, Z. Wu, D. Dai, R. Li, S. Kang, J. Jia, and H. Meng, "One-shot voice conversion with global speaker embeddings," Proc. Interspeech 2019, pp. 669-673, 2019.
[6] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, "AutoVC: Zero-shot voice style transfer with only autoencoder loss," in Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 2019, pp. 5210-5219.
[7] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," arXiv preprint arXiv:1804.02812, 2018.
[8] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, "One-shot voice conversion by separating speaker and content representations with instance normalization," arXiv preprint arXiv:1904.05742, 2019.
[9] A. van den Oord, O. Vinyals et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306-6315.
[10] S. Ding and R. Gutierrez-Osuna, "Group latent embedding for vector quantized variational autoencoder in non-parallel voice conversion," Proc. Interspeech 2019, pp. 724-728, 2019.
[11] A. T. Liu, P.-c. Hsu, and H.-y. Lee, "Unsupervised end-to-end learning of discrete linguistic units for voice conversion," Proc. Interspeech 2019, pp. 1108-1112, 2019.
[12] J. Serrà, S. Pascual, and C. S. Perales, "Blow: A single-scale hyperconditioned flow for non-parallel raw-audio voice conversion," in Advances in Neural Information Processing Systems, 2019, pp. 6790-6800.
[13] J. Zhang, Z. Ling, Y. Jiang, L. Liu, C. Liang, and L. Dai, "Improving sequence-to-sequence voice conversion by adding text-supervision," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6785-6789.
[14] J. Zhang, Z. Ling, and L.-R. Dai, "Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.
[15] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanvesky, and Y. Jia, "Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation," arXiv preprint arXiv:1904.04169, 2019.
[16] M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, "Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet," Proc. Interspeech 2019, pp. 1298-1302, 2019.
[17] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779-4783.
[18] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems, 2018, pp. 4480-4490.
[19] J. Yamagishi, C. Veaux, K. MacDonald et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.
[20] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," in International Conference on Machine Learning, 2018, pp. 4693-4702.
[21] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4105-4113.
[22] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, "High fidelity speech synthesis with adversarial networks," in International Conference on Learning Representations (ICLR), 2020.
[23] V. Dumoulin, J. Shlens, and M. Kudlur, "A learned representation for artistic style," in International Conference on Learning Representations (ICLR), 2017.
[24] D. Ha, A. Dai, and Q. V. Le, "HyperNetworks," in International Conference on Learning Representations (ICLR), 2017.
[25] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, "Analyzing and improving the image quality of StyleGAN," arXiv preprint arXiv:1912.04958, 2019.
[26] H. Zen, R. Clark, R. J. Weiss, V. Dang, Y. Jia, Y. Wu, Y. Zhang, and Z. Chen, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," Proc. Interspeech 2019, pp. 1526-1530, 2019.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[28] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019, pp. 14881-14892.
[29] E. Battenberg, R. Skerry-Ryan, S. Mariooryad, D. Stanton, D. Kao, M. Shannon, and T. Bagby, "Location-relative attention mechanisms for robust long-form speech synthesis," arXiv preprint arXiv:1910.10288, 2019.
[30] K. Kastner, J. F. Santos, Y. Bengio, and A. Courville, "Representation mixing for TTS synthesis," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5906-5910.
[31] R. L. Weide, "The CMU pronouncing dictionary," URL: http://www.speech.cs.cmu.edu/cgibin/cmudict, 1998.
[32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024-8035.
[33] Z.-H. Tan, N. Dehak et al., "rVAD: An unsupervised segment-based robust voice activity detection method," Computer Speech & Language, vol. 59, pp. 1-21, 2020.
[34] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, "Fully convolutional speech recognition," arXiv preprint arXiv:1812.06864, 2018.
[35] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "wav2letter++: A fast open-source speech recognition system," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6460-6464.
