2005.03295v2
speakers from the VCTK dataset, we outperform the previous
method in terms of both naturalness and speaker similarity.
Our system can also convert speech from speakers that are
unseen during training, and utilize ASR to automate the
transcription with minimal reduction of the performance. Audio
samples are available at https://mindslab-ai.github.io/cotatron,
and the code with a pre-trained model will be made available soon.
Index Terms: voice conversion, speech synthesis, speech
representation, disentangled representation.