0% found this document useful (0 votes)
58 views5 pages

Direct Speech-To-Speech Translation With A Sequence-To-Sequence Model

This document describes a direct speech-to-speech translation model called Translatotron that can translate speech from one language to another without relying on an intermediate text representation. The model is an attention-based sequence-to-sequence neural network trained end-to-end using multitask learning with speech-to-text tasks. Experiments on Spanish-to-English datasets showed the model slightly underperformed a cascade of speech-to-text and text-to-speech models, demonstrating the feasibility of the direct approach.

Uploaded by

Ljubisa Raimovic
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
58 views5 pages

Direct Speech-To-Speech Translation With A Sequence-To-Sequence Model

This document describes a direct speech-to-speech translation model called Translatotron that can translate speech from one language to another without relying on an intermediate text representation. The model is an attention-based sequence-to-sequence neural network trained end-to-end using multitask learning with speech-to-text tasks. Experiments on Spanish-to-English datasets showed the model slightly underperformed a cascade of speech-to-text and text-to-speech models, demonstrating the feasibility of the direct approach.

Uploaded by

Ljubisa Raimovic
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Direct speech-to-speech translation with a sequence-to-sequence model

Ye Jia* , Ron J. Weiss* , Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu

Google
{jiaye,ronw}@google.com

Abstract In this paper we demonstrate Translatotron1 , a direct speech-


to-speech translation model which is trained end-to-end. To
We present an attention-based sequence-to-sequence neural net-
facilitate training without predefined alignments, we leverage
work which can directly translate speech from one language into
high level representations of the source or target content in the
speech in another language, without relying on an intermediate
form of transcriptions, essentially multitask training with speech-
text representation. The network is trained end-to-end, learning
to-text tasks. However no intermediate text representation is
to map speech spectrograms into target spectrograms in another
used during inference. The model does not perform as well as a
language, corresponding to the translated content (in a different
baseline cascaded system. Nevertheless, it demonstrates a proof
canonical voice). We further demonstrate the ability to synthe-
arXiv:1904.06037v2 [cs.CL] 25 Jun 2019

of concept and serves as a starting point for future research.


size translated speech using the voice of the source speaker. We
conduct experiments on two Spanish-to-English speech transla- Extensive research has studied methods for combining dif-
tion datasets, and find that the proposed model slightly underper- ferent sub-systems within cascaded speech translation systems.
forms a baseline cascade of a direct speech-to-text translation [5, 6] gave MT access to the lattice of the ASR. [7, 8] integrated
model and a text-to-speech synthesis model, demonstrating the acoustic and translation models using a stochastic finite-state
feasibility of the approach on this very challenging task. transducer which can decode the translated text directly using
Index Terms: speech-to-speech translation, voice transfer, at- Viterbi search. For synthesis, [9] used unsupervised clustering
tention, sequence-to-sequence model, end-to-end model to find F0-based prosody features and transfer intonation from
source speech and target. [10] augmented MT to jointly predict
translated words and emphasis, in order to improve expressive-
1. Introduction ness of the synthesized speech. [11] used a neural network to
We address the task of speech-to-speech translation (S2ST): transfer duration and power from the source speech to the tar-
translating speech in one language into speech in another. This get. [12] transfered source speaker’s voice to the synthesized
application is highly beneficial for breaking down communica- translated speech by mapping hidden Markov model states from
tion barriers between people who do not share a common lan- ASR to TTS. Similarly, recent work on neural TTS has focused
guage. Specifically, we investigate whether it is possible to train on adapting to new voices with limited reference data [13–16].
model to accomplish this task directly, without relying on an in- Initial approaches to end-to-end speech-to-text translation
termediate text representation. This is in contrast to conventional (ST) [17, 18] performed worse than a cascade of an ASR model
S2ST systems which are often broken down into three compo- and an MT model. [19, 20] achieved better end-to-end perfor-
nents: automatic speech recognition (ASR), text-to-text machine mance by leveraging weakly supervised data with multitask
translation (MT), and text-to-speech (TTS) synthesis [1–4]. learning. [21] further showed that use of synthetic training data
Cascaded systems have the potential problem of errors com- can work better than multitask training. In this work we take ad-
pounding between components, e.g. recognition errors leading vantage of both synthetic training targets and multitask training.
to larger translation errors. Direct S2ST models avoid this is- The proposed model resembles recent sequence-to-sequence
sue by training to solve the task end-to-end. They also have models for voice conversion, the task of recreating an utterance
advantages over cascaded systems in terms of reduced computa- in another person’s voice [22–24]. For example, [23] proposes
tional requirements and lower inference latency since only one an attention-based model to generate spectrograms in the target
decoding step is necessary, instead of three. In addition, direct voice based on input features (spectrogram concatenated with
models are naturally capable of retaining paralinguistic and non- ASR bottleneck features) from the source voice. In contrast
linguistic information during translation, e.g. maintaining the to S2ST, the input-output alignment for voice conversion is
source speaker’s voice, emotion, and prosody, in the synthesized simpler and approximately monotonic. [23] also trains models
translated speech. Finally, directly conditioning on the input that are specific to each input-output speaker pair (i.e. one-to-
speech makes it easy to learn to generate fluent pronunciations one conversion), whereas we explore many-to-one and many-
of words which do not need to be translated, such as names. to-many speaker configurations. Finally, [25] demonstrated an
However, solving the direct S2ST task is especially chal- attention-based direct S2ST model on a toy dataset with a 100
lenging for several reasons. Fully-supervised end-to-end training word vocabulary. In this work we train on real speech, including
requires collecting a large set of input/output speech pairs. Such spontaneous telephone conversations, at a much larger scale.
data are more difficult to collect compared to parallel text pairs
for MT, or speech-text pairs for ASR or TTS. Decomposing 2. Speech-to-speech translation model
into smaller tasks can take advantage of the lower training data
requirements compared to a monolithic speech-to-speech model, An overview of the proposed Translatotron model architecture is
and can result in a more robust system for a given training budget. shown in Figure 1. Following [15, 26], it is composed of several
Uncertain alignment between two spectrograms whose underly- separately trained components: 1) an attention-based sequence-
ing spoken content differs also poses a major training challenge.
1 Audio samples are available at https://google-research.
* Equal contribution. github.io/lingvo-lab/translatotron .
speaker Speaker Vocoder
waveform Table 1: Dataset-specific model hyperparameters.
reference (English)
utterance Encoder
Multihead Spectrogram linear freq Conversational Fisher
concat spectrogram
Attention Decoder (English) Num train examples 979k 120k
8-layer phonemes
Input / output sample rate (Hz) 16k / 24k 8k / 24k
Stacked Attention 2× LSTM Decoder (English) Learning rate 0.002 0.006
BLSTM Encoder BLSTM 8×1024 8×256
phonemes
Encoder Attention 2× LSTM Decoder (Spanish) Decoder LSTM 6×1024 4×1024
Auxiliary recognition tasks
Auxiliary decoder LSTM 2×256 2×256
source / target input layer 8/8 4/6
log-mel spectrogram
(Spanish)
dropout prob 0.2 0.3
loss decay constant 1.0 0.3 → 0.001
Figure 1: Proposed model architecture, which generates English at 160k steps
speech (top right) from Spanish speech (bottom left), and an Gaussian weight noise stddev none 0.05
optional speaker reference utterance (top left) which is only used
for voice transfer experiments in Section 3.4. The model is multi-
task trained to predict source and target phoneme transcripts as uses the Adafactor optimizer [34] with a batch size of 1024.
well, however these auxiliary tasks are not used during inference. Since we are only demonstrating a proof of concept, we
Optional components are drawn in light colors. primarily rely on the low-complexity Griffin-Lim [35] vocoder
in our experiments. However, we use a WaveRNN [36] neural
vocoder when evaluating speech naturalness in listening tests.
to-sequence network (blue) which generates target spectrograms, Finally, in order to control the output speaker identity we
2) a vocoder (red) which converts target spectrograms to time- incorporate an optional speaker encoder network as in [15]. This
domain waveforms, and, 3) optionally, a pretrained speaker network is discriminatively pretrained on a speaker verification
encoder (green) which can be used to condition the decoder on task and is not updated during the training of Translatotron. We
the identity of the source speaker, enabling cross-language voice use the dvector V3 model from [37], trained on a larger set of
conversion [27] simultaneously with translation. 851K speakers across 8 languages including English and Span-
The sequence-to-sequence encoder stack maps 80-channel ish. The model computes a 256-dim speaker embedding from
log-mel spectrogram input features into hidden states which the speaker reference utterance, which is passed into a linear
are passed through an attention-based alignment mechanism to projection layer (trained with the sequence-to-sequence model)
condition an autoregressive decoder, which predicts 1025-dim to reduce the dimensionality to 16. This is critical to generalizing
log spectrogram frames corresponding to the translated speech. to source language speakers which are unseen during training.
Two optional auxiliary decoders, each with their own attention
components, predict source and target phoneme sequences. 3. Experiments
Following recent speech translation [21] and recognition
[28] models, the encoder is composed of a stack of 8 bidirectional We study two Spanish-to-English translation datasets: the large
LSTM layers. As shown in Fig. 1, the final layer output is passed scale “conversational” corpus of parallel text and read speech
to the primary decoder, whereas intermediate activations are pairs from [21], and the Spanish Fisher corpus of telephone con-
passed to auxiliary decoders predicting phoneme sequences.We versations and corresponding English translations [38], which is
hypothesize that early layers of the encoder are more likely to smaller and more challenging due to the spontaneous and infor-
represent the source content well, while deeper layers might mal speaking style. In Sections 3.1 and 3.2, we synthesize target
learn to encode more information about the target content. speech from the target transcript using a single (female) speaker
English TTS system; In Section 3.4, we use real human tar-
The spectrogram decoder uses an architecture similar to
get speech for voice transfer experiments on the conversational
Tacotron 2 TTS model [26], including pre-net, autoregressive
dataset. Models were implemented using the Lingvo frame-
LSTM stack, and post-net components. We make several
work [39]. See Table 1 for dataset-specific hyperparameters.
changes to it in order to adapt to the more challenging S2ST task.
To evaluate speech-to-speech translation performance we
We use multi-head additive attention [29] with 4 heads instead of
compute BLEU scores [40] as an objective measure of speech
location-sensitive attention, which shows better performance in
intelligibility and translation quality, by using a pretrained ASR
our experiments. We also use a significantly narrower 32 dimen-
system to recognize the generated speech, and comparing the
sional pre-net bottleneck compared to 256-dim in [26], which we
resulting transcripts to ground truth reference translations. Due
find to be critical in picking up attention during training. We also
to potential recognition errors (see Figure 2), this can be thought
use reduction factor [30] of 2, i.e. predicting two spectrogram
of as a lower bound on the underlying translation quality. We
frames for each decoding step. Finally, consistent with results
use the 16k Word-Piece attention-based ASR model from [41]
on translation tasks [19, 31], we find that using a deeper decoder
trained on the 960 hour LibriSpeech corpus [42], which obtained
containing 4 or 6 LSTM layers leads to good performance.
word error rates of 4.7% and 13.4% on the test-clean and test-
We find that multitask training is critical in solving the task, other sets, respectively. In addition, we conduct listening tests
which we accomplish by integrating auxiliary decoder networks to measure subjective speech naturalness mean opinion score
to predict phoneme sequences corresponding to the source and/or (MOS), as well as speaker similarity MOS for voice transfer.
target speech. Losses computed using these auxiliary recogni-
tion networks are used during training, which help the primary
3.1. Conversational Spanish-to-English
spectrogram decoder to learn attention. They are not used dur-
ing inference. In contrast to the primary decoder, the auxiliary This proprietary dataset described in [21] was obtained by crowd-
decoders use 2-layer LSTMs with single-head additive atten- sourcing humans to read the both sides of a conversational
tion [32]. All three decoders use attention dropout and LSTM Spanish-English MT dataset. In this section, instead of using the
zoneout regularization [33], all with probability 0.1. Training human target speech, we use a TTS model to synthesize target
Table 2: Conversational test set performance. Single reference Table 3: Performance on the Fisher Spanish-English task in
BLEU and Phoneme Error Rate (PER) of aux decoder outputs. terms of 4-reference BLEU score on 3 eval sets.

Auxiliary loss BLEU Source PER Target PER Auxiliary loss dev1 dev2 test
None 0.4 - - None 0.4 0.6 0.6
Source 42.2 5.0 - Source 7.4 8.0 7.2
Target 42.6 - 20.9 Target 20.2 21.4 20.8
Source + Target 42.7 5.1 20.8 Source + Target 24.8 26.5 25.6
ST [21] → TTS cascade 48.7 - - Source + Target (1-head attention) 23.0 24.2 23.4
Ground truth 74.7 - - Source + Target (encoder pre-training) 30.1 31.5 31.1
ST [19] → TTS cascade 39.4 41.2 41.4
speech in a single female English speaker’s voice in order to Ground truth 82.8 83.8 85.3
simplify the learning objective. We use an English Tacotron 2
TTS model [26] but use a Griffin-Lim vocoder for expediency. Table 4: Naturalness MOS with 95% confidence intervals. The
In addition, we augment the input source speech by adding back- ground truth for both datasets are synthetic English speech.
ground noise and reverberation in the same manner as [21].
The resulting dataset contains 979k parallel utterance pairs, Model Vocoder Conversational Fisher-test
containing 1.4k hours of source speech and 619 hours of syn- Translatotron WaveRNN 4.08 ± 0.06 3.69 ± 0.07
thesized target speech. The total target speech duration is much Griffin-Lim 3.20 ± 0.06 3.05 ± 0.08
smaller because the TTS output is better endpointed, and con-
tains fewer pauses. 9.6k pairs are held out for testing. ST→TTS WaveRNN 4.32 ± 0.05 4.09 ± 0.06
Input feature frames are created by stacking 3 adjacent Griffin-Lim 3.46 ± 0.07 3.24 ± 0.07
frames of an 80-channel log-mel spectrogram as in [21]. The Ground truth Griffin-Lim 3.71 ± 0.06 -
speaker encoder was not used in these experiments since the Parallel WaveNet - 3.96 ± 0.06
target speech always came from the same speaker.
Table 2 shows performance of the model trained using dif-
ferent combinations of auxiliary losses, compared to a base- more careful regularization and tuning. As shown in Table 1,
line ST → TTS cascade model using a speech-to-text translation we used narrower encoder dimension of 256, a shallower 4-
model [21] trained on the same data, and the same Tacotron 2 layer decoder, and added Gaussian weight noise to all LSTM
TTS model used to synthesize training targets. Note that the weights as regularization, as in [19]. The model was especially
ground truth BLEU score is below 100 due to ASR errors during sensitive to the auxiliary decoder hyperparameters, with the best
evaluation, or TTS failure when synthesizing the ground truth. performance coming when passing activations from intermediate
Training without auxiliary losses leads to extremely poor layers of the encoder stack as inputs to the auxiliary decoders,
performance. The model correctly synthesizes common words using slightly more aggressive dropout of 0.3, and decaying
and simple phrases, e.g. translating “hola” to “hello”. However, the auxiliary loss weight over the course of training in order to
it does not consistently translate full utterances. While it always encourage the model training to fit the primary S2ST task.
generates plausible speech sounds in the target voice, the output Experiment results are shown in Table 3. Once again using
can be independent of the input, composed of a string of non- two auxiliary losses works best, but in contrast to Section 3.1,
sense syllables. This is consistent with failure to learn to attend there is a large performance boost relative to using either one
to the input, and reflects the difficulty of the direct S2ST task. alone. Performance using only the source recognition loss is very
Integrating auxiliary phoneme recognition tasks helped reg- poor, indicating that learning alignment on this task is especially
ularize the encoder and enabled the model to learn attention, difficult without strong supervision on the translation task.
dramatically improving performance. The target phoneme PER We found that 4-head attention works better than one head,
is much higher than on source phonemes, reflecting the diffi- unlike the conversational task, where both attention mechanisms
culty of the corresponding translation task. Training using both had similar performance. Finally, as in [21], we find that pre-
auxiliary tasks achieved the best quality, but the performance training the bottom 6 encoder layers on an ST task improves
difference between different combinations is small. Overall, BLEU scores by over 5 points. This is the best performing direct
there remains a gap of 6 BLEU points to the baseline, indicating S2ST model, obtaining 76% of the baseline performance.
room for improvement. Nevertheless, the relatively narrow gap
demonstrates the potential of the end-to-end approach. 3.3. Subjective evaluation of speech naturalness
To evaluate synthesis quality of the best performing models from
3.2. Fisher Spanish-to-English Tables 2 and 3 we use the framework from [15] to crowdsource
This dataset contains about 120k parallel utterance pairs2 , span- 5-point MOS evaluations based on subjective listening tests. 1k
ning 127 hours of source speech. Target speech is synthesized examples were rated for each dataset, each one by a single rater.
using Parallel WaveNet [43] using the same voice as the previous Although this evaluation is expected to be independent of the
section. The result contains 96 hours of synthetic target speech. correctness of the translation, translation errors can result in low
Following [19], input features were constructed by stacking scores for examples raters describe as “not understandable”.
80-channel log-mel spectrograms, with deltas and accelerations. Results are shown in Table 4, comparing different vocoders
Given the small size of the dataset compared to that in Sec. 3.1, where results with Griffin-Lim correspond to identical model
we found that obtaining good performance required significantly configurations as Sections 3.1 and 3.2. As expected, using Wav-
eRNN vocoders dramatically improves ratings over Griffin-Lim
2 This is a subset of the Fisher data due to TTS errors on target text. into the “Very Good” range (above 4.0). Note that it is most fair
Source: "Qué tal, eh, yo soy Guillermo, ¿Cómo estás?" 5 Table 5: Voice transfer performance when conditioned on source,
70 4
Mel channel

60 3 ground truth target, or a random utterance in the target language.


50 2
40 1 References for MOS-similarity match the conditioning speaker.
30 0
20 1
10 2
0 3
Target: "How's it going, hey, this is Guillermo, How are you?" 5 Speaker Emb BLEU MOS-naturalness MOS-similarity
70 4
Mel channel

60 3 Source 33.6 3.07 ± 0.08 1.85 ± 0.06


50 2
40 1
30 0 Target 36.2 3.15 ± 0.08 3.30 ± 0.09
20 1
10 2 Random target 35.4 3.08 ± 0.08 3.24 ± 0.08
0 3
ST TTS: "hey i'm william how are you" 5
70 4 Ground truth 59.9 4.10 ± 0.06 -
Mel channel

60 3
50 2
40 1
30 0
20 1
10 2 strategies. The top row transfers the source speaker’s voice to the
0 3
Translatotron: "hi a i'm of the ermo how are you" 5 translated speech, while row two is a “cheating” configuration
70 4
Mel channel

60 3 since the speaker embedding can potentially leak information


50 2
40 1 about the target content to the decoder. To verify that this does
30 0
20 1 not negatively impact performance we also condition on random
10 2
0 3 target utterances in row three. In all cases performance is worse
0.0 0.5 1.0 1.5 2.0 2.5 3.0
Time (sec) than models trained on synthetic targets in Tables 2 and 4. This is
Figure 2: Mel spectrograms of input (top, upsampled to 24 kHz) because the task of synthesizing arbitrary speakers is more diffi-
and WaveRNN vocoder output (bottom) waveforms from a Fisher cult; the training targets are much noisier and training set is much
corpus example, along with ASR transcripts. Note that the spec- smaller; and the ASR model used for evaluation makes more er-
trogram scales are different to the model inputs and outputs. rors on the noisy, multispeaker targets. In terms of BLEU score,
Corresponding audio is on the companion website. the difference between conditioning on ground truth and random
targets is very small, verifying that content leak is not a concern
to compare the Griffin-Lim results to the ground truth training (in part due to the low speaker embedding dimension). However
targets since they were generated using corresponding lower conditioning on the source trails by 1.8 BLEU points, reflecting
quality vocoders. In such a comparison it is clear that the S2ST the mismatch in conditioning language between the training and
models do not score as highly as the ground truth. inference configurations. Naturalness MOS scores are close in
Finally, we note the similar performance gap between Trans- all cases. However, conditioning on the source speaker signifi-
latotron and the baseline under this evaluation. In part, this is cantly reduces similarity MOS by 1.4 points. Again this suggests
a consequence of the different types of errors made by the two that using English speaker embeddings during training does not
models. For example, Translatotron sometimes mispronounces generalize well to Spanish speakers.
words, especially proper nouns, using pronunciations from the
source language, e.g. mispronouncing the /ae/ vowel in “Dan” 4. Conclusions
as /ah/, consistent with Spanish but sounding less natural to
We present a direct speech-to-speech translation model, trained
English listeners, whereas by construction, the baseline consis-
end-to-end. We find that it is important to use speech tran-
tently projects results to English. Figure 2 demonstrates other
scripts during training, but no intermediate speech transcription
differences in behavior, where Translatotron reproduces the in-
is necessary for inference. Exploring alternate training strategies
put “eh” disfluency (transcribed as “a”, between 0.4 − 0.8 sec
which alleviate this requirement is an interesting direction for
in the bottom row of the figure), but the cascade does not. It is
future work. The model achieves high translation quality on
also interesting to note that the cascade translates “Guillermo”
two Spanish-to-English datasets, although performance is not as
to its English form “William”, whereas Translatotron speaks
good as a baseline cascade of ST and TTS models.
the Spanish name (although the ASR model mistranscribes it
as “of the ermo”), suggesting that the direct model might have In addition, we demonstrate a variant which simultaneously
a bias toward more directly reconstructing the input. Similarly, transfers the source speaker’s voice to the translated speech.
in example 7 on the companion page Translatotron reconstructs The voice transfer does not work as well as in a similar TTS
“pasejo” as “passages” instead of “tickets”, potentially reflecting context [15], reflecting the difficulty of the cross-language voice
a bias for cognates. We leave detailed analysis to future work. transfer task, as well as evaluation [44]. Potential strategies
to improve voice transfer performance include improving the
3.4. Cross language voice transfer speaker encoder by adding a language adversarial loss, or by
incorporating a cycle-consistency term [13] into the S2ST loss.
In our final experiment, we synthesize translated speech using the Other future work includes utilizing weakly supervision to
voice of the source speaker by training the full model depicted in scale up training with synthetic data [21] or multitask learning
Figure 1. The speaker encoder is conditioned on the ground truth [19,20], and transferring prosody and other acoustic factors from
target speaker during training. We use a subset of the data from the source speech to the translated speech following [45–47].
Sec. 3.1 for which we have paired source and target recordings.
Note that the source and target speakers for each pair are always
different – the data was not collected from bilingual speakers.
5. Acknowledgements
This dataset contains 606k utterance pairs, resampled to 16 kHz, The authors thank Quan Wang, Jason Pelecanos and the Google
with 863 and 493 hours of source and target speech, respectively; Speech team for providing the multilingual speaker encoder, Tom
6.3k pairs, a subset of that from Sec. 3.1, are held out for testing. Walters and the Deepmind team for help with WaveNet TTS,
Since target recordings contained noise, we apply the denoising Quan Wang, Heiga Zen, Patrick Nguyen, Yu Zhang, Jonathan
and volume normalization from [15] to improve output quality. Shen, Orhan Firat, and the Google Brain team for helpful discus-
Table 5 compares performance using different conditioning sions, and Mengmeng Niu for data collection support.
6. References [24] F. Biadsy, R. J. Weiss, P. J. Moreno, D. Kanevsky, and Y. Jia,
“Parrotron: An end-to-end speech-to-speech conversion model and
[1] A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, its applications to hearing-impaired speech and speech separation,”
T. Zeppenfeld, and P. Zhan, “JANUS-III: Speech-to-speech trans- in Proc. Interspeech, 2019.
lation in multiple languages,” in Proc. ICASSP, 1997.
[25] M. Guo, A. Haque, and P. Verma, “End-to-end spoken language
[2] W. Wahlster, Verbmobil: Foundations of speech-to-speech transla- translation,” arXiv preprint arXiv:1904.10760, 2019.
tion. Springer, 2000.
[26] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang,
[3] S. Nakamura, K. Markov, H. Nakaiwa, G.-i. Kikui, H. Kawai, T. Jit- Z. Chen, Y. Zhang et al., “Natural TTS synthesis by conditioning
suhiro, J.-S. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto, WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2017.
“The ATR multilingual speech-to-speech translation system,” IEEE
[27] A. F. Machado and M. Queiroz, “Voice conversion: A critical
Transactions on Audio, Speech, and Language Processing, 2006.
survey,” in Proc. Sound and Music Computing, 2010, pp. 1–8.
[4] International Telecommunication Union, “ITU-T F.745: Func- [28] C.-C. Chiu, T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen,
tional requirements for network-based speech-to-speech translation A. Kannan, R. Weiss, K. Rao et al., “State-of-the-art speech recog-
services,” 2016. nition with sequence-to-sequence models,” in Proc. ICASSP, 2018.
[5] H. Ney, “Speech translation: Coupling of recognition and transla- [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
tion,” in Proc. ICASSP, 1999. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
[6] E. Matusov, S. Kanthak, and H. Ney, “On the integration of speech in Proc. NeurIPS, 2017.
recognition and statistical machine translation,” in European Con- [30] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly,
ference on Speech Communication and Technology, 2005. Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le et al., “Tacotron:
[7] E. Vidal, “Finite-state speech-to-speech translation,” in Proc. Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017.
ICASSP, 1997. [31] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
[8] F. Casacuberta, H. Ney, F. J. Och, E. Vidal, J. M. Vilar et al., M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural
“Some approaches to statistical and finite-state speech-to-speech machine translation system: Bridging the gap between human and
translation,” Computer Speech and Language, vol. 18, no. 1, 2004. machine translation,” arXiv:1609.08144, 2016.
[9] P. Aguero, J. Adell, and A. Bonafonte, “Prosody generation for [32] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation
speech-to-speech translation,” in Proc. ICASSP, 2006. by jointly learning to align and translate,” in Proc. ICLR, 2015.
[33] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R.
[10] Q. T. Do, S. Sakti, and S. Nakamura, “Toward expressive speech
Ke, A. Goyal, Y. Bengio et al., “Zoneout: Regularizing RNNs by
translation: a unified sequence-to-sequence LSTMs approach for
randomly preserving hidden activations,” in Proc. ICLR, 2017.
translating words and emphasis,” in Proc. Interspeech, 2017.
[34] N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with
[11] T. Kano, S. Takamichi, S. Sakti, G. Neubig, T. Toda, and S. Naka- sublinear memory cost,” in Proc. ICML, 2018, pp. 4603–4611.
mura, “An end-to-end model for cross-lingual transformation of
paralinguistic information,” Machine Translation, pp. 1–16, 2018. [35] D. Griffin and J. Lim, “Signal estimation from modified short-time
Fourier transform,” IEEE Transactions on Acoustics, Speech, and
[12] M. Kurimo, W. Byrne, J. Dines, P. N. Garner, M. Gibson, Y. Guan, Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
T. Hirsimäki, R. Karhila, S. King, H. Liang et al., “Personalising
speech-to-speech translation in the EMIME project,” in Proc. ACL [36] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande,
2010 System Demonstrations, 2010. E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman et al., “Effi-
cient neural audio synthesis,” in Proc. ICML, 2018.
[13] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new
[37] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully
speakers based on a short untranscribed sample,” in ICML, 2018.
supervised speaker diarization,” in Proc. ICASSP, 2019.
[14] S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice [38] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch et al.,
cloning with a few samples,” in Proc. NeurIPS, 2018. “Improved speech-to-text translation with the Fisher and Callhome
[15] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen Spanish–English speech translation corpus,” in Proc. IWSLT, 2013.
et al., “Transfer learning from speaker verification to multispeaker [39] J. Shen, P. Nguyen, Y. Wu, Z. Chen et al., “Lingvo: a modular and
text-to-speech synthesis,” in Proc. NeurIPS, 2018. scalable framework for sequence-to-sequence modeling,” 2019.
[16] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, [40] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method
Q. Wang, L. C. Cobo, A. Trask, B. Laurie et al., “Sample efficient for automatic evaluation of machine translation,” in ACL, 2002.
adaptive text-to-speech,” in Proc. ICLR, 2019. [41] K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach,
[17] A. Bérard, O. Pietquin, C. Servan, and L. Besacier, “Listen and and P. Nguyen, “Model unit exploration for sequence-to-sequence
translate: A proof of concept for end-to-end speech-to-text transla- speech recognition,” arXiv:1902.01955, 2019.
tion,” in NeurIPS Workshop on End-to-end Learning for Speech [42] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech:
and Audio Processing, 2016. an ASR corpus based on public domain audio books,” in Proc.
[18] A. Bérard, L. Besacier, A. C. Kocabiyikoglu, and O. Pietquin, ICASSP, 2015.
“End-to-end automatic speech translation of audiobooks,” in Proc. [43] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,
ICASSP, 2018. K. Kavukcuoglu, G. van den Driessche et al., “Parallel WaveNet:
[19] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, Fast high-fidelity speech synthesis,” in Proc. ICML, 2018.
“Sequence-to-sequence models can directly translate foreign [44] M. Wester, J. Dines, M. Gibson, H. Liang et al., “Speaker adap-
speech,” in Proc. Interspeech, 2017. tation and the evaluation of speaker similarity in the EMIME
[20] A. Anastasopoulos and D. Chiang, “Tied multitask learning for speech-to-speech translation project,” in ISCA Tutorial and Re-
neural speech translation,” in Proc. NAACL-HLT, 2018. search Workshop on Speech Synthesis, 2010.
[21] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C.-C. Chiu, [45] Y. Lee and T. Kim, “Robust and fine-grained prosody control of
N. Ari et al., “Leveraging weakly supervised data to improve end- end-to-end speech synthesis,” in Proc. ICASSP, 2019.
to-end speech-to-text translation,” in Proc. ICASSP, 2019. [46] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg,
J. Shor et al., “Style tokens: Unsupervised style modeling, control
[22] A. Haque, M. Guo, and P. Verma, “Conditional end-to-end audio
and transfer in end-to-end speech synthesis,” in Proc. ICML, 2018.
transforms,” in Proc. Interspeech, 2018.
[47] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao,
[23] J. Zhang, Z. Ling, L.-J. Liu, Y. Jiang, and L.-R. Dai, “Sequence- Y. Jia, Z. Chen, J. Shen et al., “Hierarchical generative modeling
to-sequence acoustic modeling for voice conversion,” IEEE/ACM for controllable speech synthesis,” in Proc. ICLR, 2019.
Transactions on Audio, Speech, and Language Processing, 2019.

You might also like