Direct Speech-To-Speech Translation With A Sequence-To-Sequence Model
Ye Jia*, Ron J. Weiss*, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, Yonghui Wu
Google
{jiaye,ronw}@google.com
Table 2: Conversational test set performance (BLEU and source/target phoneme error rates).

Auxiliary loss            BLEU   Source PER   Target PER
None                       0.4      -            -
Source                    42.2     5.0           -
Target                    42.6      -           20.9
Source + Target           42.7     5.1          20.8
ST [21] → TTS cascade     48.7      -            -
Ground truth              74.7      -            -

Table 3: Fisher Spanish-to-English BLEU scores on the two dev sets and the test set.

Auxiliary loss                            dev1   dev2   test
None                                       0.4    0.6    0.6
Source                                     7.4    8.0    7.2
Target                                    20.2   21.4   20.8
Source + Target                           24.8   26.5   25.6
Source + Target (1-head attention)        23.0   24.2   23.4
Source + Target (encoder pre-training)    30.1   31.5   31.1
ST [19] → TTS cascade                     39.4   41.2   41.4
Ground truth                              82.8   83.8   85.3

Table 4: Naturalness MOS with 95% confidence intervals. The ground truth for both datasets is synthetic English speech.

Model           Vocoder            Conversational   Fisher-test
Translatotron   WaveRNN            4.08 ± 0.06      3.69 ± 0.07
                Griffin-Lim        3.20 ± 0.06      3.05 ± 0.08
ST → TTS        WaveRNN            4.32 ± 0.05      4.09 ± 0.06
                Griffin-Lim        3.46 ± 0.07      3.24 ± 0.07
Ground truth    Griffin-Lim        3.71 ± 0.06      -
                Parallel WaveNet   -                3.96 ± 0.06

speech in a single female English speaker's voice in order to simplify the learning objective. We use an English Tacotron 2 TTS model [26] but use a Griffin-Lim vocoder for expediency. In addition, we augment the input source speech by adding background noise and reverberation in the same manner as [21]. The resulting dataset contains 979k parallel utterance pairs, containing 1.4k hours of source speech and 619 hours of synthesized target speech. The total target speech duration is much smaller because the TTS output is better endpointed and contains fewer pauses. 9.6k pairs are held out for testing.
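As a concrete illustration of the noise-augmentation step above, the following is a minimal sketch that mixes background noise into a source waveform at a chosen signal-to-noise ratio. The SNR handling shown here, and the omission of the reverberation component, are assumptions; [21] describes the actual augmentation recipe used.

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix background noise into a speech waveform at snr_db.

    Both inputs are 1-D float waveforms. Reverberation (also part of the
    augmentation described above) is not shown here.
    """
    noise = np.resize(noise, speech.shape)          # loop/crop noise to length
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10 * log10(speech_power / noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```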
Input feature frames are created by stacking 3 adjacent frames of an 80-channel log-mel spectrogram as in [21]. The speaker encoder was not used in these experiments since the target speech always came from the same speaker.
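A minimal numpy sketch of this frame stacking follows, assuming the 80-channel log-mel spectrogram has already been computed by a standard front end; the edge padding and the matching reduction in frame rate are assumptions beyond what is stated above.

```python
import numpy as np

def stack_frames(log_mel: np.ndarray, num_stack: int = 3) -> np.ndarray:
    """Stack groups of `num_stack` adjacent frames of a log-mel spectrogram.

    log_mel: [num_frames, 80] array of 80-channel log-mel features.
    Returns [ceil(num_frames / num_stack), 80 * num_stack] stacked features.
    """
    num_frames, num_channels = log_mel.shape
    pad = (-num_frames) % num_stack                 # pad to a multiple of num_stack
    padded = np.pad(log_mel, ((0, pad), (0, 0)), mode="edge")
    return padded.reshape(-1, num_stack * num_channels)

# Example: 100 input frames of 80 channels -> 34 stacked frames of 240 dims.
stacked = stack_frames(np.random.randn(100, 80))
print(stacked.shape)  # (34, 240)
```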
Table 2 shows performance of the model trained using different combinations of auxiliary losses, compared to a baseline ST → TTS cascade model using a speech-to-text translation model [21] trained on the same data, and the same Tacotron 2 TTS model used to synthesize training targets. Note that the ground truth BLEU score is below 100 due to ASR errors during evaluation, or TTS failure when synthesizing the ground truth.

Training without auxiliary losses leads to extremely poor performance. The model correctly synthesizes common words and simple phrases, e.g. translating "hola" to "hello". However, it does not consistently translate full utterances. While it always generates plausible speech sounds in the target voice, the output can be independent of the input, composed of a string of nonsense syllables. This is consistent with failure to learn to attend to the input, and reflects the difficulty of the direct S2ST task.

Integrating auxiliary phoneme recognition tasks helped regularize the encoder and enabled the model to learn attention, dramatically improving performance. The target phoneme PER is much higher than on source phonemes, reflecting the difficulty of the corresponding translation task. Training using both auxiliary tasks achieved the best quality, but the performance difference between different combinations is small. Overall, there remains a gap of 6 BLEU points to the baseline, indicating room for improvement. Nevertheless, the relatively narrow gap demonstrates the potential of the end-to-end approach.
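To make the role of the auxiliary losses concrete, here is a rough PyTorch sketch of a combined training objective: the primary spectrogram loss plus cross-entropy terms from the source and target phoneme decoders. The L1 spectrogram loss, the single shared weight, and the tensor shapes are assumptions; the text above only states that the auxiliary phoneme recognition tasks are trained alongside the primary S2ST task.

```python
import torch
import torch.nn.functional as F

def s2st_training_loss(pred_spec, target_spec,
                       src_logits, src_phonemes,
                       tgt_logits, tgt_phonemes,
                       aux_weight: float = 1.0):
    """Primary spectrogram loss plus auxiliary phoneme recognition losses.

    pred_spec, target_spec: [batch, frames, mel_channels] spectrograms.
    src_logits, tgt_logits: [batch, steps, num_phonemes] decoder outputs.
    src_phonemes, tgt_phonemes: [batch, steps] integer phoneme targets.
    Setting aux_weight to 0 corresponds to the 'None' rows of Tables 2 and 3;
    dropping one term corresponds to the 'Source' / 'Target' rows.
    """
    spec_loss = F.l1_loss(pred_spec, target_spec)
    src_loss = F.cross_entropy(src_logits.transpose(1, 2), src_phonemes)
    tgt_loss = F.cross_entropy(tgt_logits.transpose(1, 2), tgt_phonemes)
    return spec_loss + aux_weight * (src_loss + tgt_loss)
```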
3.2. Fisher Spanish-to-English

This dataset contains about 120k parallel utterance pairs², spanning 127 hours of source speech. Target speech is synthesized using Parallel WaveNet [43] with the same voice as in the previous section. The result contains 96 hours of synthetic target speech.

² This is a subset of the Fisher data due to TTS errors on target text.

Following [19], input features were constructed by stacking 80-channel log-mel spectrograms with deltas and accelerations. Given the small size of the dataset compared to that in Sec. 3.1, we found that obtaining good performance required significantly more careful regularization and tuning. As shown in Table 1, we used a narrower encoder dimension of 256 and a shallower 4-layer decoder, and added Gaussian weight noise to all LSTM weights as regularization, as in [19]. The model was especially sensitive to the auxiliary decoder hyperparameters, with the best performance coming from passing activations from intermediate layers of the encoder stack as inputs to the auxiliary decoders, using slightly more aggressive dropout of 0.3, and decaying the auxiliary loss weight over the course of training in order to encourage the model to fit the primary S2ST task.
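For reference, a small sketch of the Fisher input feature construction described above, appending first- and second-order derivatives to the 80-channel log-mel spectrogram. The delta window width is librosa's default and an assumption, and any further frame stacking from [19] is omitted.

```python
import numpy as np
import librosa

def fisher_input_features(log_mel: np.ndarray) -> np.ndarray:
    """Append deltas and accelerations to an 80-channel log-mel spectrogram.

    log_mel: [80, num_frames] features (channels first, as librosa produces).
    Returns a [240, num_frames] matrix: static + delta + delta-delta features.
    """
    delta = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)
    return np.concatenate([log_mel, delta, delta2], axis=0)
```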
Experiment results are shown in Table 3. Once again using two auxiliary losses works best, but in contrast to Section 3.1, there is a large performance boost relative to using either one alone. Performance using only the source recognition loss is very poor, indicating that learning alignment on this task is especially difficult without strong supervision on the translation task.

We found that 4-head attention works better than one head, unlike the conversational task, where both attention mechanisms had similar performance. Finally, as in [21], we find that pre-training the bottom 6 encoder layers on an ST task improves BLEU scores by over 5 points. This is the best performing direct S2ST model, obtaining 76% of the baseline performance.
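The 1-head vs. 4-head comparison in Table 3 only changes the number of attention heads over the encoder output. The snippet below illustrates that knob with torch.nn.MultiheadAttention; this standard scaled dot-product module is only a stand-in, since the exact attention mechanism is not restated here, and all dimensions are illustrative.

```python
import torch

enc_out = torch.randn(1, 120, 256)   # [batch, source frames, encoder dim]
dec_query = torch.randn(1, 1, 256)   # a single decoder step as the query

for num_heads in (1, 4):             # cf. the 1-head vs. 4-head rows of Table 3
    attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=num_heads,
                                       batch_first=True)
    context, weights = attn(dec_query, enc_out, enc_out)
    print(num_heads, context.shape, weights.shape)  # context: [1, 1, 256]
```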
3.3. Subjective evaluation of speech naturalness

To evaluate synthesis quality of the best performing models from Tables 2 and 3, we use the framework from [15] to crowdsource 5-point MOS evaluations based on subjective listening tests. 1k examples were rated for each dataset, each one by a single rater. Although this evaluation is expected to be independent of the correctness of the translation, translation errors can result in low scores for examples raters describe as "not understandable".

Results are shown in Table 4, comparing different vocoders, where results with Griffin-Lim correspond to identical model configurations as in Sections 3.1 and 3.2. As expected, using WaveRNN vocoders dramatically improves ratings over Griffin-Lim, into the "Very Good" range (above 4.0). Note that it is most fair …
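As a small illustration of how the numbers in Table 4 are summarized, the sketch below computes a mean opinion score with a 95% confidence interval from per-example ratings. The normal approximation used here is an assumption, since the text does not state how the intervals were computed.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence: float = 0.95):
    """Mean opinion score and half-width of a normal-approximation CI.

    ratings: iterable of 1-5 naturalness scores, one per rated example.
    """
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = stats.sem(ratings) * stats.norm.ppf(0.5 + confidence / 2.0)
    return mean, half_width

# e.g. 1,000 crowdsourced ratings -> something like (4.08, 0.06)
```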
Source: "Qué tal, eh, yo soy Guillermo, ¿Cómo estás?" 5 Table 5: Voice transfer performance when conditioned on source,
70 4
Mel channel
60 3
50 2
40 1
30 0
20 1
10 2 strategies. The top row transfers the source speaker’s voice to the
0 3
Translatotron: "hi a i'm of the ermo how are you" 5 translated speech, while row two is a “cheating” configuration
70 4
Mel channel