click; 4) we exclude all cases when the standard version produces a higher-energy sound, so we consider only ratios exceeding one; if no term satisfies this condition, the maximum is set to zero.
The resulting loss function penalizes splitting at frames where it produces clicks and, on the other hand, encourages splitting at frames where no click is observed:

Loss(x) = p_θ(x) · µ(x) + (1 − p_θ(x)) · C        (3)

where x denotes the analyzed frame, p_θ(x) is the probability (given by the neural network with parameters θ) that this frame can be used to divide the spectrogram into parts synthesized independently without loss in quality, µ(x) is the measure of a click (calculated by Equation 2) that appears if this frame is chosen as the splitting one, and C is some positive borderline value: if the measure of a click µ(x) is less than C, we regard it as “no click”.

To train such a network we create a dataset consisting of pairs of audio records synthesized with and without parallelization. To generate audio with parallelization for this dataset we allocate splitting frames at random. After the network is trained, it is also possible to create a new dataset where splitting frames are allocated in accordance with p_θ(x) rather than randomly and to continue the training on the new dataset. However, we found that such a training procedure, which resembles imitation learning [19], does not improve quality.

At test time, frames with p_θ(x) > 0.95 were considered for splitting. Neural models found about 10% more splitting frames than the energy-based algorithm for all languages we tested.
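A minimal sketch of how Equation 3 and this test-time rule could be wired up is given below; the function names are illustrative, and the click measure µ(x) is assumed to be computed separately according to Equation 2.

```python
import numpy as np

def splitting_loss(p, mu, C):
    """Per-frame loss from Equation 3.

    p  -- predicted probability p_theta(x) that the frame is a safe split point
    mu -- click measure mu(x) for this frame (Equation 2)
    C  -- borderline value: a click measure below C is treated as "no click"
    """
    return p * mu + (1.0 - p) * C

def select_splitting_frames(probs, threshold=0.95):
    """Test-time rule: frames with p_theta(x) above the threshold are split candidates."""
    return np.where(np.asarray(probs) > threshold)[0]
```

Minimizing this loss pushes p_θ(x) towards zero whenever µ(x) exceeds C and towards one otherwise, which is the intended behaviour of the classifier.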
3.2. Cross-fading

Instead of detecting frames that split the spectrogram into independent parts so that the corresponding synthesized waveforms can simply be concatenated without any post-processing, we can focus on improving the post-processing techniques that work well for any choice of a splitting frame. This can be done with decent quality if the segments synthesized in parallel are overlapped.

As a baseline we take the linear cross-fading technique [11]. Assume that the vocoder is processing two segments in parallel so that the first segment contains frames 1,..,k while the second one contains frames k, k+1,.. That is, the two waves overlap in the k-th frame. Concatenating the first wave s^(1) and the second wave s^(2) with linear cross-fading means that in the k-th frame the samples of the resulting wave s are given by:

s_i = (1 − a_i) s^(1)_i + a_i s^(2)_i,   i = 0,..,N        (4)

where N is the number of samples in a frame. In our experiments we took a_i = i/N so that s^(1) fades out uniformly.

An ideal choice of coefficients in the linear combination (4) is known to depend on the correlation between s^(1) and s^(2) (see [11]). Intuitively, it is clear that for highly correlated signals the quality of concatenation with cross-fading is less dependent on the values of the coefficients in Equation 4. In the limiting case when s^(1) is equal to s^(2), the values of these coefficients do not even matter as long as they sum to one. This suggests that cross-fading quality can improve after applying a left shift to the second signal s^(2) such that its correlation with s^(1) increases. This idea is illustrated in Figure 3.

Figure 3: Unmodified waves (left) and the same waves but with shift applied to one of them (right). Linear cross-fading works better for the two waves on the right.

Thus, we consider linear cross-fading with shift:

m = arg min_{j=0,..,M} Σ_{i=0}^{W} |s^(2)_{i+j} − s^(1)_i|        (5)

s_i = (1 − (i/N)^α) s^(1)_i + (i/N)^α s^(2)_{i+m},   i = 0,..,N        (6)

where M is the maximum possible shift value and W is the size of the window used for calculating the L1 distance between the signals (we found that minimizing the L1 distance leads to the same quality as maximizing correlation). We put M = W = N/2 in our experiments. We also introduce an additional parameter α: the larger α is, the longer the first wave s^(1) does not fade out (high cross-fading quality was observed for α ∈ [1; 3]).
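To make Equations 5 and 6 concrete, here is a minimal NumPy sketch that picks the shift m and blends the two overlapping waves; the function name, argument layout and boundary handling are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def crossfade_with_shift(s1, s2, N, alpha=2.0):
    """Linear cross-fading with shift, following Equations 5 and 6.

    s1 -- tail of the first wave: at least N + 1 samples of the overlap region
    s2 -- head of the second wave: at least M + N + 1 samples, so that the
          shifted samples s2[i + m] exist for i = 0,..,N
    N  -- number of samples in the overlapping frame (indices run over 0,..,N)
    """
    s1, s2 = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    M = W = N // 2  # maximum shift and L1 window; M = W = N/2 in the experiments

    # Equation 5: choose the shift m minimizing the L1 distance between
    # the shifted second wave and the first wave over a window of W samples.
    m = min(range(M + 1), key=lambda j: np.sum(np.abs(s2[j:j + W] - s1[:W])))

    # Equation 6: blend with weights (1 - (i/N)^alpha) and (i/N)^alpha;
    # the larger alpha is, the longer the first wave keeps full weight.
    i = np.arange(N + 1)
    w = (i / N) ** alpha
    return (1.0 - w) * s1[:N + 1] + w * s2[m:m + N + 1]
```

With α = 1 and m = 0 this reduces to the plain linear cross-fading of Equation 4 with a_i = i/N.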
The code for splitting frames detection (both energy-based and network-based methods) and synthesis with linear cross-fading (both with and without shift) is available at https://github.com/li1jkdaw/LPCNet_parallel/tree/code.

4. Performance evaluation

We performed subjective human evaluation tests on Amazon Mechanical Turk. Four single-speaker datasets were used: we trained models on male Italian and female English, French and Spanish speakers. All the datasets are internal except the English one which is the LJSpeech dataset [20]. A small portion of the audio records that were used in our experiments is available at https://li1jkdaw.github.io/LPCNet_parallel.

Table 1: A/B testing results.

Which is better?   Non-parallel   Parallel     Identical
EB-splitting       23.9%          21.5%        54.6%
NN-splitting       21.1%          19.4%        59.5%
XF with shift      14.4%          17.5%        68.1%

Which is better?   W/o shift      With shift   Identical
Cross-fading       7.3%           43.3%        49.4%
Table 2: Mean Opinion Scores for ground truth records and speech synthesized with different methods.

Dataset   Duration   Vocoder   Ground Truth   Non-parallel   EB-splitting   NN-splitting   XF with shift
English   24 hours   WaveRNN   4.46 ± 0.17    4.08 ± 0.20    4.15 ± 0.22    —              4.22 ± 0.20
English   24 hours   LPCNet    3.98 ± 0.12    3.74 ± 0.10    3.78 ± 0.10    3.71 ± 0.11    3.75 ± 0.11
Italian   23 hours   LPCNet    4.15 ± 0.14    3.45 ± 0.16    3.60 ± 0.16    3.42 ± 0.26    3.56 ± 0.19
French    8 hours    LPCNet    4.46 ± 0.10    3.83 ± 0.17    3.86 ± 0.16    3.88 ± 0.17    3.84 ± 0.19
Spanish   17 hours   LPCNet    4.43 ± 0.12    3.54 ± 0.11    3.52 ± 0.11    3.50 ± 0.12    3.50 ± 0.18
4.1. A/B testing

In order to check that the methods of vocoder parallelization described in Section 3 do perform well, we conducted a series of A/B tests. In each of these tests the participants were presented with 25 pairs of recordings of the same sentence and asked to choose which one they preferred in case they heard any difference. In each pair the records were synthesized conditioned on the same spectrogram, since we did not want a small variation in the Tacotron2 output to affect the result. The main purpose of the tests was to ensure that the parallelization did not lead to degradation of the sound quality. As we had several approaches to vocoder parallelization, we carried out separate A/B tests for different system design choices. For the strategy involving synthesis of non-overlapping segments we tested the two splitting frame detection methods, i.e. the energy-based and neural network-based criteria (EB-splitting and NN-splitting respectively). For the alternative strategy involving synthesis of overlapping segments we tested cross-fading with shift as the post-processing technique (XF with shift). In the latter case, we allocated 2 splitting frames per second. Each pair in all tests was evaluated by at least 20 listeners. The tests were performed on English data only.

The results of these tests are presented in Table 1. In more than half of the cases people noticed no difference between the parallel and non-parallel versions. In the remaining cases the difference between the presented methods was small. To obtain statistically significant results, we applied the sign test [21] and concluded that we cannot reject (at the 95% confidence level) the hypothesis that speech synthesized with the analyzed methods of parallelization has the same quality as the one synthesized in a normal way.

We also performed another A/B test to show that linear cross-fading with shift leads to better sound quality than the same method without shift (XF w/o shift). The sign test applied to the results of Table 1 allows us to reject the hypothesis that cross-fading without shift works at least as well as the one with shift. When a splitting frame corresponds to a vowel, linear cross-fading without shift leads to audible artifacts. In contrast to the previous A/B test with overlapping segments, for this one we allocated 10 splitting frames per second instead of 2 to underline the mentioned problem and check whether adding shift can fix it.
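Such a sign test can be run directly on the non-tied preference counts; the sketch below uses SciPy's binomial test, and the example counts are hypothetical since only aggregate percentages are reported here.

```python
from scipy.stats import binomtest

def sign_test_p_value(prefer_a, prefer_b, alternative="two-sided"):
    """Sign test on paired A/B preferences: ties ("identical") are discarded,
    and under the null hypothesis each remaining vote prefers A with p = 0.5."""
    return binomtest(prefer_a, prefer_a + prefer_b, p=0.5,
                     alternative=alternative).pvalue

# Hypothetical counts, for illustration only: a p-value above 0.05 means that
# equal quality cannot be rejected at the 95% confidence level.
print(sign_test_p_value(prefer_a=120, prefer_b=108))
```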
4.2. MOS evaluation

To evaluate the overall quality and naturalness of the speech synthesized with our TTS system, we launched a Mean Opinion Score (MOS) evaluation for four languages. Additionally, we trained a Tacotron2 + WaveRNN system on the English dataset sampled at 22kHz to show that the vocoder optimization methods described in Section 3 can be applied not only to LPCNet or 16kHz synthesis. For each test the TTS system produced no less than 15 audio records.

We asked participants selected according to a geographic criterion to estimate the quality of these records on a five-point Likert scale, i.e. to classify a record as “Bad” (1 point), “Poor” (2 points), “Fair” (3 points), “Good” (4 points) or “Excellent” (5 points). We also included ground truth recordings and special noisy recordings in the tests in order to keep track of the attention of the assessors and prevent random answers. The assessors who gave less than 3 points to the ground truth records or more than 3 points to the noisy records were excluded from the experiment. As in the A/B tests, each piece of audio was evaluated by at least 20 people.
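One possible way to implement the assessor filtering and MOS aggregation described above is sketched below; the data layout and the reading of the exclusion rule as “any such rating” are assumptions made for illustration.

```python
import numpy as np

def passes_attention_check(ratings, gt_ids, noisy_ids):
    """Keep an assessor only if no ground-truth record got less than 3 points
    and no noisy control record got more than 3 points."""
    ok_gt = all(ratings[r] >= 3 for r in gt_ids if r in ratings)
    ok_noisy = all(ratings[r] <= 3 for r in noisy_ids if r in ratings)
    return ok_gt and ok_noisy

def mean_opinion_score(all_ratings, record_ids, gt_ids, noisy_ids):
    """MOS over the assessors that pass the attention check.

    all_ratings -- dict: assessor id -> dict mapping record id to a 1..5 score
    """
    votes = [ratings[r]
             for ratings in all_ratings.values()
             if passes_attention_check(ratings, gt_ids, noisy_ids)
             for r in record_ids if r in ratings]
    return float(np.mean(votes))
```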
Table 2 demonstrates that all the optimization tricks referenced there perform well enough in general and, in particular, show the same sound quality as TTS with the non-parallel vocoder. Note that in the case of WaveRNN we used 22kHz audio, which explains the difference between the ground truth MOS in the English tests with LPCNet and WaveRNN.

4.3. Efficiency evaluation

To test overall performance we implemented Tacotron2 and LPCNet (with splitting frames detection by the energy-based criterion) on several mobile devices with 8-core ARM processors: Mediatek MT6762 (4x2.0GHz Cortex-A53 + 4x1.5GHz Cortex-A53), Kirin950 (4x2.3GHz Cortex-A72 + 4x1.8GHz Cortex-A53) and Kirin710 (4x2.2GHz Cortex-A73 + 4x1.7GHz Cortex-A53).

The whole TTS application requires only 12.5Mb of storage (11.4Mb for Tacotron2 and 1.1Mb for LPCNet). All weights are stored as 8-bit numbers. As for the speed, we report average values of Real Time Factor (RTF) and First Frame Delay (FFD, see Section 2) in Table 3. RTF is defined as the time it takes to synthesize some piece of audio divided by its duration. FFD is measured in milliseconds.

Table 3: Vocoder parallelization efficiency.

            1 thread      2 threads     3 threads
            FFD    RTF    FFD    RTF    FFD    RTF
MT6762      323    1.64   345    1.07   352    0.89
Kirin950    202    1.18   213    0.75   243    0.67
Kirin710    170    1.09   184    0.69   191    0.56

Table 3 shows that using 3 threads for the parallel vocoder gives an almost 2x speedup, resulting in faster than real-time synthesis. At the same time, independent generation of non-overlapping segments introduces an overhead of approximately 5 msec for the detection of splitting frames. Moreover, we should note that since the splitting time is undefined, the RTF can vary from phrase to phrase.
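As a rough illustration of how RTF and FFD can be measured around one synthesis call, the sketch below assumes a hypothetical synthesize generator that yields audio chunks as they are produced; it is not part of the described system.

```python
import time

def measure_rtf_ffd(synthesize, text, sample_rate=16000):
    """Measure Real Time Factor and First Frame Delay for one utterance.

    synthesize -- hypothetical generator: synthesize(text) yields audio chunks
                  (arrays of samples); the first chunk marks the moment
                  playback could start.
    """
    start = time.perf_counter()
    total_samples = 0
    ffd_ms = None
    for chunk in synthesize(text):
        if ffd_ms is None:
            ffd_ms = (time.perf_counter() - start) * 1000.0  # First Frame Delay, ms
        total_samples += len(chunk)
    elapsed = time.perf_counter() - start
    rtf = elapsed / (total_samples / sample_rate)  # synthesis time / audio duration
    return rtf, ffd_ms
```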
5. Conclusion

In this work we have presented a text-to-speech system that is suitable for low-to-mid range mobile devices. The optimization techniques that we describe allow this system to run on low-end hardware without any loss in quality. Besides, we investigated several parallelization techniques applicable to autoregressive vocoders and showed that these techniques did not have any negative impact on the synthesized speech despite the fact that parallelization breaks the correlation between speech samples. Further research can be focused on the development of lightweight TTS vocoders capable of generating any speech segment without splitting frame detection or segment overlap.
6. References

[1] Y. Wang, R. Skerry-Ryan, D. Stanton et al., “Tacotron: Towards end-to-end speech synthesis,” ArXiv, 2017. [Online]. Available: https://arxiv.org/abs/1703.10135
[2] A. van den Oord, S. Dieleman, H. Zen et al., “WaveNet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.
[3] J. Shen, R. Pang, R. J. Weiss et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, April 2018, pp. 4779–4783.
[4] A. van den Oord, Y. Li, I. Babuschkin et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 3918–3926.
[5] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” ArXiv, 2018. [Online]. Available: http://arxiv.org/abs/1807.07281
[6] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019. IEEE, May 2019, pp. 3617–3621.
[7] N. Kalchbrenner, E. Elsen, K. Simonyan et al., “Efficient neural audio synthesis,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 10–15 Jul 2018, pp. 2410–2419.
[8] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019. IEEE, May 2019, pp. 5891–5895.
[9] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-time speaker-dependent neural vocoder,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp. 2251–2255.
[10] V. Popov, M. Kudinov, and T. Sadekova, “Gaussian LPCNet for multisample speech synthesis,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020. IEEE, 2020.
[11] M. Fink, M. Holters, and U. Zölzer, “Signal-matched power-complementary cross-fading and dry-wet mixing,” in Proceedings of the 19th International Conference on Digital Audio Effects (DAFx-16), September 2016, pp. 109–112.
[12] [Online]. Available: https://github.com/fatchord/WaveRNN/issues/9
[13] [Online]. Available: https://ai.googleblog.com/2020/04/improving-audio-quality-in-duo-with.html
[14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
[15] J. Chorowski, D. Bahdanau, D. Serdyuk et al., “Attention-based models for speech recognition,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge, MA, USA: MIT Press, 2015, pp. 577–585.
[16] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[17] B. Moore, An Introduction to the Psychology of Hearing, 5th ed. Brill, 2012.
[18] E. Battenberg, R. J. Skerry-Ryan, S. Mariooryad et al., “Location-relative attention mechanisms for robust long-form speech synthesis,” ArXiv, vol. abs/1910.10288, 2019.
[19] T. Osa, J. Pajarinen, G. Neumann et al., “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
[20] K. Ito, “The LJ Speech Dataset,” 2017.
[21] W. Conover, Practical Nonparametric Statistics, 3rd ed. New York, NY: Wiley, 1999.