click; 4) we exclude all cases when the standard version produces a higher-energy sound, so we consider only ratios exceeding one; if no term satisfies this condition, the maximum is set to zero.
The resulting loss function penalizes splitting at frames where it produces clicks and, on the other hand, encourages splitting at frames where no click is observed:

Loss(x) = p_θ(x) · µ(x) + (1 − p_θ(x)) · C        (3)

where x denotes the analyzed frame, p_θ(x) is the probability (given by the neural network with parameters θ) that this frame can be used to divide the spectrogram into parts synthesized independently without loss in quality, µ(x) is the measure of a click (calculated by Equation 2) that appears if this frame is chosen as the splitting one, and C is some positive borderline value: if the measure of a click µ(x) is less than C, we regard it as “no click”.

To train such a network we create a dataset consisting of pairs of audio records synthesized with and without parallelization. To generate audio with parallelization for this dataset we allocate splitting frames at random. After the network is trained, it is also possible to create a new dataset where splitting frames are allocated in accordance with p_θ(x) rather than randomly and to continue the training on the new dataset. However, we found that such a training procedure, which resembles imitation learning [19], does not improve quality.

At test time, frames with p_θ(x) > 0.95 were considered for splitting. Neural models found about 10% more splitting frames than the energy-based algorithm for all languages we tested.
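A minimal sketch of how Equation 3 and this test-time rule could be wired up is given below; the function names are illustrative, and the click measure µ(x) is assumed to be computed separately according to Equation 2.

```python
import numpy as np

def splitting_loss(p, mu, C):
    """Per-frame loss from Equation 3.

    p  -- predicted probability p_theta(x) that the frame is a safe split point
    mu -- click measure mu(x) for this frame (Equation 2)
    C  -- borderline value: a click measure below C is treated as "no click"
    """
    return p * mu + (1.0 - p) * C

def select_splitting_frames(probs, threshold=0.95):
    """Test-time rule: frames with p_theta(x) above the threshold are split candidates."""
    return np.where(np.asarray(probs) > threshold)[0]
```

Minimizing this loss pushes p_θ(x) towards zero whenever µ(x) exceeds C and towards one otherwise, which is the intended behaviour of the classifier.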
3.2. Cross-fading

Instead of detecting frames that split the spectrogram into independent parts so that the corresponding synthesized waveforms can simply be concatenated without any post-processing, we can focus on improving the post-processing techniques that work well for any choice of a splitting frame. This can be done with decent quality if the segments synthesized in parallel are overlapped.

As a baseline we take the linear cross-fading technique [11]. Assume that the vocoder is processing two segments in parallel so that the first segment contains frames 1,..,k while the second one contains frames k, k+1,.. That is, the two waves overlap in the k-th frame. Concatenating the first wave s^(1) and the second wave s^(2) with linear cross-fading means that in the k-th frame the samples of the resulting wave s are given by:

s_i = (1 − a_i) s^(1)_i + a_i s^(2)_i,   i = 0,..,N        (4)

where N is the number of samples in a frame. In our experiments we took a_i = i/N so that s^(1) fades out uniformly.

An ideal choice of coefficients in the linear combination (4) is known to depend on the correlation between s^(1) and s^(2) (see [11]). Intuitively, it is clear that for highly correlated signals the quality of concatenation with cross-fading is less dependent on the values of the coefficients in Equation 4. In the limiting case when s^(1) is equal to s^(2), the values of these coefficients do not even matter as long as they sum to one. This suggests that cross-fading quality can improve after applying a left shift to the second signal s^(2) such that its correlation with s^(1) increases. This idea is illustrated in Figure 3.

Figure 3: Unmodified waves (left) and the same waves but with shift applied to one of them (right). Linear cross-fading works better for the two waves on the right.

Thus, we consider linear cross-fading with shift:

m = arg min_{j=0,..,M} Σ_{i=0}^{W} |s^(2)_{i+j} − s^(1)_i|        (5)

s_i = (1 − (i/N)^α) s^(1)_i + (i/N)^α s^(2)_{i+m},   i = 0,..,N        (6)

where M is the maximum possible shift value and W is the size of the window used for calculating the L1 distance between the signals (we found that minimizing the L1 distance leads to the same quality as maximizing correlation). We put M = W = N/2 in our experiments. We also introduce an additional parameter α: the larger α is, the longer the first wave s^(1) does not fade out (high cross-fading quality was observed for α ∈ [1; 3]).
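To make Equations 5 and 6 concrete, here is a minimal NumPy sketch that picks the shift m and blends the two overlapping waves; the function name, argument layout and boundary handling are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def crossfade_with_shift(s1, s2, N, alpha=2.0):
    """Linear cross-fading with shift, following Equations 5 and 6.

    s1 -- tail of the first wave: at least N + 1 samples of the overlap region
    s2 -- head of the second wave: at least M + N + 1 samples, so that the
          shifted samples s2[i + m] exist for i = 0,..,N
    N  -- number of samples in the overlapping frame (indices run over 0,..,N)
    """
    s1, s2 = np.asarray(s1, dtype=float), np.asarray(s2, dtype=float)
    M = W = N // 2  # maximum shift and L1 window; M = W = N/2 in the experiments

    # Equation 5: choose the shift m minimizing the L1 distance between
    # the shifted second wave and the first wave over a window of W samples.
    m = min(range(M + 1), key=lambda j: np.sum(np.abs(s2[j:j + W] - s1[:W])))

    # Equation 6: blend with weights (1 - (i/N)^alpha) and (i/N)^alpha;
    # the larger alpha is, the longer the first wave keeps full weight.
    i = np.arange(N + 1)
    w = (i / N) ** alpha
    return (1.0 - w) * s1[:N + 1] + w * s2[m:m + N + 1]
```

With α = 1 and m = 0 this reduces to the plain linear cross-fading of Equation 4 with a_i = i/N.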
The code for splitting frames detection (both energy-based and network-based methods) and synthesis with linear cross-fading (both with and without shift) is available at https://github.com/li1jkdaw/LPCNet_parallel/tree/code.

4. Performance evaluation

We performed subjective human evaluation tests on Amazon Mechanical Turk. Four single-speaker datasets were used: we trained models on male Italian and female English, French and Spanish speakers. All the datasets are internal except the English one which is the LJSpeech dataset [20]. A small portion of the audio records that were used in our experiments is available at https://li1jkdaw.github.io/LPCNet_parallel.

Table 1: A/B testing results.

Which is better?   Non-parallel   Parallel     Identical
EB-splitting       23.9%          21.5%        54.6%
NN-splitting       21.1%          19.4%        59.5%
XF with shift      14.4%          17.5%        68.1%

Which is better?   W/o shift      With shift   Identical
Cross-fading       7.3%           43.3%        49.4%
Table 2: Mean Opinion Scores for ground truth records and speech synthesized with different methods.

Dataset   Duration   Vocoder   Ground Truth   Non-parallel   EB-splitting   NN-splitting   XF with shift
English   24 hours   WaveRNN   4.46 ± 0.17    4.08 ± 0.20    4.15 ± 0.22    —              4.22 ± 0.20
English   24 hours   LPCNet    3.98 ± 0.12    3.74 ± 0.10    3.78 ± 0.10    3.71 ± 0.11    3.75 ± 0.11
Italian   23 hours   LPCNet    4.15 ± 0.14    3.45 ± 0.16    3.60 ± 0.16    3.42 ± 0.26    3.56 ± 0.19
French    8 hours    LPCNet    4.46 ± 0.10    3.83 ± 0.17    3.86 ± 0.16    3.88 ± 0.17    3.84 ± 0.19
Spanish   17 hours   LPCNet    4.43 ± 0.12    3.54 ± 0.11    3.52 ± 0.11    3.50 ± 0.12    3.50 ± 0.18
4.1. A/B testing

In order to check that the methods of vocoder parallelization described in Section 3 do perform well, we conducted a series of A/B tests. In each of these tests the participants were presented with 25 pairs of recordings of the same sentence and asked to choose which one they preferred in case they heard any difference. In each pair the records were synthesized conditioned on the same spectrogram, since we did not want a small variation in the Tacotron2 output to affect the result. The main purpose of the tests was to ensure that the parallelization did not lead to degradation of the sound quality. As we had several approaches to vocoder parallelization, we carried out separate A/B tests for different system design choices. For the strategy involving synthesis of non-overlapping segments we tested the two splitting frame detection methods, i.e. the energy-based and neural network-based criteria (EB-splitting and NN-splitting respectively). For the alternative strategy involving synthesis of overlapping segments we tested cross-fading with shift as the post-processing technique (XF with shift). In the latter case, we allocated 2 splitting frames per second. Each pair in all tests was evaluated by at least 20 listeners. The tests were performed on English data only.

The results of these tests are presented in Table 1. In more than half of the cases people noticed no difference between the parallel and non-parallel versions. In the remaining cases the difference between the presented methods was small. To obtain statistically significant results, we applied the sign test [21] and concluded that we cannot reject (at the 95% confidence level) the hypothesis that speech synthesized with the analyzed methods of parallelization has the same quality as the one synthesized in a normal way.

We also performed another A/B test to show that linear cross-fading with shift leads to better sound quality than the same method without shift (XF w/o shift). The sign test applied to the results of Table 1 allows us to reject the hypothesis that cross-fading without shift works at least as well as the one with shift. When a splitting frame corresponds to a vowel, linear cross-fading without shift leads to audible artifacts. In contrast to the previous A/B test with overlapping segments, for this one we allocated 10 splitting frames per second instead of 2 to underline the mentioned problem and check whether adding shift can fix it.
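Such a sign test can be run directly on the non-tied preference counts; the sketch below uses SciPy's binomial test, and the example counts are hypothetical since only aggregate percentages are reported here.

```python
from scipy.stats import binomtest

def sign_test_p_value(prefer_a, prefer_b, alternative="two-sided"):
    """Sign test on paired A/B preferences: ties ("identical") are discarded,
    and under the null hypothesis each remaining vote prefers A with p = 0.5."""
    return binomtest(prefer_a, prefer_a + prefer_b, p=0.5,
                     alternative=alternative).pvalue

# Hypothetical counts, for illustration only: a p-value above 0.05 means that
# equal quality cannot be rejected at the 95% confidence level.
print(sign_test_p_value(prefer_a=120, prefer_b=108))
```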
4.2. MOS evaluation

To evaluate the overall quality and naturalness of the speech synthesized with our TTS system, we launched a Mean Opinion Score (MOS) evaluation for four languages. Additionally, we trained a Tacotron2 + WaveRNN system on the English dataset sampled at 22kHz to show that the vocoder optimization methods described in Section 3 can be applied not only to LPCNet or 16kHz synthesis. For each test the TTS system produced no less than 15 audio records.

We asked participants selected according to a geographic criterion to estimate the quality of these records on a five-point Likert scale, i.e. to classify a record as “Bad” (1 point), “Poor” (2 points), “Fair” (3 points), “Good” (4 points) or “Excellent” (5 points). We also included ground truth recordings and special noisy recordings in the tests in order to keep track of the attention of the assessors and prevent random answers. The assessors who gave less than 3 points to the ground truth records or more than 3 points to the noisy records were excluded from the experiment. As in the A/B tests, each piece of audio was evaluated by at least 20 people.
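One possible way to implement the assessor filtering and MOS aggregation described above is sketched below; the data layout and the reading of the exclusion rule as “any such rating” are assumptions made for illustration.

```python
import numpy as np

def passes_attention_check(ratings, gt_ids, noisy_ids):
    """Keep an assessor only if no ground-truth record got less than 3 points
    and no noisy control record got more than 3 points."""
    ok_gt = all(ratings[r] >= 3 for r in gt_ids if r in ratings)
    ok_noisy = all(ratings[r] <= 3 for r in noisy_ids if r in ratings)
    return ok_gt and ok_noisy

def mean_opinion_score(all_ratings, record_ids, gt_ids, noisy_ids):
    """MOS over the assessors that pass the attention check.

    all_ratings -- dict: assessor id -> dict mapping record id to a 1..5 score
    """
    votes = [ratings[r]
             for ratings in all_ratings.values()
             if passes_attention_check(ratings, gt_ids, noisy_ids)
             for r in record_ids if r in ratings]
    return float(np.mean(votes))
```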
Table 2 demonstrates that all the optimization tricks referenced there perform well enough in general and, in particular, show the same sound quality as TTS with the non-parallel vocoder. Note that in the case of WaveRNN we used 22kHz audio, which explains the difference between the ground truth MOS in the English tests with LPCNet and WaveRNN.

4.3. Efficiency evaluation

To test overall performance we implemented Tacotron2 and LPCNet (with splitting frames detection by the energy-based criterion) on several mobile devices with 8-core ARM processors: Mediatek MT6762 (4x2.0GHz Cortex-A53 + 4x1.5GHz Cortex-A53), Kirin950 (4x2.3GHz Cortex-A72 + 4x1.8GHz Cortex-A53) and Kirin710 (4x2.2GHz Cortex-A73 + 4x1.7GHz Cortex-A53).

The whole TTS application requires only 12.5Mb of storage (11.4Mb for Tacotron2 and 1.1Mb for LPCNet). All weights are stored as 8-bit numbers. As for the speed, we report average values of Real Time Factor (RTF) and First Frame Delay (FFD, see Section 2) in Table 3. RTF is defined as the time it takes to synthesize some piece of audio divided by its duration. FFD is measured in milliseconds.

Table 3: Vocoder parallelization efficiency.

            1 thread      2 threads     3 threads
            FFD    RTF    FFD    RTF    FFD    RTF
MT6762      323    1.64   345    1.07   352    0.89
Kirin950    202    1.18   213    0.75   243    0.67
Kirin710    170    1.09   184    0.69   191    0.56

Table 3 shows that using 3 threads for the parallel vocoder gives an almost 2x speedup, resulting in faster than real-time synthesis. At the same time, independent generation of non-overlapping segments introduces an overhead of approximately 5 msec for the detection of splitting frames. Moreover, we should note that since the splitting time is undefined, the RTF can vary from phrase to phrase.
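As a rough illustration of how RTF and FFD can be measured around one synthesis call, the sketch below assumes a hypothetical synthesize generator that yields audio chunks as they are produced; it is not part of the described system.

```python
import time

def measure_rtf_ffd(synthesize, text, sample_rate=16000):
    """Measure Real Time Factor and First Frame Delay for one utterance.

    synthesize -- hypothetical generator: synthesize(text) yields audio chunks
                  (arrays of samples); the first chunk marks the moment
                  playback could start.
    """
    start = time.perf_counter()
    total_samples = 0
    ffd_ms = None
    for chunk in synthesize(text):
        if ffd_ms is None:
            ffd_ms = (time.perf_counter() - start) * 1000.0  # First Frame Delay, ms
        total_samples += len(chunk)
    elapsed = time.perf_counter() - start
    rtf = elapsed / (total_samples / sample_rate)  # synthesis time / audio duration
    return rtf, ffd_ms
```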
5. Conclusion

In this work we have presented a text-to-speech system that is suitable for low-to-mid range mobile devices. The optimization techniques that we describe allow this system to run on low-end hardware without any loss in quality. Besides, we investigated several parallelization techniques applicable to autoregressive vocoders and showed that these techniques did not have any negative impact on the synthesized speech despite the fact that parallelization breaks the correlation between speech samples. Further research can be focused on the development of lightweight TTS vocoders capable of generating any speech segment without splitting frame detection or segment overlap.
6. References

[1] Y. Wang, R. Skerry-Ryan, D. Stanton et al., “Tacotron: Towards end-to-end speech synthesis,” ArXiv, 2017. [Online]. Available: https://arxiv.org/abs/1703.10135
[2] A. van den Oord, S. Dieleman, H. Zen et al., “WaveNet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.
[3] J. Shen, R. Pang, R. J. Weiss et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, April 2018, pp. 4779–4783.
[4] A. van den Oord, Y. Li, I. Babuschkin et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 3918–3926.
[5] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel wave generation in end-to-end text-to-speech,” ArXiv, 2018. [Online]. Available: http://arxiv.org/abs/1807.07281
[6] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019. IEEE, May 2019, pp. 3617–3621.
[7] N. Kalchbrenner, E. Elsen, K. Simonyan et al., “Efficient neural audio synthesis,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 10–15 Jul 2018, pp. 2410–2419.
[8] J.-M. Valin and J. Skoglund, “LPCNet: Improving neural speech synthesis through linear prediction,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019. IEEE, May 2019, pp. 5891–5895.
[9] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-time speaker-dependent neural vocoder,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp. 2251–2255.
[10] V. Popov, M. Kudinov, and T. Sadekova, “Gaussian LPCNet for multisample speech synthesis,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020. IEEE, 2020.
[11] M. Fink, M. Holters, and U. Zölzer, “Signal-matched power-complementary cross-fading and dry-wet mixing,” in Proceedings of the 19th International Conference on Digital Audio Effects (DAFx-16), September 2016, pp. 109–112.
[12] [Online]. Available: https://github.com/fatchord/WaveRNN/issues/9
[13] [Online]. Available: https://ai.googleblog.com/2020/04/improving-audio-quality-in-duo-with.html
[14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
[15] J. Chorowski, D. Bahdanau, D. Serdyuk et al., “Attention-based models for speech recognition,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge, MA, USA: MIT Press, 2015, pp. 577–585.
[16] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[17] B. Moore, An Introduction to the Psychology of Hearing, 5th ed. Brill, 2012.
[18] E. Battenberg, R. J. Skerry-Ryan, S. Mariooryad et al., “Location-relative attention mechanisms for robust long-form speech synthesis,” ArXiv, vol. abs/1910.10288, 2019.
[19] T. Osa, J. Pajarinen, G. Neumann et al., “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
[20] K. Ito, “The LJ Speech Dataset,” 2017.
[21] W. Conover, Practical Nonparametric Statistics, 3rd ed. New York, NY: Wiley, 1999.