
INTERSPEECH 2020

October 25–29, 2020, Shanghai, China

Fast and lightweight on-device TTS with Tacotron2 and LPCNet


Vadim Popov, Stanislav Kamenev, Mikhail Kudinov, Sergey Repyevsky, Tasnima Sadekova,
Vitalii Bushaev, Vladimir Kryzhanovskiy, Denis Parkhomenko

Huawei Technologies Co. Ltd., Russia


[email protected], [email protected]

Abstract

We present a fast and lightweight on-device text-to-speech system based on state-of-the-art methods of feature and speech generation, i.e. Tacotron2 and LPCNet. We show that modification of the basic pipeline, combined with hardware-specific optimizations and extensive use of parallelization, enables running a TTS service even on low-end devices with faster than real-time waveform generation. Moreover, the system preserves high quality of speech without noticeable degradation of Mean Opinion Score compared to the non-optimized baseline. While the system is mostly oriented toward low-to-mid range hardware, we believe that it can also be used in any CPU-based environment.

Index Terms: on-device speech synthesis, recurrent neural vocoders, TTS optimization

1. Introduction

Neural models have recently become a common solution for text-to-speech synthesis since they can produce high-quality speech at reasonable computational cost. State-of-the-art results have been achieved by using attention-based models such as Tacotron [1] for feature generation and the autoregressive WaveNet vocoder [2] for conditional waveform generation.

As long as Tacotron operated on speech frames and was also capable of predicting several speech frames in one step [1, 3], the speech generation module was considered the main computational bottleneck. Improvements to the original autoregressive WaveNet vocoder were made in three directions: 1) development of non-autoregressive vocoders providing the same quality of speech [4, 5, 6]; 2) finding lightweight neural network architectures for conditional speech generation [7, 8]; 3) decreasing the number of prediction steps for a given target sample rate [7, 9, 10].

While several solutions referenced above [7, 8, 9, 10] were claimed to be capable of faster-than-real-time on-device speech generation, the hardware requirements are still high and can be met only on higher-end devices, which remain too expensive for many people.

Curiously enough, although relatively complex schemes of parallel waveform generation for recurrent and autoregressive models have been proposed in [7], the simplest brute-force methods have not been extensively discussed in the literature, at least to our knowledge. For example, the input mel-spectrogram can be split into independent parts based on a frame energy criterion, so that each part can be processed by a separate copy of the vocoder. The synthesized waves can then be concatenated with or without post-processing techniques (e.g. cross-fading [11]). Such an approach, though used by practitioners (e.g. [12, 13]), is not normally referenced in the literature even as a baseline.

In this paper, we combine various techniques of TTS model optimization and parallelization which can be either hardware-dependent or independent. Our solution is based on modified versions of two autoregressive neural architectures, namely Tacotron2 [3] and LPCNet [8], and is capable of generating high-quality speech. We show that this model is also efficient in terms of speed and memory usage. An extensive human evaluation conducted for four languages reveals that the overall quality of the synthesized speech is high enough and that the optimization tricks we use do not lead to sound quality degradation.

The paper is structured as follows: in Section 2 we describe the design of the feature generation module based on Tacotron2; in Section 3 various vocoder optimization techniques are discussed; in Section 4 human evaluation results are presented along with efficiency measurements; we conclude in Section 5.

2. Feature generation module

As mentioned above, our feature generation module is based on Tacotron2. Tacotron2 is a sequence-to-sequence neural architecture with attention [14] carrying out a transformation from an input character sequence into an output mel-spectrogram. Tacotron2 makes use of an autoregressive decoder implemented as a 2-layer LSTM with location-sensitive attention [15] and a convolutional postnet module, which was found critical for generating crisp mel-spectrograms. Integrating Tacotron2 with LPCNet required replacing the output mel-spectrum features [16] of the original Tacotron2 with the native features of LPCNet, i.e. a 20-dimensional vector consisting of 18 Bark-scale cepstral coefficients (BFCC) [17] and 2 pitch parameters (period, correlation). We also found that predicting normalized BFCCs made training of the feature generation model more stable, so we chose to predict normalized features and add a final denormalization layer.

The feature generation module and the vocoder can be effectively parallelized due to the big difference in their execution times. The overall execution time thus becomes almost equal to that of LPCNet. To facilitate this we fixed two computational bottlenecks of the original architecture.
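
Conceptually, this overlap between feature generation and waveform generation can be organized as a producer-consumer pipeline in which the feature generator pushes frames into a queue while the vocoder consumes them. The sketch below is only an illustration of that idea; the `acoustic_model` and `vocoder` callables are hypothetical placeholders, not the implementation used in the paper.

```python
# Minimal producer-consumer sketch: the (fast) acoustic model feeds frames to the
# (slow) vocoder through a bounded queue, so the two modules run concurrently.
import queue
import threading

def run_pipeline(acoustic_model, vocoder, text, frame_queue_size=64):
    frames = queue.Queue(maxsize=frame_queue_size)
    pcm_chunks = []

    def produce():
        # The acoustic model yields feature frames (e.g. BFCC + pitch) one by one.
        for frame in acoustic_model(text):
            frames.put(frame)
        frames.put(None)  # sentinel: no more frames

    def consume():
        while True:
            frame = frames.get()
            if frame is None:
                break
            # The vocoder turns one feature frame into a chunk of PCM samples
            # (assumed to be returned as raw bytes in this sketch).
            pcm_chunks.append(vocoder(frame))

    producer = threading.Thread(target=produce)
    consumer = threading.Thread(target=consume)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return b"".join(pcm_chunks)
```
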
The first bottleneck was the high computational cost of the Tacotron2 decoder, which comprises a 2-layer LSTM with 1024 units. We chose to decrease the size of the LSTM by a factor of 4. We should note, though, that for some languages we had to increase the number of LSTM layers from 2 to 3 to prevent quality degradation.

The second bottleneck was caused by the wide receptive field of the postnet module. Since the convolutions in the postnet are not causal, we have to wait until the number of frames processed by the decoder matches the size of the receptive field. This directly influences the first frame delay (FFD), i.e. the time delay between getting the character input and starting to output sound. For a zero-padded sequence and a stack of 1d-convolutions, FFD is calculated as:

FFD = E + D · ⌈R/2⌉ + P + V,    (1)
where E is the encoder execution time, V is the vocoder per-sample execution time, R is the width of the receptive field of the convolution stack, D is the decoder per-frame execution time and P is the postnet per-frame execution time.

The original postnet consists of 5 convolutions of shape 5×1 with stride 1, so R = 21. This implies a delay of E + 11D + P msec before wave generation starts. To alleviate this problem we change the widths of the convolution kernels to [5, 3, 3, 3], thus reducing the receptive field to 11 and decreasing the wave generation delay to E + 6D + P. The encoder execution time E, in turn, is almost negligible compared to 6D, so there is not much use in decreasing it. However, we should note that for the vanilla LPCNet Equation 1 is not completely accurate because of the frame-rate subnet [8], so in our final design we integrated the postnet of Tacotron2 and the frame-rate subnet of LPCNet into a single module. The resulting frame-rate module carries out the following functions: 1) the postnet transformation of Tacotron2; 2) BFCC denormalization; and 3) frame-rate feature generation of LPCNet.
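
For illustration, the receptive field of a stack of stride-1 1D convolutions and the delay term from Equation 1 can be computed as in the sketch below; the timing values in the example call are purely illustrative and are not measurements from the paper.

```python
# Helper illustrating Equation 1: receptive field of a stack of stride-1,
# dilation-1 1-D convolutions and the resulting first frame delay (FFD).
import math

def receptive_field(kernel_sizes):
    """R for a stack of stride-1 1-D convolutions: 1 + sum(k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)

def first_frame_delay(E, D, P, V, kernel_sizes):
    """FFD = E + D * ceil(R / 2) + P + V  (all times in msec)."""
    R = receptive_field(kernel_sizes)
    return E + D * math.ceil(R / 2) + P + V

# Original postnet: five width-5 convolutions -> R = 21 -> delay term 11 * D.
print(receptive_field([5, 5, 5, 5, 5]))   # 21
# Modified postnet: kernel widths [5, 3, 3, 3] -> R = 11 -> delay term 6 * D.
print(receptive_field([5, 3, 3, 3]))      # 11
# Illustrative timings only:
print(first_frame_delay(E=1.0, D=5.0, P=1.0, V=0.1, kernel_sizes=[5, 3, 3, 3]))
```
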
Finally, to improve synthesis of long sentences we replace location-sensitive attention with dynamic convolutional attention [18].

3. Vocoder parallelization

Our investigations were carried out for LPCNet [8], an autoregressive vocoder based on recurrent neural networks, but in principle the methods described in this section should be applicable (probably with slight modifications) to other recurrent vocoders, e.g. WaveRNN [7].

In general, the main disadvantage of autoregressive vocoders suitable for mobile devices is that they cannot enjoy the benefits of parallel computation: their autoregressive nature makes independent parallel generation difficult. Moreover, such models are usually composed of small compact layers, so using parallel computations for matrix operations also does not make sense, because the overhead of creating and synchronizing threads in this case is too high compared to the speedup gained from parallelization.

However, it is sometimes possible to synthesize some parts of the speech signal in parallel and independently, so that a simple concatenation of the synthesized waves produces a good record with no artifacts at the borders of the parts. We refer to the frames that serve as such borders as splitting frames.

3.1. Splitting frames detection

If two adjacent words in a sentence are separated by a distinct pause, the waveforms corresponding to these words can be considered approximately independent, so synthesizing these two words in parallel should not lead to a loss in sound quality. Likewise, since speech samples that correspond to unvoiced sounds are almost uncorrelated, parallel synthesis of speech parts separated by unvoiced frames is also possible. This simple intuition is the basis of the energy-based criterion of splitting frame detection.

LPCNet inputs are BFCCs [17], so applying an inverse discrete cosine transform to these coefficients yields an approximation of the speech signal log-energy in the neighbourhood of certain frequencies located uniformly on the Bark scale. Frames containing silence are characterized by low overall energy, whereas frames corresponding to unvoiced sounds have most of their energy located at high frequencies (see Figure 1). Thus, frames at which a spectrogram can be divided into nearly independent parts can be detected with the following energy-based criterion: either (1) the overall energy in a frame is less than a threshold B_sil, or (2) the ratio of energy at high frequencies to energy at low frequencies is greater than a threshold B_unv.

Figure 1: Boxes on the spectrogram correspond to silence or unvoiced sounds.
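
As a rough illustration of this criterion, the sketch below flags a frame as a splitting candidate from approximate per-band log-energies. The split between "low" and "high" bands and the threshold values are assumptions made for illustration only, since the paper tunes B_sil and B_unv manually.

```python
# Sketch of the energy-based splitting criterion. It assumes per-frame band
# log-energies recovered from the BFCCs (e.g. via an inverse DCT); thresholds
# and the low/high band boundary are illustrative, not the paper's values.
import numpy as np

def is_splitting_frame(band_log_energy, b_sil=-9.0, b_unv=4.0, n_low_bands=6):
    """band_log_energy: 1-D array of approximate log-energies per Bark band."""
    energy = np.exp(band_log_energy)
    low = energy[:n_low_bands].sum()
    high = energy[n_low_bands:].sum()
    # (1) silence: overall frame energy below the threshold B_sil (log domain)
    if np.log(energy.sum() + 1e-12) < b_sil:
        return True
    # (2) unvoiced: high-frequency energy dominates low-frequency energy
    if high / (low + 1e-12) > b_unv:
        return True
    return False

def splitting_frames(band_log_energies):
    """Return indices of frames where the spectrogram may be split."""
    return [t for t, frame in enumerate(band_log_energies) if is_splitting_frame(frame)]
```
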
Though quite simple and intuitive, the energy-based approach to finding splitting frames requires manual tuning of the thresholds B_sil and B_unv. That is why we decided to try another approach and to solve the splitting frame detection task with neural networks. We chose a compact architecture similar to the LPCNet encoder: two 1D convolution layers with kernel size 3, 32 output channels and tanh activations are followed by a fully-connected layer with 32 units and tanh activation and a final sigmoid layer. The inputs were composed of frame-level acoustic features (BFCCs and a 16-dimensional pitch embedding) and the output was a single number, i.e. the probability that splitting the spectrogram at the current frame does not lead to audible artifacts.
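
One possible rendering of this classifier is sketched below in PyTorch. The padding, the exact input dimensionality, and the way the final sigmoid output is attached are assumptions, as the paper does not fully specify them.

```python
# Sketch of the splitting-frame classifier described above: two 1-D convolutions
# (kernel 3, 32 channels, tanh), a 32-unit fully-connected layer with tanh, and a
# sigmoid output producing one probability per frame.
import torch
import torch.nn as nn

class SplittingFrameClassifier(nn.Module):
    def __init__(self, n_features=34):  # e.g. 18 BFCCs + 16-dim pitch embedding
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.Tanh(),
        )
        self.head = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, features):
        # features: (batch, time, n_features) -> per-frame probability (batch, time)
        x = self.conv(features.transpose(1, 2)).transpose(1, 2)
        return torch.sigmoid(self.head(x)).squeeze(-1)

# Example: probabilities for a batch of 2 utterances, 100 frames each.
probs = SplittingFrameClassifier()(torch.randn(2, 100, 34))
```
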
Below we reformulate the requirement not to introduce audible artifacts in terms of a loss function. We start from the observation that splitting the spectrogram at "wrong" frames always results in one particular type of defect, i.e. a clicking sound at that frame and an apparent vertical stripe at the corresponding segment of the spectrogram (see Figure 2). So, we can detect clicks by comparing two spectrograms corresponding to the speech signal synthesized with and without parallelization. We chose the following function as the measure of the perceived click strength:

µ(S_par, S_std) = max_{i ∈ L∩I} (log E_par^(i))² / log E_std^(i) + max_{i ∈ H∩I} log E_par^(i) / log E_std^(i),    (2)

where we denote the spectrum at the analyzed frame by S, the energy at a certain frequency i by E^(i), the set of low frequencies by L, the set of high frequencies by H, and the set of frequencies i for which log E_par^(i) > log E_std^(i) > 1 by I; the subscripts par and std stand for synthesis with and without parallelization correspondingly.

Figure 2: Splitting frames corresponding to vowels produce clicks (vertical stripes on the spectrogram).

There are a few points we need to make about Formula 2: 1) we divide the frequency range into two sets L and H because we found that in some cases the energy of a click is concentrated either in the high or the low part of the spectrum; 2) our approximation of log-energy obtained from BFCCs isn't always accurate, so we choose the maximum to serve as a robust aggregation function; 3) we square the numerator of the first term to put a higher penalty on low-frequency noise and decrease the perceived strength of the
click; 4) we exclude all cases where the standard version produces higher-energy sound, so we consider only ratios exceeding one; if no term satisfies this condition, the maximum is set to zero.
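
Reading Equation 2 together with points 1-4, the click measure can be sketched as follows. The boundary between the low- and high-frequency sets is an illustrative assumption; the squared numerator in the low-frequency term follows point 3, and the zero fallback follows point 4.

```python
# Sketch of the click measure from Equation 2, comparing per-frequency
# log-energies of the frame synthesized with (par) and without (std)
# parallelization.
import numpy as np

def click_measure(log_e_par, log_e_std, n_low=64):
    """log_e_par, log_e_std: 1-D arrays of log-energies at the analyzed frame."""
    low = np.arange(len(log_e_par)) < n_low            # set L
    high = ~low                                        # set H
    valid = (log_e_par > log_e_std) & (log_e_std > 1)  # set I

    def term(mask, square_numerator):
        idx = mask & valid
        if not idx.any():          # point 4: no ratio exceeds one -> max is zero
            return 0.0
        num = log_e_par[idx] ** 2 if square_numerator else log_e_par[idx]
        return float(np.max(num / log_e_std[idx]))

    # point 3: the numerator of the low-frequency term is squared
    return term(low, square_numerator=True) + term(high, square_numerator=False)
```
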
The resulting loss function penalizes splitting at frames which results in clicks and, on the other hand, encourages splitting at frames where no click is observed:

Loss(x) = p_θ(x) · µ(x) + (1 − p_θ(x)) · C,    (3)

where x denotes the analyzed frame, p_θ(x) is the probability (given by the neural network with parameters θ) that this frame can be used to divide the spectrogram into parts synthesized independently without loss in quality, µ(x) is the measure of the click (calculated by Equation 2) that appears if this frame is chosen as a splitting one, and C is some positive borderline value: if the measure of a click µ(x) is less than C, we regard it as "no click".

To train such a network we create a dataset consisting of pairs of audio records synthesized with and without parallelization. To generate audio with parallelization for this dataset we allocate splitting frames at random. After the network is trained, it is also possible to create a new dataset where splitting frames are allocated in accordance with p_θ(x) rather than randomly and to continue training on the new dataset. However, we found that such a training procedure, which resembles imitation learning [19], does not improve quality.

At test time, frames with p_θ(x) > 0.95 were considered for splitting. Neural models found about 10% more splitting frames than the energy-based algorithm for all languages we tested.

3.2. Cross-fading

Instead of detecting frames that split the spectrogram into independent parts so that the corresponding synthesized waveforms can simply be concatenated without any post-processing, we can focus on improving the post-processing techniques that work well for any choice of splitting frame. This can be done with decent quality if the segments synthesized in parallel are overlapped.

As a baseline we take the linear cross-fading technique [11]. Assume that the vocoder is processing two segments in parallel so that the first segment contains frames 1, .., k while the second one contains frames k, k + 1, .. That is, the two waves overlap in the k-th frame. Concatenating the first wave s^(1) and the second wave s^(2) with linear cross-fading means that in the k-th frame the samples of the resulting wave s are given by:

s_i = (1 − a_i) s_i^(1) + a_i s_i^(2),    i = 0, .., N,    (4)

where N is the number of samples in a frame. In our experiments we took a_i = i/N so that s^(1) fades out uniformly.

An ideal choice of coefficients in the linear combination (4) is known to depend on the correlation between s^(1) and s^(2) (see [11]). Intuitively, it is clear that for highly correlated signals the quality of concatenation with cross-fading is less dependent on the values of the coefficients in Equation 4. In the limiting case when s^(1) is equal to s^(2), the values of these coefficients do not even matter as long as they sum to one. This suggests that cross-fading quality can improve after applying a left shift to the second signal s^(2) such that its correlation with s^(1) increases. This idea is illustrated in Figure 3.

Figure 3: Unmodified waves (left) and the same waves but with a shift applied to one of them (right). Linear cross-fading works better for the two waves on the right.

Thus, we consider linear cross-fading with shift:

m = arg min_{j=0,..,M} Σ_{i=0}^{W} |s_{i+j}^(2) − s_i^(1)|,    (5)

s_i = (1 − i/N)^α s_i^(1) + (i/N)^α s_{i+m}^(2),    i = 0, .., N,    (6)

where M is the maximum possible shift value and W is the size of the window used for calculating the L1 distance between the signals (we found that minimizing L1 distance leads to the same quality as maximizing correlation). We put M = W = N/2 in our experiments. We also introduce an additional parameter α: the larger α is, the longer the first wave s^(1) does not fade out (high cross-fading quality was observed for α ∈ [1; 3]).
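
A compact sketch of Equations 5 and 6 is given below, assuming the last (overlapping) frame of the first segment and the longer second segment are available as NumPy arrays. M = W = N/2 follows the text, while the default α and the exact segment layout are illustrative assumptions.

```python
# Sketch of linear cross-fading with shift: pick the shift m of the second
# segment that minimizes the L1 distance to the first one over a window W
# (Equation 5), then blend the overlapping frame with alpha-shaped weights
# (Equation 6).
import numpy as np

def crossfade_with_shift(s1_tail, s2, alpha=2.0):
    """s1_tail: overlapping frame of the first segment, length N.
    s2: second segment starting at the overlapping frame, length >= 1.5 * N."""
    N = len(s1_tail)
    M = W = N // 2

    # Equation 5: best shift m in [0, M] by L1 distance over the first W samples.
    m = min(range(M + 1), key=lambda j: np.abs(s2[j:j + W] - s1_tail[:W]).sum())

    # Equation 6: blend with weights (1 - i/N)**alpha and (i/N)**alpha.
    i = np.arange(N)
    w1 = (1.0 - i / N) ** alpha
    w2 = (i / N) ** alpha
    blended = w1 * s1_tail + w2 * s2[m:m + N]

    # Cross-faded overlap followed by the rest of the shifted second segment.
    return np.concatenate([blended, s2[m + N:]])
```
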
The code for splitting frame detection (both energy-based and network-based methods) and synthesis with linear cross-fading (both with and without shift) is available at https://github.com/li1jkdaw/LPCNet_parallel/tree/code.

4. Performance evaluation

We performed subjective human evaluation tests on Amazon Mechanical Turk. Four single-speaker datasets were used: we trained models on male Italian and female English, French and Spanish speakers. All the datasets are internal except the English one, which is the LJSpeech dataset [20]. A small portion of the audio records used in our experiments is available at https://li1jkdaw.github.io/LPCNet_parallel.

4.1. A/B testing

In order to check that the methods of vocoder parallelization described in Section 3 do perform well, we conducted a series of A/B tests. In each of these tests the participants were presented with 25 pairs of recordings of the same sentence and asked to choose which one they preferred, in case they heard any difference. In each pair the records were synthesized conditioned on the same spectrogram, since we did not want small variations in the Tacotron2 output to affect the result. The main purpose of the tests was to ensure that the parallelization did not lead to degradation of the sound quality. As we had several approaches to vocoder parallelization, we carried out separate A/B tests for different system design choices. For the strategy involving synthesis of non-overlapping segments we tested two splitting frame detection methods, i.e. the energy-based and the neural
network-based criteria (EB-splitting and NN-splitting respectively). For the alternative strategy involving synthesis of overlapping segments we tested cross-fading with shift as the post-processing technique (XF with shift). In the latter case, we allocated 2 splitting frames per second. Each pair in all tests was evaluated by at least 20 listeners. The tests were performed on English data only.

Table 1: A/B testing results.

Which is better?   Non-parallel   Parallel     Identical
EB-splitting       23.9%          21.5%        54.6%
NN-splitting       21.1%          19.4%        59.5%
XF with shift      14.4%          17.5%        68.1%

Which is better?   W/o shift      With shift   Identical
Cross-fading       7.3%           43.3%        49.4%

The results of these tests are presented in Table 1. In more than half of the cases people noticed no difference between the parallel and non-parallel versions. In the remaining cases the difference between the presented methods was small. To check statistical significance, we applied the sign test [21] and concluded that we cannot reject (at the 95% confidence level) the hypothesis that speech synthesized with the analyzed methods of parallelization has the same quality as speech synthesized in the normal way.

We also performed another A/B test to show that linear cross-fading with shift leads to better sound quality than the same method without shift (XF w/o shift). The sign test applied to the results in Table 1 allows us to reject the hypothesis that cross-fading without shift works at least as well as cross-fading with shift. When a splitting frame corresponds to a vowel, linear cross-fading without shift leads to audible artifacts. In contrast to the previous A/B test with overlapping segments, for this one we allocated 10 splitting frames per second instead of 2 to underline the mentioned problem and check whether adding the shift can fix it.

4.2. MOS evaluation

To evaluate the overall quality and naturalness of the speech synthesized with our TTS system, we launched Mean Opinion Score (MOS) evaluation for four languages. Additionally, we trained a Tacotron2 + WaveRNN system on the English dataset sampled at 22kHz to show that the vocoder optimization methods described in Section 3 can be applied not only to LPCNet or 16kHz synthesis. For each test the TTS system produced no less than 15 audio records.

We asked participants selected according to a geographic criterion to estimate the quality of these records on a five-point Likert scale, i.e. to classify a record as "Bad" (1 point), "Poor" (2 points), "Fair" (3 points), "Good" (4 points) or "Excellent" (5 points). We also included ground truth recordings and special noisy recordings in the tests in order to keep track of the attention of the assessors and prevent random answers. Assessors who gave less than 3 points to the ground truth records or more than 3 points to the noisy records were excluded from the experiment. As in the A/B tests, each piece of audio was evaluated by at least 20 people.

Table 2: Mean Opinion Scores for ground truth records and speech synthesized with different methods.

Dataset   Duration   Vocoder   Ground Truth   Non-parallel   EB-splitting   NN-splitting   XF with shift
English   24 hours   WaveRNN   4.46 ± 0.17    4.08 ± 0.20    4.15 ± 0.22    —              4.22 ± 0.20
English   24 hours   LPCNet    3.98 ± 0.12    3.74 ± 0.10    3.78 ± 0.10    3.71 ± 0.11    3.75 ± 0.11
Italian   23 hours   LPCNet    4.15 ± 0.14    3.45 ± 0.16    3.60 ± 0.16    3.42 ± 0.26    3.56 ± 0.19
French    8 hours    LPCNet    4.46 ± 0.10    3.83 ± 0.17    3.86 ± 0.16    3.88 ± 0.17    3.84 ± 0.19
Spanish   17 hours   LPCNet    4.43 ± 0.12    3.54 ± 0.11    3.52 ± 0.11    3.50 ± 0.12    3.50 ± 0.18

Table 2 demonstrates that all the optimization tricks referenced there perform well in general and, in particular, show the same sound quality as TTS with the non-parallel vocoder. Note that in the case of WaveRNN we used 22kHz audio, which explains the difference between the ground truth MOS in the English tests with LPCNet and WaveRNN.

4.3. Efficiency evaluation

To test overall performance we implemented Tacotron2 and LPCNet (with splitting frame detection by the energy-based criterion) on several mobile devices with 8-core ARM processors: Mediatek MT6762 (4x2.0GHz Cortex-A53 + 4x1.5GHz Cortex-A53), Kirin950 (4x2.3GHz Cortex-A72 + 4x1.8GHz Cortex-A53) and Kirin710 (4x2.2GHz Cortex-A73 + 4x1.7GHz Cortex-A53).

The whole TTS application requires only 12.5Mb of storage (11.4Mb for Tacotron2 and 1.1Mb for LPCNet). All weights are stored as 8-bit numbers. As for speed, we report average values of the Real Time Factor (RTF) and the First Frame Delay (FFD, see Section 2) in Table 3. RTF is defined as the time it takes to synthesize a piece of audio divided by its duration. FFD is measured in milliseconds.
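
As a minimal illustration, RTF can be measured as wall-clock synthesis time divided by the duration of the produced audio; `synthesize` below is a hypothetical placeholder for the whole TTS pipeline, not part of the paper's code.

```python
# RTF sketch: wall-clock synthesis time over audio duration. RTF < 1 means
# faster than real-time synthesis.
import time

def real_time_factor(synthesize, text, sample_rate=16000):
    start = time.perf_counter()
    samples = synthesize(text)           # returns a sequence of audio samples
    elapsed = time.perf_counter() - start
    audio_duration = len(samples) / sample_rate
    return elapsed / audio_duration
```
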
Table 3: Vocoder parallelization efficiency.

            1 thread        2 threads       3 threads
            FFD    RTF      FFD    RTF      FFD    RTF
MT6762      323    1.64     345    1.07     352    0.89
Kirin950    202    1.18     213    0.75     243    0.67
Kirin710    170    1.09     184    0.69     191    0.56

Table 3 shows that using 3 threads for the parallel vocoder gives an almost 2x speedup, resulting in faster than real-time synthesis. At the same time, independent generation of non-overlapping segments introduces an overhead of approximately 5 msec for the detection of splitting frames. Moreover, we should note that since the splitting time is not fixed in advance, the RTF can vary from phrase to phrase.

5. Conclusion

In this work we have presented a text-to-speech system that is suitable for low-to-mid range mobile devices. The optimization techniques that we describe allow this system to run on low-end hardware without any loss in quality. Besides, we investigated several parallelization techniques applicable to autoregressive vocoders and showed that these techniques did not have any negative impact on the synthesized speech, despite the fact that parallelization breaks the correlation between speech samples. Further research can be focused on the development of lightweight TTS vocoders capable of generating any speech segment without splitting frame detection or segment overlap.

6. References

[1] Y. Wang, R. Skerry-Ryan, D. Stanton et al., "Tacotron: Towards end-to-end speech synthesis," ArXiv, 2017. [Online]. Available: https://arxiv.org/abs/1703.10135
[2] A. van den Oord, S. Dieleman, H. Zen et al., "WaveNet: A generative model for raw audio," in 9th ISCA Speech Synthesis Workshop, 2016, pp. 125–125.
[3] J. Shen, R. Pang, R. J. Weiss et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, April 2018, pp. 4779–4783.
[4] A. van den Oord, Y. Li, I. Babuschkin et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 2018, pp. 3918–3926.
[5] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," ArXiv, 2018. [Online]. Available: http://arxiv.org/abs/1807.07281
[6] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019. IEEE, May 2019, pp. 3617–3621.
[7] N. Kalchbrenner, E. Elsen, K. Simonyan et al., "Efficient neural audio synthesis," in Proceedings of the 35th International Conference on Machine Learning, vol. 80. PMLR, 10–15 Jul 2018, pp. 2410–2419.
[8] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019. IEEE, May 2019, pp. 5891–5895.
[9] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018. IEEE, 2018, pp. 2251–2255.
[10] V. Popov, M. Kudinov, and T. Sadekova, "Gaussian LPCNet for multisample speech synthesis," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020. IEEE, 2020.
[11] M. Fink, M. Holters, and U. Zölzer, "Signal-matched power-complementary cross-fading and dry-wet mixing," in Proceedings of the 19th International Conference on Digital Audio Effects (DAFx-16), September 2016, pp. 109–112.
[12] [Online]. Available: https://github.com/fatchord/WaveRNN/issues/9
[13] [Online]. Available: https://ai.googleblog.com/2020/04/improving-audio-quality-in-duo-with.html
[14] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
[15] J. Chorowski, D. Bahdanau, D. Serdyuk et al., "Attention-based models for speech recognition," in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. Cambridge, MA, USA: MIT Press, 2015, pp. 577–585.
[16] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[17] B. Moore, An Introduction to the Psychology of Hearing, 5th ed. Brill, 2012.
[18] E. Battenberg, R. J. Skerry-Ryan, S. Mariooryad et al., "Location-relative attention mechanisms for robust long-form speech synthesis," ArXiv, vol. abs/1910.10288, 2019.
[19] T. Osa, J. Pajarinen, G. Neumann et al., "An algorithmic perspective on imitation learning," Foundations and Trends in Robotics, vol. 7, no. 1–2, pp. 1–179, 2018.
[20] K. Ito, "The LJ Speech Dataset," 2017.
[21] W. Conover, Practical Nonparametric Statistics, 3rd ed. New York, NY: Wiley, 1999.

