
EXPLORING NEURAL TRANSDUCERS FOR END-TO-END SPEECH RECOGNITION

Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li,
Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, Zhenyao Zhu
Baidu Silicon Valley AI Lab

arXiv:1707.07413v1 [cs.CL] 24 Jul 2017

ABSTRACT

In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5’00 benchmark. On our internal diverse dataset, these trends continue: RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models, when all encoder layers are forward only and when encoders downsample the input representation aggressively.

1. INTRODUCTION

In recent years, deep neural networks have advanced the state-of-the-art on large scale automatic speech recognition (ASR) tasks [24, 28, 2]. Deep neural networks can not only extract acoustic features, which are used as inputs to traditional ASR models like Hidden Markov Models (HMMs) [24, 28], but also act as sequence transducers, which results in end-to-end neural ASR systems [2, 6].

One major challenge of sequence transduction is that the input and output sequences differ in length, and both lengths are variable. As a result, a speech transducer has to learn both the alignment and the mapping between acoustic inputs and linguistic outputs simultaneously. Several neural network-based speech models have been proposed over the past years to address this challenge. In this work, we focus on understanding the differences between these transduction mechanisms. Specifically, we compare three transduction models: Connectionist Temporal Classification (CTC) [12], RNN-Transducer [11], and sequence-to-sequence (Seq2Seq) with attention [5, 3]. For the ASR task, these models differ mainly in the assumptions made along three axes:

• Conditional independence between predictions at different time steps, given audio. This is not a reasonable assumption for the ASR task. CTC makes this assumption, but RNN-Transducers and Attention models do not.

• The alignment between input and output units is monotonic. This is a reasonable assumption for the ASR task, which enables models to do streaming transcription. CTC and RNN-Transducers make this assumption, but Attention models¹ do not.

• Hard vs soft alignments. CTC and RNN-Transducer models explicitly treat the alignment between input and output as a latent variable and marginalize over all possible hard alignments, while the attention mechanism models a soft alignment between each output step and every input step. It is unclear whether this matters for the ASR task.

There are no conclusive studies comparing these architectures at scale. In this work, we train all three models on the same datasets using the same methodology, in order to perform a fair comparison. Models which do not assume conditional independence between predictions given the full input (viz. RNN-Transducers, Attention) are able to learn an implicit language model from the training corpus and optimize WER more directly than other models. We find that they therefore perform quite competitively, even outperforming CTC + LM models without the use of an external language model. Among them, RNN-Transducers have the simplest decoding procedure and fewer hyper-parameters to tune.

In the following sections, we first revisit the three models and describe interesting specific details of our implementations. Then, in Section 3, we present our results on the Hub5’00 benchmark (which uses 2000 hours of training data) and our own internal dataset (of 10,000 hours). In Section 4 we study how well the models train when using only forward-only layers, and when we do aggressive pooling in the encoder layers, on the WSJ dataset, while controlling the number of parameters in each model. Section 6 presents related work and Section 7 summarizes the key takeaways and the scope of future work.

¹ Here we focus on vanilla Seq2Seq models with full attention [6, 3], though there have been recent efforts to enforce local and monotonic attention, which typically result in a loss of performance.

Fig. 1. Illustration of probability transitions of the three transducers on an utterance of length 5 labelled as “CAT”: (a) CTC, (b) RNN-Transducer, (c) Attention. The node at t (horizontal axis), u (vertical axis) represents the probability of having output the first u elements of the output sequence by point t in the transcription sequence. The vertical arrow represents predicting multiple characters at one time step (not allowed for CTC). The horizontal arrow represents predicting repeated characters (for CTC) or predicting nothing (for RNN-Transducer). The solid arrows represent hard alignments (for CTC and RNN-Transducer) and soft ones (for Attention). In CTC and RNN-Transducer, states can only move towards the top right one step at a time, while in Attention, all input frames can potentially be attended to at any decoding step.

2. NEURAL SPEECH TRANSDUCERS

A speech transducer is typically composed of an encoder (also known as the acoustic model), which transforms the acoustic inputs into high level representations, and a decoder, which produces linguistic outputs (i.e., characters or words) from the encoded representations. The challenge is that the input and output sequences have variable (and different) lengths, and alignments between them are usually unavailable. Neural transducers therefore have to learn both the classification from acoustic features to linguistic predictions and the alignment between them. Transducer models differ in the formulations of the classifier and the aligner.

More formally, given the input sequence x = (x_1, ..., x_T) of length T, and the output sequence y = (y_1, ..., y_U) of length U, with each y_u being a V-dimensional one-hot vector, transducers model the conditional distribution p(y|x). The encoder maps the input x into a high level representation h = (h_1, ..., h_{T'}), which can be shorter than the input (T' ≤ T) with time-scale downsampling. The encoder can be built with feed-forward neural networks (DNNs) [16], recurrent neural networks (RNNs) [14], or convolutional neural networks (CNNs) [10]. The decoder defines the alignment(s) a and the mapping from h to y.

2.1. CTC

CTC [12, 2] computes the conditional probability by marginalizing over all possible alignments, and it assumes conditional independence between output predictions at different time steps given aligned inputs. An extra ‘blank’ label, which can be interpreted as no label, is introduced to map h and y to the same length, i.e., an alignment (path) a is obtained by inserting (T' − U) blanks into y. A mapping B : a → y is defined between a and y, which removes all blanks and repeated letters in a. The conditional probability P_CTC(y|x) can be calculated efficiently using a forward-backward dynamic-programming algorithm, as detailed in [12]. Note that the marginalized alignments {a} are both local and monotonic.

    P_CTC(y|x) = \sum_{a ∈ B^{-1}(y)} P(a|h)                          (1)
               = \sum_{a ∈ B^{-1}(y)} \prod_{t=1}^{T'} P(a_t|h_t)     (2)
    P(a_t|h_t) = softmax(a_t, h_t)                                    (3)

where we use the conventional definition of softmax². The CTC output can be decoded by greedily picking the most likely label at each time-step³. To make beam search effective, the conditional independence assumption is artificially broken by the inclusion of a language model, and decoding is then the task of finding the argmax of

    log P_CTC(y|x) + α log P_LM(y) + β wordcount(y)                   (4)

This decoding is approximate, and is performed using beam search, typically with a large beam or lattice [15, 19]. The above equation presents a discrepancy between how these models are trained and tested. To address this, models can be further fine-tuned with a loss function that also incorporates language model information, such as sMBR [24], but the principal issue is still the absence of dependence between predictions.

² softmax(a, v) = exp(v_a) / \sum_{k ∈ V} exp(v_k)
³ Strictly speaking, this finds the most likely alignment, not y, but we find that for a fully trained model P(y|x) is dominated by a single alignment.
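As an illustration of the mapping B and of the decoding objective in Equation 4, the sketch below (plain NumPy, not the decoder used in this work; the blank index, the toy vocabulary, and the scoring weights are assumed for the example) greedily picks the most likely label per frame, collapses repeats, removes blanks, and scores a candidate transcript against an external language model.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank label

def ctc_greedy_decode(log_probs, id_to_char):
    """Greedy CTC decoding: argmax per frame, then apply B
    (collapse repeated labels, drop blanks)."""
    best = np.argmax(log_probs, axis=1)            # most likely label per frame
    collapsed = [a for i, a in enumerate(best)     # collapse repeated labels
                 if i == 0 or a != best[i - 1]]
    labels = [a for a in collapsed if a != BLANK]  # remove blanks
    return "".join(id_to_char[a] for a in labels)

def ctc_lm_score(log_p_ctc, log_p_lm, transcript, alpha=1.0, beta=1.0):
    """Score of Equation 4 for one beam candidate:
    log P_CTC(y|x) + alpha * log P_LM(y) + beta * wordcount(y)."""
    return log_p_ctc + alpha * log_p_lm + beta * len(transcript.split())

# toy example: 4 frames, vocabulary {blank, 'c', 'a', 't'}
id_to_char = {1: "c", 2: "a", 3: "t"}
log_probs = np.log(np.array([
    [0.1, 0.7, 0.1, 0.1],   # 'c'
    [0.1, 0.1, 0.7, 0.1],   # 'a'
    [0.6, 0.2, 0.1, 0.1],   # blank
    [0.1, 0.1, 0.1, 0.7],   # 't'
]))
print(ctc_greedy_decode(log_probs, id_to_char))  # -> "cat"
```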
2.2. RNN-Transducer

RNN-Transducer [11, 14] also marginalizes over all possible alignments, like CTC does, while extending CTC by additionally modeling the dependencies between outputs at different timesteps {y_u, u ∈ 1, ..., U}. More specifically, the prediction of y_u at time step u depends not only on the aligned input h but also on the previous predictions {y_<u}.

    P_RT(y|x) = \sum_{a ∈ B^{-1}(y)} P(a|h)                                   (5)
              = \sum_{a ∈ B^{-1}(y)} \prod_{t=1}^{T'} P(a_t|h_t, y_{<u_t})    (6)

where u_t denotes the output timestep aligned to the input timestep t. An extra recurrent network is used to help determine a_t by predicting decoder logits g_u = g(y_{<u_t}), and the conditional distribution at time t is computed by normalizing the summation of h_t and g_{u_t}:

    P(a_t|h_t, y_{<u_t}) = P(a_t|e_{t,u}) = softmax(a_t, e_{t,u})             (7)
    e_{t,u} = f(h_t, g_u)                                                     (8)

f could be any parametric function; we use e_{t,u} = h_t + g_u as in [11]. As in CTC, the marginalized alignments {a} are local and monotonic, and the likelihood of the label can be calculated efficiently using dynamic programming. Decoding uses beam search as in [11], but we do not use length normalization as originally suggested, since we do not find it necessary.
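To make Equations 7 and 8 concrete, the following sketch (NumPy only; the tensor shapes are illustrative and this is not the implementation used in this work) combines every encoder frame h_t with every prediction-network output g_u by addition, as in [11], and normalizes over the output vocabulary plus blank.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over the last axis."""
    z = v - v.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rnnt_joint(h, g):
    """P(a_t | h_t, y_<u) for every (t, u) pair of the transducer lattice.
    h: (T', V+1) encoder logits, g: (U+1, V+1) prediction-network logits.
    e_{t,u} = h_t + g_u as in Equation 8; result has shape (T', U+1, V+1)."""
    e = h[:, None, :] + g[None, :, :]   # broadcast the sum over all (t, u)
    return softmax(e)

# toy sizes: 5 encoder frames, 3 output steps emitted so far, vocab of 4 + blank
T_enc, U, V = 5, 3, 4
h = np.random.randn(T_enc, V + 1)
g = np.random.randn(U + 1, V + 1)
p = rnnt_joint(h, g)
assert np.allclose(p.sum(axis=-1), 1.0)  # a distribution at every lattice node
```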
2.3. Attention Model

The attention model [8, 3, 5] aligns the inputs and outputs using the attention mechanism. Like the RNN-Transducer, the attention model removes the conditional independence assumption on the label sequence that CTC makes. Unlike CTC and the RNN-Transducer, however, it does not assume monotonic alignment, nor does it explicitly marginalize over alignments. It computes p(y|x) by picking a soft alignment between each output step and every input step.

    P_Attn(y|x) = P(y|h) = \prod_{u=1}^{U} P(y_u|c_u, y_{<u})        (9)

where c_u is the context for decoding timestep u, which is computed as the sum of the entire h weighted by α (known as attention).

    c_u = \sum_{t=1}^{T} α_{u,t} h_t                                 (10)
    α_{u,t} = exp(e_{u,t}) / \sum_{t'=1}^{T} exp(e_{u,t'})           (11)
    e_u = f(h, α_{u-1}, g_{u-1})                                     (12)

where g_u is the hidden state of the decoder at decoding step u. There exist different ways [6, 3] to compute e_u. We use a location-aware hybrid attention mechanism in our experiments, which can be described as:

    g_u^{attn} = AttentionRNN(y_{u-1}, g_{u-1}^{attn})               (13)
    e_u = ComputeAttention(h, α_{u-1}, g_u^{attn})                   (14)
    g_u = DecoderRNN(c_u, g_u^{attn}, g_{u-1})                       (15)

The attention mechanism allows the model to attend anywhere in the input sequence at each time step, and thus the alignments can be non-local and non-monotonic. However, this excessive generality comes with a more complicated decoding procedure for the ASR task, since these models can both terminate prematurely and never terminate, by repeatedly attending over the same encoding steps. Therefore, the decoding task finds the argmax of

    log(P_Attn(y|x)) / |y|^γ + β cov(α) + λ log(P_LM(y))             (16)

where γ is the length normalization hyperparameter [27]. The coverage term “cov” encourages the model to attend over all encoder time steps, and stops rewarding repeated attendance over the same time steps. The coverage term addresses both prematurely short as well as infinitely long decodings.
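The soft alignment of Equations 10-11 and the beam-search score of Equation 16 can be sketched as follows (NumPy; the exact form of the coverage term is an assumption made for illustration, since the text above only describes its intent):

```python
import numpy as np

def attention_context(h, scores):
    """Soft alignment of Equations 10-11: normalize per-frame scores e_u into
    attention weights alpha_u, then form the context c_u as a weighted sum
    of all encoder states."""
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # alpha_{u,t}, sums to 1 over t
    c_u = alpha @ h                      # c_u = sum_t alpha_{u,t} h_t
    return c_u, alpha

def attention_beam_score(log_p_attn, alphas, y_len, log_p_lm,
                         gamma=1.0, beta=0.0, lam=0.0):
    """Beam-search score in the spirit of Equation 16: length-normalized model
    score plus a coverage term and an optional external LM term. Here the
    coverage term counts encoder frames that have received enough cumulative
    attention mass, so repeated attendance over the same frames is not
    rewarded (an illustrative choice, not the paper's exact definition)."""
    coverage = np.count_nonzero(alphas.sum(axis=0) > 0.5)
    return log_p_attn / (y_len ** gamma) + beta * coverage + lam * log_p_lm

# toy example: 6 encoder frames with 8-dimensional states, one decoding step
h = np.random.randn(6, 8)
scores = np.random.randn(6)
c_u, alpha = attention_context(h, scores)
assert np.isclose(alpha.sum(), 1.0)
score = attention_beam_score(-12.3, np.tile(alpha, (4, 1)), y_len=4,
                             log_p_lm=-7.0, gamma=1.0, beta=0.1, lam=0.3)
```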
3. PERFORMANCE AT SCALE

In this section, we compare the performance of the models on a public benchmark as well as our own internal dataset.

The promise of end-to-end models for ASR was the simplification of the training and inference pipelines of speech systems. End-to-end CTC models only simplified the training process; inference still involves decoding with massive language models, which often requires teams to build and maintain complicated decoders. Since attention and RNN-Transducer models implicitly learn a language model from the speech training corpus, rescoring or decoding using language models trained solely on the text of the speech corpus does not contribute to improvements in WER (Table 1). When an external LM trained on more data is available, simply rescoring the final beam (typically small, between 32 and 256 candidates) recovers all the performance difference (Table 3). Decoding and beam search are therefore simplified, can be expressed as neural network operations, and need not support massive language models. This trend is already seen in neural machine translation, where state-of-the-art NMT systems do not typically use an external language model [27].
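Rescoring the final beam with an external language model then amounts to re-ranking a small n-best list with an interpolated score, roughly as in the sketch below (the candidate list, the toy LM, and the interpolation weight are hypothetical, not outputs of our systems):

```python
def rescore_beam(candidates, lm_score, lam=0.3):
    """Re-rank an n-best list from beam search with an external LM.
    candidates: list of (transcript, model_log_prob) pairs;
    lm_score: function returning log P_LM(transcript);
    lam: interpolation weight (an assumed value, tuned on a dev set)."""
    rescored = [(text, model_lp + lam * lm_score(text))
                for text, model_lp in candidates]
    return max(rescored, key=lambda pair: pair[1])[0]

# toy example with a made-up unigram "LM" that favors in-vocabulary words
def toy_lm(text):
    favored = {"play", "the", "black", "eyed", "peas", "songs"}
    return sum(0.0 if w in favored else -2.0 for w in text.split())

beam = [("play the black eye piece songs", -4.1),
        ("play the black eyed peas songs", -4.3)]
print(rescore_beam(beam, toy_lm))  # the LM lifts the correct transcript to the top
```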
3.1. Hub5’00 results

The performance of the models on the Hub5’00 benchmark is presented in Table 1, along with other published results on in-domain data. All of the models in Table 1 use the standard language model that is paired with the dataset, except for the rows marked “NO LM”. Without using any language model, both the attention and RNN-Transducer models outperform the CTC model trained on the same corpus, and are highly competitive with the best results on this dataset. Since the LM is also trained on the same training corpus, rescoring with the LM has little effect on the attention and RNN-Transducer models. We found that beam search for the attention model worked best when using only length normalization (γ = 1, β = 0 in Equation 16). However, as the distribution of errors in Table 2 shows, the RNN-Transducer has no obvious problem with premature termination: the number of deletions is very small even though there is no length normalization. Attention and RNN-Transducer both use a beam width of 32.
    Architecture                        SWBD WER   CH WER
    Published
      Iterated-CTC [29]                 11.3       18.7
      BLSTM + LF MMI [21]               8.5        15.3
      LACE + LF MMI [28]                8.3        14.8
      Dilated convolutions [25]         7.7        14.5
      CTC + Gram-CTC [17]               7.3        14.7
      BLSTM + Feature fusion [23]       7.2        12.7
    Ours
      CTC [17]                          9.0        17.7
      RNN-Transducer
        Beam Search NO LM               8.5        16.4
        Beam Search + LM                8.1        17.5
      Attention
        Beam Search NO LM               8.6        17.8
        Beam Search + LM                8.6        17.8

Table 1. WER comparison against previously published results on the Fisher-Switchboard Hub5’00 benchmark using in-domain data. We only list results using single models here. All of the previous works report WER using language models. We do not leverage any speaker information in our models, though it has been shown to reduce WER in previous works [28, 25].

    Model             WER   Subs   Ins   Dels
    CTC               9.0   5.5    2.5   1.0
    RNN-Transducer    8.1   4.7    2.6   0.8
    Attention         8.6   5.4    1.2   2.0

Table 2. Error distribution for the SWBD slice of Hub5’00.

3.2. DeepSpeech corpus

The DeepSpeech corpus contains about 10,000 hours of speech in a diverse set of scenarios, such as far-field, with background noise, with accents, etc. Additionally, the train and target sets are drawn from different distributions, since we do not have access to large volumes of data from the target distribution. We rely on external language models trained on a significantly larger corpus of text to close the gap between the train and test distributions. This setting therefore provides us the best opportunity to study the impact of language models on attention and RNN-Transducer models.

    Model                               Dev     Test
    CTC [4]
      Greedy decoding                   23.03   -
      Beam search + LM (beam=2000)      15.9    16.44
    RNN-Transducer
      Greedy decoding                   18.99   -
      Beam search (beam=32)             17.41   -
      + LM rescoring                    15.6    16.50
    Attention
      Greedy decoding                   22.67   -
      Beam search (beam=256)            18.71   -
      + Length-norm weight              19.5    -
      + Coverage cost                   18.9    -
      + LM rescoring                    16.0    16.48

Table 3. Comparison of WER obtained by the different transduction models on the DeepSpeech dataset, which has a mismatch between training and test distributions.

    Model                        Prediction
    Ground Truth                 SILENCE
    CTC                          SILENCE
    RNN-Transducer               SILENCE
    Attention                    i want to get to get to get to get to get to get to get to get to do that
    Ground Truth                 play the black eyed peas songs
    CTC
      + Greedy                   lading to black irpen songs
      + Beam Search + LM         leading to black european songs
    RNN-Transducer
      + Greedy                   play the black eye piece songs
      + Beam Search              play the black eye piece songs
      + LM rescore               play the black eyed peas songs
    Attention
      + Greedy                   play the black eyed pea songs
      + Beam Search              play the black eyed pea songs
      + LM rescore               play the black eyed peas songs

Table 4. Samples from decoding the same utterance across different models on the DeepSpeech dev set. We find that a big reason for the relatively worse WER of the attention model can be attributed to a few utterances, like the first one, which contribute a lot to the edit distance. The first example shows only greedy decoding for all the models; the second set shows how the prediction evolves through the various stages of decoding.

On the development set, note that the RNN-Transducer model matches the performance of the best CTC model to within 1.5 WER without any language model, and completely closes the gap by rescoring the resulting beam of only 32 candidates.
Surprisingly, the attention models start from a WER similar to that of the CTC models after greedy decoding, but the two architectures make very different errors. CTC models have a poorer WER mainly because of mis-spellings, but the relatively higher WER of attention models can largely be attributed to noisy utterances. In these cases, the attention models act like a language model and arbitrarily output characters while repeatedly attending over the same encoder time steps. While the coverage term in Equation 16 helps address this issue during beam search, greedy decoding cannot be improved. An example of this situation is shown in Table 4. The monotonic left-to-right decoding of CTC and RNN-Transducers naturally avoids these issues. Further, the coverage term only helps keep the correct answers in the beam, and language model rescoring of the final beam is still required to bring the correct answers back to the top.

3.3. Experimental details

Data specification. Throughout the paper, all audio data is sampled at 16kHz and normalized to a constant power. Log-linear or log-mel spectrograms (the specific type of featurization is a hyper-parameter we tune over) are extracted with a hop size of 10ms and a window size of 20ms, and then globally normalized so that each input spectrogram bin has zero mean and unit variance. We do not use speaker information in any of our models. Every epoch, 40% of the utterances are randomly selected to have background noise added.
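A rough sketch of this featurization is shown below (NumPy only; the Hann window and the log-linear variant are assumptions here, since the choice of featurization is itself a tuned hyper-parameter):

```python
import numpy as np

def log_spectrogram(audio, sample_rate=16000, window_ms=20, hop_ms=10):
    """Log-linear spectrogram: 20 ms Hann windows with a 10 ms hop."""
    win = int(sample_rate * window_ms / 1000)      # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)         # 160 samples at 16 kHz
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop: i * hop + win] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1)) ** 2
    return np.log(spec + 1e-10)                    # (n_frames, win // 2 + 1)

def global_normalize(features, mean, std):
    """Normalize each spectrogram bin to zero mean and unit variance, using
    statistics estimated over the whole training set."""
    return (features - mean) / (std + 1e-8)

# toy example: one second of noise at 16 kHz -> about 100 frames
audio = np.random.randn(16000)
feats = log_spectrogram(audio)
print(feats.shape)  # (99, 161)
```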
All models in Table 1 were trained on the standard Fisher-Swbd dataset comprising the LDC corpora (97S62, 2004S13, 2004T19, 2005S13, 2005T19). We use a portion of the RT02 corpus (2004S11) for hyper-parameter tuning. The language model used for decoding the CTC model, as well as for rescoring the other models, is the same 4-gram LM available for this benchmark from the Kaldi recipe [20]. The language model used by all models in Table 3 is built from a sample of the Common Crawl dataset [26].

Model specification. All models in Tables 1 and 3 are tuned independently of each other: we perform a random search over encoder and decoder sizes, amount of pooling, minibatch size, choice of optimizer, and learning and annealing rates. Further, no constraints are placed on any model in terms of number of parameters, wall clock time, or otherwise.

The training procedure mainly follows [2] and uses SortaGrad, and all models use bidirectional ReLU GRU encoders with batch normalization through depth⁵, and may use a convolutional front-end. In shorthand, [2x2D-Conv (2), 3x2560 GRU] represents a stack of 2 layers of 2D convolution followed by a stack of 3 layers of 2560 bidirectional ReLU GRU cells; “(2)” indicates that the layer downsamples the input by 2 along the time dimension. In this shorthand, the best CTC model is [2x2D-Conv (2), 3x2560 GRU], the best RNN-Transducer’s encoder is [2x2D-Conv (2), 4x2048 GRU] and its decoder is [3x1024 Fwd-GRU]. The best attention model works best without a convolutional front-end; its encoder is [4x2560 GRU (4)] and its decoder is [1x512 Fwd-GRU]. All models therefore have about 120M parameters. All models were trained with a minibatch of 512 on 16 M40 GPUs using synchronous SGD, and typically converge to the final solution within 70k iterations.

⁵ We also find that these encoder layers can be replaced with LSTM layers with tanh activation, weight noise, and no batch normalization. In most cases, only 512 LSTM cells with weight noise can match the performance of the large un-regularized GRU cells with batch normalization.
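To make the shorthand concrete, here is a minimal PyTorch sketch of a [2x2D-Conv (2), 3x2560 GRU] style encoder. The channel counts, kernel sizes, and strides are illustrative assumptions, and this is not the configuration or framework used in this work (which also uses ReLU GRUs with batch normalization through depth).

```python
import torch
import torch.nn as nn

class ConvGRUEncoder(nn.Module):
    """Sketch of a [2x2D-Conv (2), 3x2560 GRU] style encoder: a 2D-convolutional
    front-end that downsamples the time dimension by 2, followed by a stack of
    bidirectional GRU layers. Channel counts and kernel sizes are illustrative."""

    def __init__(self, n_freq_bins=161, hidden=2560, gru_layers=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ReLU(),
        )
        # frequency bins remaining after the two stride-2 reductions in frequency
        freq_out = ((n_freq_bins + 1) // 2 + 1) // 2
        self.gru = nn.GRU(32 * freq_out, hidden, num_layers=gru_layers,
                          batch_first=True, bidirectional=True)

    def forward(self, spectrogram):
        # spectrogram: (batch, time, freq)
        x = spectrogram.unsqueeze(1)                    # add a channel dimension
        x = self.conv(x)                                # (batch, 32, ~time/2, ~freq/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # flatten channels x freq per frame
        h, _ = self.gru(x)                              # (batch, ~time/2, 2*hidden)
        return h

# toy usage: a batch of 2 utterances, 100 frames, 161 spectrogram bins
enc = ConvGRUEncoder(hidden=256)   # smaller hidden size just for the example
h = enc(torch.randn(2, 100, 161))
print(h.shape)                     # torch.Size([2, 50, 512])
```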
4. IMPACT OF ENCODER ARCHITECTURE

In this section, we use the standard WSJ dataset to understand how the models perform with different encoding choices. Since encoder layers are far away from the loss functions we are evaluating, one would expect that an encoder that works well with CTC would also perform well with attention and the RNN-Transducer. However, different training targets allow for different kinds of encoders: in particular, 1) the amount of downsampling in the encoder is an important factor that impacts both training wall clock time and the accuracy of the model, and 2) encoders with forward-only layers allow for streaming decoding, so we also explore that aspect. We believe that these results on the smaller and more uniform dataset should still hold at scale, and therefore focus on the trends rather than on optimizing for WER.

We control all the models in this section to have 4 layers of 256 bidirectional LSTM cells in the encoder, with weight noise. We perform random search over pooling in the encoder, whether to use a convolutional front-end, data augmentation, weight noise, and optimization hyper-parameters. We report the best numbers within the first 60k iterations of training⁶. This search over the hyper-parameter space has allowed us to match previously published results. The attention model in Table 5 has a WER of 17.4 after beam search on the WSJ dev’93 set, which matches the previously published result (17.9) of [9]. Similarly, the CTC model has better results than reported in [13]. We therefore believe that this provides a good baseline to explore the trade-offs in modeling choices.

⁶ Better results are observed for all models if they are trained for 400k iterations (e.g., a WER of 15.72 for the attention model after beam search on the WSJ dev’93 set), but the conclusions of the comparison remain unchanged.

4.1. Forward-only encoders

Streaming transcription is an important requirement for ASR models. The first step towards deploying these models in this setting is to replace the bidirectional layers with forward-only recurrent layers. Note that while this immediately makes CTC and RNN-Transducer models deployable, attention models still need to process the entire utterance before outputting the first character. Alternatives have been proposed to circumvent this issue [22, 1] and build attention models with monotonic attention and streaming decoders, but none of them are able to completely match the performance of full attention models.
Nevertheless, we believe a comparison with full-attention models is important for us to find out whether full attention over the entire audio provides additional performance or improves training. In our experiments, we replace every layer of 256 bidirectional LSTM cells in the encoder with a layer of 512 forward-only LSTM cells.

    Model              Bidirectional               Forward-only
                       Greedy No LM   Beam + LM    Beam + LM
    CTC                15.73          10.08        13.78
    RNN-Transducer     15.29          14.05        22.38
    Attention          14.99          14.07        19.19

Table 5. WER of baseline models on the WSJ eval’92 set. On smaller datasets, RNN-Transducers and attention models do not have enough data to learn a good implicit language model and therefore perform worse than CTC even after rescoring with an external LM (RNN-Transducers and attention models learn a better implicit language model at scale, as shown in Tables 1 and 3).

From Table 5, we find that CTC models are significantly more stable, easier to train, and perform better in the forward-only setting. Also, since the attention models are quite a bit better than the RNN-Transducer models, full attention over all encoder time steps seems to be valuable.

4.2. Downsampling in the encoder

Fig. 2. Effect of increasing the frame-rate on WER.

One effective way to control both the memory usage and the training time of these models is to compress along the time dimension in the encoder, so that the recurrent layers are unrolled over fewer time-steps. Previous results have shown that CTC models work best at 50 steps per second of audio [2] (a 2× reduction, since spectrograms are often made at 100 steps per second of audio), and attention models work best at about 12 steps per second of audio [6]. So given the same encoder architecture, the final encoder layer of an attention model with 3 layers of pyramidal pooling has 4× less compute compared to a CTC model. This is important since the attention now only needs to be computed over a small number of encoder time steps. Since RNN-Transducers and attention models can output multiple characters for the same encoder timestep, we expect RNN-Transducers to be as robust as attention models as we increase the amount of pooling in the encoder. While Figure 2 shows that they are fairly robust compared to the CTC models, we find that attention models are significantly more robust. In addition, we have successfully trained attention models with up to 5 layers of pooling, a 32× reduction in the encoder, which forces the model to compress one second of audio into only about 3 encoder steps.
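The frame-rate arithmetic behind these choices is simple: spectrograms start at 100 frames per second, and each pooling-by-2 layer halves the encoder frame rate, as in the short sketch below.

```python
def encoder_steps_per_second(pooling_layers, frame_rate=100):
    """Encoder time steps per second of audio after halving the frame rate
    once per pooling layer (spectrograms start at 100 frames per second)."""
    return frame_rate / (2 ** pooling_layers)

for layers in (1, 3, 5):
    print(layers, encoder_steps_per_second(layers))
# 1 -> 50.0   (the regime where CTC models work best)
# 3 -> 12.5   (about 12 steps per second, where attention models work best)
# 5 -> 3.125  (about 3 encoder steps per second of audio)
```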
5. ALIGNMENT VISUALIZATION

The three transduction models formulate the alignments between input and output in different ways. CTC and RNN-Transducer models explicitly treat alignment as a latent variable and marginalize over all possible hard alignments, while the attention mechanism models a soft alignment between each output step and every input step. In addition, RNN-Transducer and attention models allow for producing multiple characters by reading the same input locations, while CTC can only produce one.

Here, we visualize the alignments learned by the three models to understand the formulation made by each model. Figure 3 plots the alignment for one utterance from the WSJ dev set. Since the alignment is computed based on the ground-truth text (instead of predictions), all three models produce reasonable alignments, which are in particular monotonic for Attention. Several notable observations are listed below:

• We can see small jumps along the x-axis in the left subfigure, as CTC inserts blanks into the output labels in order to align with the inputs.

• Multiple attending (producing characters) over the same input (the same column) can be found in the RNN-Transducer (middle) and Attention (right) models.

• The alignments computed by CTC and RNN-Transducer are more concentrated (or peaky) compared to those of Attention. In addition, the attention model produces diffuse distributions at the beginning of the audio.
Fig. 3. Visualization of learned alignments for the same utterance using CTC (left), RNN-Transducer (middle), and Attention (right). The alignments are between the ground-truth text (y-axis) and the audio features fed into the decoder (x-axis). Note that Attention does two more time-scale downsamplings, which results in 4× shorter sequences (x-axis) compared to the other two.

6. RELATED WORK

Segmental RNNs [18] provide an alternative way to model the ASR task. Segmental RNNs model P(y|x) using a zeroth-order CRF. While global normalization helps address the label bias issue in CTC, we believe that the bigger issue is still the conditional independence assumptions made by both CTC and Segmental RNNs.

[5, 8, 3] directly compare the WERs of attention models with those of the CTC and RNN-Transducer models listed in the original papers, without any control over either the acoustic models or the optimization methodology. [7] did an initial controlled comparison over several speech transduction models, but only presents results on a small dataset, TIMIT.

There is also some recent effort [22, 1] in introducing local and monotonic constraints into attention models, especially for online applications. These efforts would in theory bridge the modelling assumptions between attention and RNN-Transducer models. With these constraints, the fitting capability of attention models would be limited, but they might be more robust to noisy test data in return. In other words, such attention models could work without extra tricks during beam search decoding, e.g., the coverage penalty.

7. CONCLUSION AND FUTURE WORK

We present a thorough comparison of three popular models for the end-to-end ASR task at scale, and find that in the bidirectional setting, all three models perform roughly the same. However, these models differ in the simplicity of their training and decoding pipelines. Notably, end-to-end models trained with the CTC loss simplify the training process but still need to be decoded with large language models. RNN-Transducers and attention models also simplify the decoding process, and require language models to be introduced only in a post-processing stage to be equally if not more effective. Between these two, RNN-Transducers have the simplest decoding process, with no extra hyper-parameter tuning for decoding, which leads us to believe that RNN-Transducers represent the next generation of end-to-end speech models. In attempting to train RNN-Transducer models with the streaming constraint, and to reduce computation in the encoder layers, we find that CTC and attention models still have strengths that we aim to leverage in our future work on RNN-Transducers.

8. ACKNOWLEDGEMENTS

We would like to thank Xiangang Li of the Baidu Speech Technology Group for feedback about the work and for helping improve the draft.

9. REFERENCES

[1] Roee Aharoni and Yoav Goldberg. Sequence to sequence transduction with hard monotonic attention. arXiv preprint arXiv:1611.01487, 2016.

[2] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.

[3] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. arXiv preprint arXiv:1508.04395, 2015.

[4] Eric Battenberg, Rewon Child, Adam Coates, Christopher Fougner, Yashesh Gaur, Jiaji Huang, Heewoo Jun, Ajay Kannan, Markus Kliegl, Atul Kumar, et al. Reducing bias in production speech models. arXiv preprint arXiv:1705.04400, 2017.

[5] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

[6] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

[7] Chung-Cheng Chiu, Dieterich Lawson, Yuping Luo, George Tucker, Kevin Swersky, Ilya Sutskever, and Navdeep Jaitly. An online sequence-to-sequence model for noisy speech recognition. arXiv preprint arXiv:1706.06428, 2017.
[8] Jan Chorowski, Dzmitry Bahdanau, Dmitry Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503, 2015.

[9] Jan Chorowski and Navdeep Jaitly. Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695, 2016.

[10] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2Letter: an end-to-end convnet-based speech recognition system. arXiv preprint arXiv:1609.03193, 2016.

[11] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

[12] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376. ACM, 2006.

[13] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764–1772, 2014.

[14] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.

[15] Awni Y. Hannun, Andrew L. Maas, Daniel Jurafsky, and Andrew Y. Ng. First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873, 2014.

[16] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(November):82–97, 2012.

[17] Hairong Liu, Zhenyao Zhu, Xiangang Li, and Sanjeev Satheesh. Gram-CTC: Automatic unit selection and target decomposition for sequence labelling. arXiv preprint arXiv:1703.00096, 2017.

[18] Liang Lu, Lingpeng Kong, Chris Dyer, Noah A. Smith, and Steve Renals. Segmental recurrent neural networks for end-to-end speech recognition. In INTERSPEECH, 2016.

[19] Yajie Miao, Mohammad Gowayyed, and Florian Metze. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pages 167–174. IEEE, 2015.

[20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, K. Veselý, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, and G. Stemmer. The Kaldi speech recognition toolkit. In ASRU, 2011.

[21] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In INTERSPEECH, pages 2751–2755, 2016.

[22] Colin Raffel, Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. Online and linear-time attention by enforcing monotonic alignments. arXiv preprint arXiv:1704.00784, 2017.

[23] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, et al. English conversational telephone speech recognition by humans and machines. arXiv preprint arXiv:1703.02136, 2017.

[24] Andrew W. Senior, Hasim Sak, Felix de Chaumont Quitry, Tara N. Sainath, and Kanishka Rao. Acoustic modelling with CD-CTC-sMBR LSTM RNNs. In ASRU, 2015.

[25] Tom Sercu and Vaibhava Goel. Dense prediction on sequences with time-dilated convolutions for speech recognition. arXiv preprint arXiv:1611.09288, 2016.

[26] Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. Dirt cheap web-scale parallel text from the Common Crawl. In ACL (1), pages 1374–1383, 2013.

[27] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[28] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.

[29] Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. Advances in all-neural speech recognition. arXiv preprint arXiv:1609.05935, 2016.
