
End-to-end Neural Architectures for ASR

Lecture 15, CS 753
Instructor: Preethi Jyothi
Connectionist Temporal Classification (CTC): Recap

• The CTC objective function is the probability of an output label sequence y given an utterance x, obtained by summing over all possible alignments for y provided by B^{-1}(y):

CTC(x, y) = Pr(y | x) = \sum_{a \in B^{-1}(y)} Pr(a | x) = \sum_{a \in B^{-1}(y)} \prod_{t=1}^{T} Pr(a_t | x)

• An efficient forward-backward algorithm computes this loss function and its gradients

[Figure: a lattice over the symbols {h, e, l, o, ϵ} across time steps; example alignments such as "h e ϵ l l ϵ l l o o" and "h h e l l ϵ ϵ l ϵ o" collapse to the same output "hello".]

Image from: https://distill.pub/2017/ctc/
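As a concrete illustration of the collapsing function B referred to above, here is a minimal Python sketch (not from the lecture; names are illustrative): it merges repeated symbols and then removes blanks, so that several alignments map to the same label sequence.

```python
# Minimal sketch of the collapsing function B: merge consecutive repeats, drop blanks (ϵ).
def collapse(alignment: list[str], blank: str = "ϵ") -> str:
    merged = []
    for symbol in alignment:
        if not merged or symbol != merged[-1]:
            merged.append(symbol)                      # merge consecutive repeats
    return "".join(s for s in merged if s != blank)    # then remove blanks

# Two of the alignments from the slide collapse to the same output "hello":
print(collapse(["h", "e", "ϵ", "l", "l", "ϵ", "l", "l", "o", "o"]))  # hello
print(collapse(["h", "h", "e", "l", "l", "ϵ", "ϵ", "l", "ϵ", "o"]))  # hello
```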
Illustration: Forward Algorithm to compute α_t(j)

α_t(j) = \sum_{i=j-2}^{j} α_{t-1}(i) \, a_{ij} \, b_t(y'_j)

where

• b_t(y'_j) is the probability given by the NN to the symbol y'_j at time t, for t = 1…T, when |x| = T

• y'_j = y_{j/2} if j is even, ϵ otherwise (j = 1…2l+1, when |y| = l)

• a_{ij} = 1 if i = j or i = j−1; a_{ij} = 1 if i = j−2 and y'_j ≠ y'_{j−2}; a_{ij} = 0 otherwise

CTC(x, y) = Pr(y | x) = \sum_{a \in B^{-1}(y)} Pr(a | x) = α_T(2l) + α_T(2l+1)

[Figure: trellis over time steps x_1 … x_6 and the extended symbol sequence (ϵ, a, ϵ, b, ϵ), showing the allowed transitions.]

Image from: https://distill.pub/2017/ctc/
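Below is a rough Python sketch of this forward recursion (an illustration, not the lecture's code): `log_probs` stands in for the network's per-frame log-probabilities, `y` for the label sequence, and `blank` for ϵ. It builds the extended sequence y′, applies the transition rule a_{ij} above, and works in log space for numerical stability, returning log CTC(x, y).

```python
import numpy as np

def ctc_forward_log_prob(log_probs: np.ndarray, y: list[int], blank: int = 0) -> float:
    """log_probs: (T, V) per-frame log-probabilities; y: non-empty target label indices."""
    T, _ = log_probs.shape
    # Extended sequence y' = (ϵ, y_1, ϵ, y_2, ..., y_l, ϵ), of length 2l + 1
    y_prime = [blank]
    for label in y:
        y_prime += [label, blank]
    L = len(y_prime)

    alpha = np.full((T, L), -np.inf)
    # Initialization: at t = 1 we may start in the first ϵ or in the first label
    alpha[0, 0] = log_probs[0, y_prime[0]]
    alpha[0, 1] = log_probs[0, y_prime[1]]

    for t in range(1, T):
        for j in range(L):
            # Allowed transitions a_ij: stay (i = j), advance (i = j - 1),
            # or skip (i = j - 2) only when y'_j != y'_{j-2}
            terms = [alpha[t - 1, j]]
            if j - 1 >= 0:
                terms.append(alpha[t - 1, j - 1])
            if j - 2 >= 0 and y_prime[j] != y_prime[j - 2]:
                terms.append(alpha[t - 1, j - 2])
            alpha[t, j] = np.logaddexp.reduce(terms) + log_probs[t, y_prime[j]]

    # log CTC(x, y) = log(α_T(2l) + α_T(2l+1)); with 0-based indexing, the last two entries
    return np.logaddexp(alpha[T - 1, L - 1], alpha[T - 1, L - 2])
```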


CTC vs. LAS

• CTC works well for end-to-end ASR systems

• CTC makes the assumption that the network outputs at different time steps are conditionally independent given the inputs

• The Listen, Attend and Spell (LAS) network makes no independence assumptions about the probability distribution of the output sequences given the input

• LAS is based on the sequence-to-sequence with attention framework. Here x = (x_1, …, x_T) is the input sequence of filter-bank spectra features and y = (⟨sos⟩, y_1, …, y_S, ⟨eos⟩) is the output character sequence, with y_i ∈ {a, …, z, 0, …, 9, ⟨space⟩, ⟨comma⟩, ⟨period⟩, ⟨apostrophe⟩, ⟨unk⟩}; ⟨sos⟩ and ⟨eos⟩ are the special start- and end-of-sentence tokens, and ⟨unk⟩ covers unknown tokens such as accented characters

• LAS models each character output y_i as a conditional distribution over the previous characters y_{<i} and the input signal x, using the chain rule for probabilities:

P(y | x) = \prod_i P(y_i | x, y_{<i})     (1)

• This objective makes the model a discriminative, end-to-end model, because it directly predicts the conditional probability of character sequences given the acoustic signal

[LAS]: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
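To make the chain-rule factorization concrete, here is a tiny sketch (illustrative only; `next_char_log_probs` is a hypothetical stand-in for a trained decoder) that scores a character sequence by summing per-step conditional log-probabilities.

```python
# log P(y | x) = Σ_i log P(y_i | x, y_<i), as in Eq. (1).
def sequence_log_prob(x, y: str, next_char_log_probs) -> float:
    total = 0.0
    for i, ch in enumerate(y):
        log_p = next_char_log_probs(x, y[:i])   # dict of log P(char | x, y_<i)
        total += log_p[ch]
    return total
```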
Sequence to sequence models

Encoder-decoder architecture

[Figure: an ENCODER consumes the input data and a DECODER generates the output sequence.]
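A minimal encoder-decoder sketch in PyTorch may help fix ideas (sizes and names are arbitrary choices, not the lecture's): an LSTM encoder summarizes the input frames and its final state seeds an LSTM decoder over characters.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, feat_dim: int = 40, hidden: int = 256, vocab: int = 30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x, y_in):
        # Encode the whole utterance; the final encoder state seeds the decoder.
        _, (h_n, c_n) = self.encoder(x)                            # x: (B, T, feat_dim)
        dec_out, _ = self.decoder(self.embed(y_in), (h_n, c_n))    # y_in: (B, S) char ids
        return self.out(dec_out)                                   # (B, S, vocab) logits

model = Seq2Seq()
logits = model(torch.randn(2, 100, 40), torch.randint(0, 30, (2, 12)))
```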
Sequence to sequence models

Encoder-decoder with attention

[Figure: the DECODER state s_i attends over the ENCODER states h_1, …, h_j, …, h_M with weights α_{i1}, …, α_{ij}, …, α_{iM}.]

How do we compute α_{ij}?

c_i = \sum_j α_{ij} h_j
Sequence to sequence models

Encoder-decoder with attention

e_{ij} = f(s_i, h_j),  e.g.  e_{i1} = f(s_i, h_1)        (f could be a neural network)

α_{ij} ← softmax-normalized(e_{ij}),  so that α_{ij} ∈ [0, 1]

c_i = \sum_j α_{ij} h_j

[Figure: the decoder state s_i is scored against each encoder state h_1, …, h_M to produce energies e_{i1}, …, e_{iM}; a SOFTMAX converts them into the attention weights α_{i1}, …, α_{iM}.]
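A small sketch of this attention step, assuming a plain dot product for the scoring function f (the slide notes f could itself be a neural network):

```python
import numpy as np

def attention_context(s_i: np.ndarray, H: np.ndarray):
    """s_i: (d,) decoder state; H: (M, d) encoder states h_1..h_M."""
    e_i = H @ s_i                         # e_ij = f(s_i, h_j), here a dot product
    alpha_i = np.exp(e_i - e_i.max())
    alpha_i /= alpha_i.sum()              # softmax: α_ij ∈ [0, 1], Σ_j α_ij = 1
    c_i = alpha_i @ H                     # c_i = Σ_j α_ij h_j
    return c_i, alpha_i
```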
The Model

• The Listen, Attend & Spell (LAS) architecture is a sequence-to-sequence model consisting of:

• a Listener (Listen): an acoustic model encoder, built from deep BLSTMs with a pyramidal structure (pBLSTM), where each successive stacked pBLSTM layer reduces the time resolution by a factor of 2 (a code sketch of one pBLSTM layer follows this slide). In a typical deep BLSTM, the output at the i-th time step of the j-th layer is

h_i^j = BLSTM(h_{i-1}^j, h_i^{j-1})     (4)

whereas the pBLSTM concatenates the outputs at consecutive steps of each layer before feeding them to the next layer:

h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])     (5)

Stacking 3 pBLSTMs on top of the bottom BLSTM layer reduces the time resolution 2^3 = 8 times, which lets the attention model extract the relevant information from a smaller number of time steps and reduces the O(US) computational complexity of the attention mechanism in the speller.

• a Speller (AttendAndSpell): an attention-based decoder. Consumes h and produces a probability distribution over character sequences:

h = (h_1, …, h_U) = Listen(x)     (2)

P(y_i | x, y_{<i}) = AttendAndSpell(y_{<i}, h)     (3)

Fig. 1: Listen, Attend and Spell (LAS) model: the listener is a pyramidal BLSTM encoding the input sequence x into high-level features h; the speller is an attention-based decoder generating the characters y from h.

Image from: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
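A sketch of one pBLSTM layer as in Eq. (5), with assumed (illustrative) sizes rather than the paper's: consecutive outputs of the layer below are concatenated before the BLSTM, so each layer halves the time resolution.

```python
import torch
import torch.nn as nn

class PBLSTMLayer(nn.Module):
    def __init__(self, input_dim: int, hidden: int):
        super().__init__()
        # The input is a pair of concatenated frames, hence 2 * input_dim.
        self.blstm = nn.LSTM(2 * input_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, h_below: torch.Tensor) -> torch.Tensor:
        B, T, D = h_below.shape
        if T % 2 == 1:                                 # drop a trailing frame so T is even
            h_below = h_below[:, :-1, :]
            T -= 1
        paired = h_below.reshape(B, T // 2, 2 * D)     # concatenate (h_2i, h_2i+1)
        out, _ = self.blstm(paired)                    # (B, T/2, 2 * hidden)
        return out

# Stacking 3 such layers on top of a bottom BLSTM reduces the time resolution 8x.
x = torch.randn(2, 80, 256)
print(PBLSTMLayer(256, 128)(x).shape)                  # torch.Size([2, 40, 256])
```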

Attend and spell

• Produces a distribution over characters conditioned on all the characters seen previously

• At each decoder time-step i, the attention mechanism AttentionContext computes a score for each encoder step u, which is converted into softmax probabilities; these are used as mixing weights for blending the listener features h_u into the context vector c_i

The AttendAndSpell function is computed using an attention-based LSTM transducer. At every output step, the transducer produces a probability distribution over the next character, conditioned on all the characters seen previously. The distribution for y_i is a function of the decoder state s_i and context c_i; the decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1} and the context c_{i-1}; and the context vector c_i is produced by the attention mechanism:

c_i = AttentionContext(s_i, h)     (6)

s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})     (7)

P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)     (8)

where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2-layer LSTM.

The attention model is content based: the contents of the decoder state s_i are matched to the contents of each h_u (representing time step u of h) to compute a scalar energy e_{i,u}; a softmax converts the energies into an attention distribution α_i over time steps; and the listener features h_u are linearly blended with these weights to form c_i:

e_{i,u} = ⟨φ(s_i), ψ(h_u)⟩     (9)

α_{i,u} = exp(e_{i,u}) / \sum_{u'} exp(e_{i,u'})     (10)

c_i = \sum_u α_{i,u} h_u     (11)

where φ and ψ are MLP networks. After training, the α_i distribution is typically very sharp and focuses on only a few frames of h; c_i can be seen as a continuous bag of weighted features of h.

Image from: Chan et al., Listen, Attend and Spell: A NN for LVCSR, ICASSP 2016
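A rough single-step sketch of the speller following Eqs. (6)-(11), with assumed module sizes and a single LSTM cell in place of the paper's 2-layer LSTM; all names here are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SpellerStep(nn.Module):
    def __init__(self, hidden: int = 256, feat: int = 256, vocab: int = 30):
        super().__init__()
        self.phi = nn.Linear(hidden, feat)                 # φ(s_i)
        self.psi = nn.Linear(feat, feat)                   # ψ(h_u)
        self.rnn = nn.LSTMCell(hidden + feat, hidden)      # Eq. (7) state update
        self.embed = nn.Embedding(vocab, hidden)
        self.char_dist = nn.Linear(hidden + feat, vocab)   # CharacterDistribution head

    def forward(self, y_prev, state, c_prev, h):
        # y_prev: (B,) previous characters; state: (s, cell); h: (B, U, feat) listener features.
        s, cell = self.rnn(torch.cat([self.embed(y_prev), c_prev], dim=-1), state)   # Eq. (7)
        e = torch.einsum("bf,buf->bu", self.phi(s), self.psi(h))                     # Eq. (9)
        alpha = torch.softmax(e, dim=-1)                                             # Eq. (10)
        c = torch.einsum("bu,buf->bf", alpha, h)                                     # Eqs. (6), (11)
        log_probs = torch.log_softmax(self.char_dist(torch.cat([s, c], dim=-1)), -1) # Eq. (8)
        return log_probs, (s, cell), c
```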
Training and Decoding

• Training

• Train the parameters of the model to maximize the log probability of the training instances. Specifically,

θ̃ = max_θ \sum_i log P(y_i | x, ỹ_{<i}; θ)     (12)

where ỹ_{i−1} is the ground-truth previous character or a character randomly sampled (with 10% probability) from the model, i.e. from CharacterDistribution(s_{i−1}, c_{i−1}), using the procedure from [20].

• Decoding

• Simple left-to-right beam search: during inference we want to find the most likely character sequence given the input acoustics,

ŷ = argmax_y log P(y | x)     (13)

• Beams can be rescored with a language model
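A minimal sketch of the training objective in Eq. (12) with the 10% sampling trick: at each step the "previous character" fed to the decoder is the ground truth, except that with 10% probability it is sampled from the model's own previous output distribution. `speller_step` is a hypothetical one-step decoder (e.g. the SpellerStep sketch earlier); names are illustrative.

```python
import torch

def choose_prev_char(y_true, log_probs, sample_prob: float = 0.1):
    # With 10% probability, replace the ground-truth previous character by a model sample.
    if torch.rand(()) < sample_prob:
        return torch.distributions.Categorical(logits=log_probs).sample()
    return y_true

def train_loss(speller_step, state, c, h, targets):
    """targets: (B, S) LongTensor with targets[:, 0] = <sos>; returns -Σ_i log P(y_i | x, ỹ_<i)."""
    loss, y_prev = 0.0, targets[:, 0]
    for i in range(1, targets.shape[1]):
        log_probs, state, c = speller_step(y_prev, state, c, h)       # P(y_i | x, ỹ_<i)
        loss = loss - log_probs.gather(1, targets[:, i:i + 1]).sum()  # negative log-likelihood
        y_prev = choose_prev_char(targets[:, i], log_probs)           # 10% sampled, else truth
    return loss
```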
Experiments

Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is the state-of-the-art; the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32. Language Model (LM) rescoring can be beneficial.

Model                 Clean WER   Noisy WER
CLDNN-HMM [22]        8.0         8.9
LAS                   14.1        16.5
LAS + LM Rescoring    10.3        12.0

• The Listen function used 3 layers of BLSTM (512 nodes); AttendAndSpell used a 2-layer LSTM (256 nodes)

• Constraining the beam search with a dictionary had no impact on WER

• Note: the CLDNN uses unidirectional LSTMs and would certainly benefit from a BLSTM architecture; additionally, the LAS model does not use convolutional filters, which have been reported to yield a 5-7% relative WER improvement [22]
Analysis

Table 2: Example 1: "triple a" vs. "aaa" spelling variants.

Beam    Text                                 log P    WER
Truth   call aaa roadside assistance         -        -
1       call aaa roadside assistance         -0.57    0.00
2       call triple a roadside assistance    -1.54    50.00
3       call trip way roadside assistance    -3.50    50.00
4       call xxx roadside assistance         -4.44    25.00

It would be difficult to produce such differing transcripts using CTC, due to its conditional independence assumptions: the distribution of the output y_i at time i is conditionally independent of the distribution of y_{i+1} at time i+1. Conventional DNN-HMM systems would require both spellings to be in the pronunciation dictionary to generate both transcriptions.

Image from: https://arxiv.org/pdf/1508.01211.pdf

Attention Distributions

Fig. 2: Alignments between character outputs and audio signal produced by the Listen, Attend and Spell (LAS) model for the utterance "how much would a woodchuck chuck". The content-based attention mechanism was able to identify the start position in the audio sequence for the first character correctly. The alignment produced is generally monotonic, without a need for any location-based priors.

Image from: https://arxiv.org/pdf/1508.01211.pdf
