End-to-End Neural Architectures for ASR
Lecture 15
CS 753
Instructor: Preethi Jyothi
Connectionist Temporal Classification (CTC): Recap
P(y | x) = ∑_{a ∈ B⁻¹(y)} ∏_{t=1…T} Pr(aₜ | x)

Example alignments that B collapses to "h e l l o":
h h e l l ϵ ϵ l ϵ o
h e ϵ l l ϵ ϵ l o o
Image from: https://distill.pub/2017/ctc/
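The mapping B is just "merge adjacent repeats, then delete ϵ". A minimal Python sketch (the function name collapse is ours, not from the slides):

```python
from itertools import groupby

def collapse(alignment, blank="ϵ"):
    """The CTC mapping B: merge adjacent repeated symbols, then remove blanks."""
    merged = (sym for sym, _ in groupby(alignment))
    return "".join(sym for sym in merged if sym != blank)

print(collapse("hhellϵϵlϵo"))  # -> hello
print(collapse("heϵllϵϵloo"))  # -> hello
```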
Illustration: Forward Algorithm to compute αt(j)

(Figure: trellis of the ϵ-interleaved symbol states j = 1…2l+1 (label rows a, b with ϵ rows in between) against input frames x1 x2 x3 x4 x5 x6.)

αt(j) = ∑_{i=j−2…j} αt−1(i) · aij · bt(y′j)

where

y′j = y_{j/2} if j is even, ϵ otherwise   (j = 1…2l+1 when |y| = l)

bt(y′j) is the probability given by the NN to the symbol y′j at time t

aij = 1 if i = j or i = j − 1
aij = 1 if i = j − 2 and y′j ≠ y′j−2
aij = 0 otherwise

CTC(x, y) = ∑_{a ∈ B⁻¹(y)} Pr(a | x) = αT(2l) + αT(2l + 1)
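A minimal NumPy sketch of this recursion, assuming probs[t, k] holds the per-frame symbol probabilities Pr(k | x) from the network. Indices are 0-based here, so the 1-based states j = 1…2l+1 become 0…2l; a real implementation would work in log space to avoid underflow:

```python
import numpy as np

def ctc_forward(probs, y, blank=0):
    """Forward algorithm for CTC.

    probs : (T, V) array, probs[t, k] = Pr(symbol k at frame t | x)
    y     : target label sequence (symbol ids, no blanks)
    Returns CTC(x, y) = alpha_T(2l) + alpha_T(2l + 1).
    """
    T = probs.shape[0]
    y_prime = [blank]                 # epsilon-interleaved sequence y'
    for s in y:
        y_prime += [s, blank]         # length 2l + 1
    L = len(y_prime)

    alpha = np.zeros((T, L))
    alpha[0, 0] = probs[0, y_prime[0]]        # start in the leading blank
    if L > 1:
        alpha[0, 1] = probs[0, y_prime[1]]    # ... or in the first label

    for t in range(1, T):
        for j in range(L):
            total = alpha[t - 1, j]                      # i = j
            if j >= 1:
                total += alpha[t - 1, j - 1]             # i = j - 1
            if j >= 2 and y_prime[j] != y_prime[j - 2]:
                total += alpha[t - 1, j - 2]             # skip, distinct labels only
            alpha[t, j] = total * probs[t, y_prime[j]]

    # valid paths must end in the last label or the trailing blank
    return alpha[T - 1, L - 1] + (alpha[T - 1, L - 2] if L > 1 else 0.0)
```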
Sequence to sequence models
Encoder-decoder architecture

(Figure: Data → ENCODER → DECODER; the encoder maps the input data to a hidden representation and the decoder generates the output sequence from it.)
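A toy sketch of the idea, assuming vanilla tanh-RNNs for both halves and greedy decoding; all names here (encode_decode, rnn_step, the parameter dict p) are illustrative, not from the lecture:

```python
import numpy as np

def rnn_step(W, U, b, h, x):
    # One vanilla-RNN step: h' = tanh(W h + U x + b)
    return np.tanh(W @ h + U @ x + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode_decode(p, xs, embed, eos, max_len=20):
    """Plain encoder-decoder: the encoder's final state initializes the
    decoder, which then emits one symbol at a time (greedy here)."""
    # ENCODER: fold the whole input sequence into a single hidden state
    h = np.zeros(p["W_e"].shape[0])
    for x in xs:
        h = rnn_step(p["W_e"], p["U_e"], p["b_e"], h, x)
    # DECODER: start from the encoder state; feed each emitted symbol back in
    s, y, out = h, eos, []            # eos doubles as the start symbol here
    for _ in range(max_len):
        s = rnn_step(p["W_d"], p["U_d"], p["b_d"], s, embed[y])
        y = int(np.argmax(softmax(p["W_o"] @ s)))
        if y == eos:
            break
        out.append(y)
    return out

# Toy run with random weights (d = state, k = input, m = embedding, V = vocab):
rng = np.random.default_rng(0)
d, k, m, V = 8, 4, 5, 6
p = {"W_e": rng.normal(0, 0.1, (d, d)), "U_e": rng.normal(0, 0.1, (d, k)),
     "b_e": np.zeros(d), "W_d": rng.normal(0, 0.1, (d, d)),
     "U_d": rng.normal(0, 0.1, (d, m)), "b_d": np.zeros(d),
     "W_o": rng.normal(0, 0.1, (V, d))}
print(encode_decode(p, rng.normal(size=(7, k)), rng.normal(0, 0.1, (V, m)), eos=0))
```

The single vector handed from encoder to decoder is the bottleneck that attention, on the next slide, removes.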
Sequence to sequence models
Encoder-decoder with attention

How do we compute αij?

ci = ∑_j αij hj

(Figure: the DECODER state si attends over ENCODER states h1 … hj … hM with weights αi1 … αij … αiM.)
Sequence to sequence models
Encoder-decoder with attention

eij = f(si, hj), where f could be a neural n/w
αij ← normalized(eij) via a SOFTMAX, so αij ∈ [0, 1]
ci = ∑_j αij hj

(Figure: decoder state si attends over encoder states h1 … hM with weights αi1 … αiM.)
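A NumPy sketch of the three steps (score, normalize, blend). The one-hidden-layer additive scorer and the names W_s, W_h, v are assumptions; the slide only says f could be a neural network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_i, H, W_s, W_h, v):
    """One attention read: e_ij = f(s_i, h_j), alpha_i = softmax(e_i),
    c_i = sum_j alpha_ij h_j.

    s_i : decoder state, shape (d_s,)
    H   : encoder states h_1..h_M stacked, shape (M, d_h)
    """
    e = np.array([v @ np.tanh(W_s @ s_i + W_h @ h_j) for h_j in H])  # scores e_ij
    alpha = softmax(e)        # attention weights, each in [0, 1], summing to 1
    c_i = alpha @ H           # context vector c_i = sum_j alpha_ij h_j
    return c_i, alpha

# Toy usage with random weights:
rng = np.random.default_rng(0)
d_s, d_h, d_a, M = 4, 5, 3, 6
c_i, alpha = attention_context(
    rng.normal(size=d_s), rng.normal(size=(M, d_h)),
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a))
print(alpha.sum())  # 1.0
```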
The Model

• Produces a distribution over characters conditioned on all characters seen previously

From the Listen, Attend and Spell (LAS) paper: the listener is a pyramidal BLSTM encoding the input sequence x into high-level features h, and the speller is an attention-based decoder generating the characters y.

h = Listen(x)  (2)
P(yi | x, y<i) = AttendAndSpell(y<i, h)  (3)

(Figure: LAS architecture; the listener consumes x1 … xT, the speller attends over its output h.)

Attend and spell: the AttendAndSpell function is computed using an attention-based LSTM transducer [10, 12]. At every output step, the transducer produces a probability distribution over the next character conditioned on all the characters seen previously:

ci = AttentionContext(si, h)  (6)
si = RNN(si−1, yi−1, ci−1)  (7)
P(yi | x, y<i) = CharacterDistribution(si, ci)  (8)

where CharacterDistribution is an MLP with softmax outputs over characters, and where RNN is a 2 layer LSTM. The distribution for yi is a function of the decoder state si and context ci; the decoder state si is a function of the previous state si−1, the previously emitted character yi−1 and the context ci−1.

At each time step i, the attention mechanism AttentionContext generates a context vector ci encapsulating the information in the acoustic signal needed to generate the next character. The attention model is content based: the contents of the decoder state si are matched to the contents of hu representing time step u of h, to generate an attention vector αi. The vectors hu are linearly blended using αi to create ci.
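A sketch of one speller step, equations (6)-(8), with simplified stand-ins: a single tanh-RNN replaces the paper's 2-layer LSTM, plain dot-product attention replaces AttentionContext (any content-based scorer, such as the earlier sketch, works), and an affine layer plus softmax plays CharacterDistribution. All parameter names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_attention(s_i, H):
    # stand-in AttentionContext: dot-product scores (needs matching dims)
    alpha = softmax(H @ s_i)
    return alpha @ H, alpha

def speller_step(p, s_prev, y_prev_embed, c_prev, H, attend=dot_attention):
    """One AttendAndSpell step, eqs. (6)-(8), with simplified stand-ins."""
    # (7) s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})
    inp = np.concatenate([y_prev_embed, c_prev])
    s_i = np.tanh(p["W_s"] @ s_prev + p["U_s"] @ inp + p["b_s"])
    # (6) c_i = AttentionContext(s_i, h)
    c_i, _ = attend(s_i, H)
    # (8) P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)
    probs = softmax(p["W_y"] @ np.concatenate([s_i, c_i]) + p["b_y"])
    return s_i, c_i, probs

# Toy usage (d = state/feature size, m = char embedding, V = characters):
rng = np.random.default_rng(0)
d, m, V, M = 6, 4, 5, 10
H = rng.normal(size=(M, d))
p = {"W_s": rng.normal(0, 0.1, (d, d)), "U_s": rng.normal(0, 0.1, (d, m + d)),
     "b_s": np.zeros(d), "W_y": rng.normal(0, 0.1, (V, 2 * d)), "b_y": np.zeros(V)}
s, c, probs = speller_step(p, np.zeros(d), np.zeros(m), np.zeros(d), H)
print(probs.sum())  # 1.0
```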
• At each decoder time-step i, AttentionContext computes a score for each encoder step u, which is converted into softmax probabilities used as mixing weights for blending the listener features hu into the context vector ci

Specifically, at each decoder timestep i, the AttentionContext function computes the scalar energy ei,u for each time step u, using vector hu ∈ h and si. The scalar energy ei,u is converted into a probability distribution over time steps (or attention) αi using a softmax function:

ei,u = ⟨φ(si), ψ(hu)⟩  (9)
αi,u = exp(ei,u) / ∑u′ exp(ei,u′)  (10)
ci = ∑u αi,u hu  (11)

where φ and ψ are MLP networks. After training, the αi distribution is typically very sharp and focuses on only a few frames of h; ci can be seen as a continuous bag of weighted features of h. Figure 1 of the paper shows the LAS architecture.

The Listen function uses a Bidirectional Long Short Term Memory (BLSTM) with a pyramidal structure. This modification reduces the length U of h from T, the length of the input: input speech signals can be hundreds to thousands of frames long, and a direct application of BLSTM for the Listen function converged slowly and produced inferior results even after a month of training time, because AttendAndSpell has a hard time extracting the relevant information from a large number of input time steps.
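In the paper, each pyramidal BLSTM layer concatenates pairs of consecutive outputs of the layer below before running its own BLSTM, halving the time resolution per layer. A sketch of just that reduction (the BLSTM itself is elided; shapes are illustrative):

```python
import numpy as np

def pyramid_reduce(H):
    """Halve the time resolution by concatenating consecutive frame pairs.

    H : (T, d) outputs of the previous (B)LSTM layer.
    Returns a (T // 2, 2 * d) array; each row would feed the next pBLSTM layer.
    """
    T, d = H.shape
    T -= T % 2                       # drop a trailing odd frame
    return H[:T].reshape(T // 2, 2 * d)

# Three reductions take T = 1024 frames down to U = 128:
H = np.zeros((1024, 256))
for _ in range(3):
    H = pyramid_reduce(H)            # (512, 512) -> (256, 1024) -> (128, 2048)
print(H.shape)
```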
Training and Decoding

• Training
• Train the parameters of the model to maximize the log probability of the training instances:

θ̃ = max_θ ∑_i log P(yi | x, ỹ<i; θ)  (12)

• where ỹi−1 is the ground truth previous character or a character randomly sampled (with 10% probability) from the model, i.e. CharacterDistribution(si−1, ci−1), using the procedure from [20]
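A sketch of that sampling trick: at each output step, the "previous character" fed to the decoder is the ground truth 90% of the time and a sample from the model's own distribution 10% of the time. The function name previous_char is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def previous_char(y_true_prev, char_probs, sampling_prob=0.1):
    """Pick the 'previous character' fed to the decoder at training time.

    With probability 0.9 use the ground-truth character; with probability
    0.1 sample from the model's current CharacterDistribution output, so
    the decoder also learns to recover from its own mistakes.
    """
    if rng.random() < sampling_prob:
        return int(rng.choice(len(char_probs), p=char_probs))
    return y_true_prev

# The objective (12) is then the sum over steps of log P(y_i | x, y~_{<i}):
# loss = -sum(np.log(step_probs[y_i]) over the output steps i)
```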
• Decoding
• During inference we want to find the most likely character sequence given the input acoustics
• Simple left-to-right beam search
• Beams can be rescored with a language model
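A minimal sketch of left-to-right beam search over a character decoder; step_fn is a hypothetical interface standing in for CharacterDistribution, and the rescoring comment follows the paper's idea of combining the decoder score with an LM score:

```python
def beam_search(step_fn, sos, eos, beam_size=32, max_len=100):
    """Simple left-to-right beam search.

    step_fn(prefix) -> dict {next_symbol: log_prob}; a stand-in for the
    decoder's CharacterDistribution (hypothetical interface).
    """
    beam = [([sos], 0.0)]
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for sym, logp in step_fn(prefix).items():
                candidates.append((prefix + [sym], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                complete.append((prefix, score))   # finished hypothesis
            elif len(beam) < beam_size:
                beam.append((prefix, score))       # keep the best survivors
        if not beam:
            break
    # LM rescoring (sketch): re-rank finished beams with
    #   score / len(prefix) + lm_weight * lm_logprob(prefix)
    return complete or beam

# Toy usage with a fixed bigram table (hypothetical):
table = {"<s>": {"a": -0.4, "b": -1.1}, "a": {"b": -0.1, "</s>": -2.3},
         "b": {"a": -0.5, "</s>": -0.7}}
print(beam_search(lambda p: table[p[-1]], "<s>", "</s>", beam_size=2, max_len=4))
```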
Experiments
Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is the state-of-the-art; the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32. Language Model (LM) rescoring can be beneficial.