End-to-End Neural Architectures for ASR
Lecture 15
CS 753
Instructor: Preethi Jyothi
Connectionist Temporal Classification (CTC): Recap
P(y | x) = ∑_{a ∈ B⁻¹(y)} ∏_{t=1…T} Pr(aₜ | x)

Example alignments that B collapses to "h e l l o":
h h e l l ϵ ϵ l ϵ o
h e ϵ l l ϵ ϵ l o o
Image from: https://distill.pub/2017/ctc/
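The mapping B is just "merge adjacent repeats, then delete ϵ". A minimal Python sketch (the function name collapse is ours, not from the slides):

```python
from itertools import groupby

def collapse(alignment, blank="ϵ"):
    """The CTC mapping B: merge adjacent repeated symbols, then remove blanks."""
    merged = (sym for sym, _ in groupby(alignment))
    return "".join(sym for sym in merged if sym != blank)

print(collapse("hhellϵϵlϵo"))  # -> hello
print(collapse("heϵllϵϵloo"))  # -> hello
```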
Illustration: Forward Algorithm to compute αt(j)

(Figure: trellis of the ϵ-interleaved symbol states j = 1…2l+1 (label rows a, b with ϵ rows in between) against input frames x1 x2 x3 x4 x5 x6.)

αt(j) = ∑_{i=j−2…j} αt−1(i) · aij · bt(y′j)

where

y′j = y_{j/2} if j is even, ϵ otherwise   (j = 1…2l+1 when |y| = l)

bt(y′j) is the probability given by the NN to the symbol y′j at time t

aij = 1 if i = j or i = j − 1
aij = 1 if i = j − 2 and y′j ≠ y′j−2
aij = 0 otherwise

CTC(x, y) = ∑_{a ∈ B⁻¹(y)} Pr(a | x) = αT(2l) + αT(2l + 1)
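A minimal NumPy sketch of this recursion, assuming probs[t, k] holds the per-frame symbol probabilities Pr(k | x) from the network. Indices are 0-based here, so the 1-based states j = 1…2l+1 become 0…2l; a real implementation would work in log space to avoid underflow:

```python
import numpy as np

def ctc_forward(probs, y, blank=0):
    """Forward algorithm for CTC.

    probs : (T, V) array, probs[t, k] = Pr(symbol k at frame t | x)
    y     : target label sequence (symbol ids, no blanks)
    Returns CTC(x, y) = alpha_T(2l) + alpha_T(2l + 1).
    """
    T = probs.shape[0]
    y_prime = [blank]                 # epsilon-interleaved sequence y'
    for s in y:
        y_prime += [s, blank]         # length 2l + 1
    L = len(y_prime)

    alpha = np.zeros((T, L))
    alpha[0, 0] = probs[0, y_prime[0]]        # start in the leading blank
    if L > 1:
        alpha[0, 1] = probs[0, y_prime[1]]    # ... or in the first label

    for t in range(1, T):
        for j in range(L):
            total = alpha[t - 1, j]                      # i = j
            if j >= 1:
                total += alpha[t - 1, j - 1]             # i = j - 1
            if j >= 2 and y_prime[j] != y_prime[j - 2]:
                total += alpha[t - 1, j - 2]             # skip, distinct labels only
            alpha[t, j] = total * probs[t, y_prime[j]]

    # valid paths must end in the last label or the trailing blank
    return alpha[T - 1, L - 1] + (alpha[T - 1, L - 2] if L > 1 else 0.0)
```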
Sequence to sequence models
Encoder-decoder architecture

(Figure: Data → ENCODER → DECODER; the encoder maps the input data to a hidden representation and the decoder generates the output sequence from it.)
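A toy sketch of the idea, assuming vanilla tanh-RNNs for both halves and greedy decoding; all names here (encode_decode, rnn_step, the parameter dict p) are illustrative, not from the lecture:

```python
import numpy as np

def rnn_step(W, U, b, h, x):
    # One vanilla-RNN step: h' = tanh(W h + U x + b)
    return np.tanh(W @ h + U @ x + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode_decode(p, xs, embed, eos, max_len=20):
    """Plain encoder-decoder: the encoder's final state initializes the
    decoder, which then emits one symbol at a time (greedy here)."""
    # ENCODER: fold the whole input sequence into a single hidden state
    h = np.zeros(p["W_e"].shape[0])
    for x in xs:
        h = rnn_step(p["W_e"], p["U_e"], p["b_e"], h, x)
    # DECODER: start from the encoder state; feed each emitted symbol back in
    s, y, out = h, eos, []            # eos doubles as the start symbol here
    for _ in range(max_len):
        s = rnn_step(p["W_d"], p["U_d"], p["b_d"], s, embed[y])
        y = int(np.argmax(softmax(p["W_o"] @ s)))
        if y == eos:
            break
        out.append(y)
    return out

# Toy run with random weights (d = state, k = input, m = embedding, V = vocab):
rng = np.random.default_rng(0)
d, k, m, V = 8, 4, 5, 6
p = {"W_e": rng.normal(0, 0.1, (d, d)), "U_e": rng.normal(0, 0.1, (d, k)),
     "b_e": np.zeros(d), "W_d": rng.normal(0, 0.1, (d, d)),
     "U_d": rng.normal(0, 0.1, (d, m)), "b_d": np.zeros(d),
     "W_o": rng.normal(0, 0.1, (V, d))}
print(encode_decode(p, rng.normal(size=(7, k)), rng.normal(0, 0.1, (V, m)), eos=0))
```

The single vector handed from encoder to decoder is the bottleneck that attention, on the next slide, removes.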
Sequence to sequence models
Encoder-decoder with attention

How do we compute αij?

ci = ∑_j αij hj

(Figure: the DECODER state si attends over ENCODER states h1 … hj … hM with weights αi1 … αij … αiM.)
Sequence to sequence models
Encoder-decoder with attention

eij = f(si, hj), where f could be a neural n/w
αij ← normalized(eij) via a SOFTMAX, so αij ∈ [0, 1]
ci = ∑_j αij hj

(Figure: decoder state si attends over encoder states h1 … hM with weights αi1 … αiM.)
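A NumPy sketch of the three steps (score, normalize, blend). The one-hidden-layer additive scorer and the names W_s, W_h, v are assumptions; the slide only says f could be a neural network:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_i, H, W_s, W_h, v):
    """One attention read: e_ij = f(s_i, h_j), alpha_i = softmax(e_i),
    c_i = sum_j alpha_ij h_j.

    s_i : decoder state, shape (d_s,)
    H   : encoder states h_1..h_M stacked, shape (M, d_h)
    """
    e = np.array([v @ np.tanh(W_s @ s_i + W_h @ h_j) for h_j in H])  # scores e_ij
    alpha = softmax(e)        # attention weights, each in [0, 1], summing to 1
    c_i = alpha @ H           # context vector c_i = sum_j alpha_ij h_j
    return c_i, alpha

# Toy usage with random weights:
rng = np.random.default_rng(0)
d_s, d_h, d_a, M = 4, 5, 3, 6
c_i, alpha = attention_context(
    rng.normal(size=d_s), rng.normal(size=(M, d_h)),
    rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a))
print(alpha.sum())  # 1.0
```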
The Model

• Produces a distribution over characters conditioned on all characters seen previously

From the Listen, Attend and Spell (LAS) paper: the listener is a pyramidal BLSTM encoding the input sequence x into high-level features h, and the speller is an attention-based decoder generating the characters y.

h = Listen(x)  (2)
P(yi | x, y<i) = AttendAndSpell(y<i, h)  (3)

(Figure: LAS architecture; the listener consumes x1 … xT, the speller attends over its output h.)

Attend and spell: the AttendAndSpell function is computed using an attention-based LSTM transducer [10, 12]. At every output step, the transducer produces a probability distribution over the next character conditioned on all the characters seen previously:

ci = AttentionContext(si, h)  (6)
si = RNN(si−1, yi−1, ci−1)  (7)
P(yi | x, y<i) = CharacterDistribution(si, ci)  (8)

where CharacterDistribution is an MLP with softmax outputs over characters, and where RNN is a 2 layer LSTM. The distribution for yi is a function of the decoder state si and context ci; the decoder state si is a function of the previous state si−1, the previously emitted character yi−1 and the context ci−1.

At each time step i, the attention mechanism AttentionContext generates a context vector ci encapsulating the information in the acoustic signal needed to generate the next character. The attention model is content based: the contents of the decoder state si are matched to the contents of hu representing time step u of h, to generate an attention vector αi. The vectors hu are linearly blended using αi to create ci.
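A sketch of one speller step, equations (6)-(8), with simplified stand-ins: a single tanh-RNN replaces the paper's 2-layer LSTM, plain dot-product attention replaces AttentionContext (any content-based scorer, such as the earlier sketch, works), and an affine layer plus softmax plays CharacterDistribution. All parameter names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dot_attention(s_i, H):
    # stand-in AttentionContext: dot-product scores (needs matching dims)
    alpha = softmax(H @ s_i)
    return alpha @ H, alpha

def speller_step(p, s_prev, y_prev_embed, c_prev, H, attend=dot_attention):
    """One AttendAndSpell step, eqs. (6)-(8), with simplified stand-ins."""
    # (7) s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})
    inp = np.concatenate([y_prev_embed, c_prev])
    s_i = np.tanh(p["W_s"] @ s_prev + p["U_s"] @ inp + p["b_s"])
    # (6) c_i = AttentionContext(s_i, h)
    c_i, _ = attend(s_i, H)
    # (8) P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)
    probs = softmax(p["W_y"] @ np.concatenate([s_i, c_i]) + p["b_y"])
    return s_i, c_i, probs

# Toy usage (d = state/feature size, m = char embedding, V = characters):
rng = np.random.default_rng(0)
d, m, V, M = 6, 4, 5, 10
H = rng.normal(size=(M, d))
p = {"W_s": rng.normal(0, 0.1, (d, d)), "U_s": rng.normal(0, 0.1, (d, m + d)),
     "b_s": np.zeros(d), "W_y": rng.normal(0, 0.1, (V, 2 * d)), "b_y": np.zeros(V)}
s, c, probs = speller_step(p, np.zeros(d), np.zeros(m), np.zeros(d), H)
print(probs.sum())  # 1.0
```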
• At each decoder time-step i, AttentionContext computes a score for each encoder step u, which is converted into softmax probabilities used as mixing weights for blending the listener features hu into the context vector ci

Specifically, at each decoder timestep i, the AttentionContext function computes the scalar energy ei,u for each time step u, using vector hu ∈ h and si. The scalar energy ei,u is converted into a probability distribution over time steps (or attention) αi using a softmax function:

ei,u = ⟨φ(si), ψ(hu)⟩  (9)
αi,u = exp(ei,u) / ∑u′ exp(ei,u′)  (10)
ci = ∑u αi,u hu  (11)

where φ and ψ are MLP networks. After training, the αi distribution is typically very sharp and focuses on only a few frames of h; ci can be seen as a continuous bag of weighted features of h. Figure 1 of the paper shows the LAS architecture.

The Listen function uses a Bidirectional Long Short Term Memory (BLSTM) with a pyramidal structure. This modification reduces the length U of h from T, the length of the input: input speech signals can be hundreds to thousands of frames long, and a direct application of BLSTM for the Listen function converged slowly and produced inferior results even after a month of training time, because AttendAndSpell has a hard time extracting the relevant information from a large number of input time steps.
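In the paper, each pyramidal BLSTM layer concatenates pairs of consecutive outputs of the layer below before running its own BLSTM, halving the time resolution per layer. A sketch of just that reduction (the BLSTM itself is elided; shapes are illustrative):

```python
import numpy as np

def pyramid_reduce(H):
    """Halve the time resolution by concatenating consecutive frame pairs.

    H : (T, d) outputs of the previous (B)LSTM layer.
    Returns a (T // 2, 2 * d) array; each row would feed the next pBLSTM layer.
    """
    T, d = H.shape
    T -= T % 2                       # drop a trailing odd frame
    return H[:T].reshape(T // 2, 2 * d)

# Three reductions take T = 1024 frames down to U = 128:
H = np.zeros((1024, 256))
for _ in range(3):
    H = pyramid_reduce(H)            # (512, 512) -> (256, 1024) -> (128, 2048)
print(H.shape)
```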
Training and Decoding

• Training
• Train the parameters of the model to maximize the log probability of the training instances:

θ̃ = max_θ ∑_i log P(yi | x, ỹ<i; θ)  (12)

• where ỹi−1 is the ground truth previous character or a character randomly sampled (with 10% probability) from the model, i.e. CharacterDistribution(si−1, ci−1), using the procedure from [20]
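A sketch of that sampling trick: at each output step, the "previous character" fed to the decoder is the ground truth 90% of the time and a sample from the model's own distribution 10% of the time. The function name previous_char is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def previous_char(y_true_prev, char_probs, sampling_prob=0.1):
    """Pick the 'previous character' fed to the decoder at training time.

    With probability 0.9 use the ground-truth character; with probability
    0.1 sample from the model's current CharacterDistribution output, so
    the decoder also learns to recover from its own mistakes.
    """
    if rng.random() < sampling_prob:
        return int(rng.choice(len(char_probs), p=char_probs))
    return y_true_prev

# The objective (12) is then the sum over steps of log P(y_i | x, y~_{<i}):
# loss = -sum(np.log(step_probs[y_i]) over the output steps i)
```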
• Decoding
• During inference we want to find the most likely character sequence given the input acoustics
• Simple left-to-right beam search
• Beams can be rescored with a language model
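A minimal sketch of left-to-right beam search over a character decoder; step_fn is a hypothetical interface standing in for CharacterDistribution, and the rescoring comment follows the paper's idea of combining the decoder score with an LM score:

```python
def beam_search(step_fn, sos, eos, beam_size=32, max_len=100):
    """Simple left-to-right beam search.

    step_fn(prefix) -> dict {next_symbol: log_prob}; a stand-in for the
    decoder's CharacterDistribution (hypothetical interface).
    """
    beam = [([sos], 0.0)]
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for sym, logp in step_fn(prefix).items():
                candidates.append((prefix + [sym], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates:
            if prefix[-1] == eos:
                complete.append((prefix, score))   # finished hypothesis
            elif len(beam) < beam_size:
                beam.append((prefix, score))       # keep the best survivors
        if not beam:
            break
    # LM rescoring (sketch): re-rank finished beams with
    #   score / len(prefix) + lm_weight * lm_logprob(prefix)
    return complete or beam

# Toy usage with a fixed bigram table (hypothetical):
table = {"<s>": {"a": -0.4, "b": -1.1}, "a": {"b": -0.1, "</s>": -2.3},
         "b": {"a": -0.5, "</s>": -0.7}}
print(beam_search(lambda p: table[p[-1]], "<s>", "</s>", beam_size=2, max_len=4))
```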
Experiments
Table 1: WER comparison on the clean and noisy Google voice search task. The CLDNN-HMM system is the state-of-the-art; the Listen, Attend and Spell (LAS) models are decoded with a beam size of 32. Language Model (LM) rescoring can be beneficial.