Building Speech Recognition Systems With The Kaldi Toolkit
60+ Contributors
Icon from http://thumbs.gograph.com
[Chart: monthly activity over time, Jan-12 through May-14; y-axis 0–200]
Meanwhile, Speech Search went from
“Solved” to “Unsolved” … Again
• NIST TREC SDR (1998)
  – Spoken “document” retrieval from STT output as good as retrieval from reference transcripts
  – Speech search was declared a solved problem!
• NIST STD Pilot (2006)
  – STT was found to be inadequate for spoken “term” detection in conversational telephone speech
• Limited language diversity in CTS corpora
  – English Switchboard, Call Home and Fisher
  – Arabic and Mandarin Chinese Call Home
In 2012, IARPA launched BABEL
One month after Dan Povey returned to Kaldi’s birthplace
[Figure: TDNN layer-wise temporal contexts — Layer 1 splices {-2, +2}, Layer 2 splices {-1, +2}, Layer 3 splices {-3, +3}; the network as a whole sees input frames t-13 through t+9]
P(s \mid a\,\_\,b) = P(s \mid a\,\_)\, F(s \mid \_\,b)

P(s \mid a\,\_) = \frac{c(a\,s) + \lambda_2 P(s)}{c(a) + \lambda_2}

F(s \mid \_\,b) = \frac{c(s\,b) + \lambda_3}{\sum_{a'} c(a' * b)\, P(s \mid a'\,\_) + \lambda_3}
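The smoothed left-context silence estimate P(s | a _) above can be sketched in a few lines. This is a toy illustration with made-up counts; the word counts, silence rate, and the value of λ₂ are all assumptions, not numbers from the slides.

```python
# Sketch of the smoothed silence-probability estimate (hypothetical counts;
# lam2 is an assumed smoothing weight, not a value from the slides).
from collections import Counter

def p_sil_given_left(word_a, counts_as, counts_a, p_sil, lam2=2.0):
    """P(s | a _): chance of silence right after word a, interpolated
    with the unigram silence probability P(s) via weight lam2."""
    return (counts_as[word_a] + lam2 * p_sil) / (counts_a[word_a] + lam2)

# Toy counts: "uh" is followed by silence 80 times out of 100 occurrences.
counts_as = Counter({"uh": 80, "the": 5})
counts_a = Counter({"uh": 100, "the": 200})
p_sil = 0.15  # assumed overall silence rate

print(round(p_sil_given_left("uh", counts_as, counts_a, p_sil), 3))
print(round(p_sil_given_left("the", counts_as, counts_a, p_sil), 3))
```

With more data for a word, the estimate leans on its own counts; for rare words it falls back toward the global silence rate P(s).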
Is “Prosody” Finally Helping STT?
Task         Test Set    Baseline  + Sil/Pron
WSJ          Eval 92     4.1       3.9
Switchboard  Eval 2000   20.5      20.0
TED-LIUM     Test        18.1      17.9
LibriSpeech  Test Clean  6.6       6.6
LibriSpeech  Test Other  22.9      22.5
Cross-entropy training

\hat{\theta}_{ML} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta)

(equivalent to minimizing KL(\hat{P} \,\|\, P_{\theta}))
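As a minimal numeric illustration of the ML criterion, the objective is just the sum of per-utterance log-likelihoods of the observations given their reference transcripts. The likelihood values here are invented for the example.

```python
# Toy version of the ML/cross-entropy objective: sum of log-likelihoods
# of each observation under its reference label sequence.
import math

# Hypothetical per-utterance likelihoods P(O_t | W_t; theta).
likelihoods = [0.9, 0.7, 0.8]
log_lik = sum(math.log(p) for p in likelihoods)
print(round(log_lik, 4))
```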
Sequence training

\hat{\theta}_{MMI} = \arg\max_{\theta} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')} \;\propto\; I(W \wedge O; \theta)
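The MMI criterion above can be computed by hand for a toy utterance with a handful of competing hypotheses. All probabilities below are invented for illustration: the numerator is the acoustic likelihood of the reference, and the denominator marginalizes over all hypotheses weighted by the language model.

```python
# Toy MMI score for one utterance (made-up numbers).
import math

# Acoustic likelihoods P(O | W; theta) and LM priors P(W) for 3 hypotheses.
acoustic = {"ref": 0.6, "alt1": 0.3, "alt2": 0.1}
lm_prior = {"ref": 0.5, "alt1": 0.3, "alt2": 0.2}

num = acoustic["ref"]  # likelihood of the reference transcript
den = sum(acoustic[w] * lm_prior[w] for w in acoustic)  # marginal P(O)
mmi = math.log(num / den)
print(round(mmi, 4))
```

Raising the reference likelihood or lowering competitors' likelihoods both increase the objective, which is what distinguishes sequence training from per-frame cross-entropy.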
Semi-Supervised Sequence Training
1 T P (Ot Wt ; θ ) 1 T P (Ot Wt ; θ )
I (W ∧O ; θ ) = ∑ log = ∑ log
T t=1 P (Ot ; θ ) T t=1 (
∑ P Ot W # ; θ ) P (W #)
W#
1 T
I (W ∧O ; θ ) = H (W ) − H (W O ; θ ) = H (W ) − ∑ H (W Ot ; θ )
T t=1
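The identity I(W∧O) = H(W) − H(W|O) used above can be verified numerically on a tiny joint distribution. The joint probabilities here are made up purely to check the algebra.

```python
# Numeric check of I(W;O) = H(W) - H(W|O) on a made-up 2x2 joint distribution.
import math

# joint[w][o] = P(W=w, O=o); values are hypothetical and sum to 1.
joint = {"a": {"x": 0.4, "y": 0.1}, "b": {"x": 0.1, "y": 0.4}}

p_w = {w: sum(row.values()) for w, row in joint.items()}
p_o = {}
for row in joint.values():
    for o, p in row.items():
        p_o[o] = p_o.get(o, 0.0) + p

h_w = -sum(p * math.log(p) for p in p_w.values())
h_w_given_o = -sum(p * math.log(p / p_o[o])
                   for w, row in joint.items() for o, p in row.items())
mi_direct = sum(p * math.log(p / (p_w[w] * p_o[o]))
                for w, row in joint.items() for o, p in row.items())

# The direct mutual information matches the entropy difference.
assert abs(mi_direct - (h_w - h_w_given_o)) < 1e-12
print(round(mi_direct, 4))
```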
• Even after that, language-specific input and output layers work best

Shared DNN layers (except 2)   51.2%  45.0%  25.4%
+ Optimized Language Model     50.8%  44.8%  25.4%
+ Duration Modeling            50.4%  44.3%  24.8%
Advanced Methods:
CTC, Explained … in Pictures

Figure from Graves et al., ICML 2006

[Figure: the phone sequence “dh ax s aw n d” with CTC blank symbols (β) interleaved: β dh ax s aw n d β]
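The picture above illustrates CTC's many-to-one collapsing map B (after Graves et al.): merge repeated labels, then remove blanks. A minimal sketch, using the slide's phone string and β as the blank symbol:

```python
# CTC collapsing function B: merge repeats, then drop blanks ("β" here).
def ctc_collapse(path, blank="β"):
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# One possible frame-level path for the phones of "the sound".
path = ["β", "dh", "dh", "β", "ax", "s", "s", "β", "aw", "n", "d", "β"]
print(ctc_collapse(path))  # → ['dh', 'ax', 's', 'aw', 'n', 'd']
```

Note that a blank between two identical labels keeps them distinct, which is why CTC needs the blank symbol at all.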
DNN versus CTC: STT Performance
Figures and Tables from Sak et al, ICASSP 2015
\hat{\theta}_{ML} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta) \quad (\text{minimizes } KL(\hat{P} \,\|\, P_{\theta}))

\hat{\theta}_{MMI} = \arg\max_{\theta} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')} \;\propto\; I(W \wedge O; \theta)
Lattice-Free MMI Training
• Denominator (phone) graph creation
  – Use a phone 4-gram language model, L
  – Compose H, C and L to obtain denominator graph
    • This FSA is the same for all utterances; suits GPU training
    • Use (heuristic) sentence-specific initial probabilities
• Numerator graph creation
  – Generate a phone graph using transcripts
    • This FSA encodes frame-by-frame alignment of HMM states
  – Permit some alignment “slack” for each frame/label
  – Intersect slackened FSA with the denominator FSA
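The alignment "slack" idea above can be sketched simply: each frame is allowed to take any label that the baseline alignment assigns within a small window around it. This is only an illustration of the concept; Kaldi realizes it as an FSA that is then intersected with the denominator graph, and the window size here is an assumed parameter.

```python
# Hedged sketch of per-frame alignment slack: frame t may take any label
# seen within +/- slack frames of t in the baseline forced alignment.
def slackened_labels(alignment, slack=1):
    n = len(alignment)
    return [sorted(set(alignment[max(0, t - slack):min(n, t + slack + 1)]))
            for t in range(n)]

# Toy forced alignment (phone label per frame).
align = ["sil", "sil", "dh", "dh", "ax", "ax"]
print(slackened_labels(align, slack=1))
```

Frames near a phone boundary end up with more than one permitted label, so the numerator graph no longer pins the boundary to a single frame.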
Lattice-Free MMI Training (cont’d)
• LSTM-RNNs trained with this MMI training procedure are highly susceptible to over-fitting
• Essential to regularize the NN training process
  – A second output layer for CE training
  – Output L2 regularization
  – Use a leaky HMM
Regularization — Hub-5 ’00 Word Error Rate

Cross Entropy  L2 Norm  Leaky HMM  Total   SWBD
N              N        N          16.8%   11.1%
Y              N        N          15.9%   10.5%
N              Y        N          15.9%   10.4%
N              N        Y          16.4%   10.9%
Y              Y        N          15.7%   10.3%
Y              N        Y          15.7%   10.3%
N              Y        Y          15.8%   10.4%
Y              Y        Y          15.6%   10.4%
STT Results for Chain Models
300 hours of SWBD Training Speech; Hub-5 ’00 Evaluation Set

Training Objective  Model (Size)    Total WER  SWBD WER
Cross-Entropy       TDNN A (16.6M)  18.2%      12.5%
CE + sMBR           TDNN A (16.6M)  16.9%      11.4%