Building Speech Recognition Systems With The Kaldi Toolkit
60+ Contributors
Icon from http://thumbs.gograph.com
[Chart: monthly activity over time, Jan-12 through May-14; y-axis 0–200]
Meanwhile, Speech Search went from
“Solved” to “Unsolved” … Again
• NIST TREC SDR (1998)
  – Spoken “document” retrieval from STT output as good as retrieval from reference transcripts
  – Speech search was declared a solved problem!
• NIST STD Pilot (2006)
  – STT was found to be inadequate for spoken “term” detection in conversational telephone speech
• Limited language diversity in CTS corpora
  – English Switchboard, Call Home and Fisher
  – Arabic and Mandarin Chinese Call Home
In 2012, IARPA launched BABEL
One month after Dan Povey returned to Kaldi’s birthplace
[Figure: TDNN layer-wise temporal contexts — Layer 1 splices {-2, +2}, Layer 2 splices {-1, +2}, Layer 3 splices {-3, +3}; the network as a whole sees input frames t-13 through t+9]
P(s \mid a\,\_\,b) = P(s \mid a\,\_)\, F(s \mid \_\,b)

P(s \mid a\,\_) = \frac{c(a\,s) + \lambda_2 P(s)}{c(a) + \lambda_2}

F(s \mid \_\,b) = \frac{c(s\,b) + \lambda_3}{\sum_{a'} c(a' * b)\, P(s \mid a'\,\_) + \lambda_3}
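The smoothed left-context silence estimate P(s | a _) above can be sketched in a few lines. This is a toy illustration with made-up counts; the word counts, silence rate, and the value of λ₂ are all assumptions, not numbers from the slides.

```python
# Sketch of the smoothed silence-probability estimate (hypothetical counts;
# lam2 is an assumed smoothing weight, not a value from the slides).
from collections import Counter

def p_sil_given_left(word_a, counts_as, counts_a, p_sil, lam2=2.0):
    """P(s | a _): chance of silence right after word a, interpolated
    with the unigram silence probability P(s) via weight lam2."""
    return (counts_as[word_a] + lam2 * p_sil) / (counts_a[word_a] + lam2)

# Toy counts: "uh" is followed by silence 80 times out of 100 occurrences.
counts_as = Counter({"uh": 80, "the": 5})
counts_a = Counter({"uh": 100, "the": 200})
p_sil = 0.15  # assumed overall silence rate

print(round(p_sil_given_left("uh", counts_as, counts_a, p_sil), 3))
print(round(p_sil_given_left("the", counts_as, counts_a, p_sil), 3))
```

With more data for a word, the estimate leans on its own counts; for rare words it falls back toward the global silence rate P(s).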
Is “Prosody” Finally Helping STT?
Task         Test Set    Baseline  + Sil/Pron
WSJ          Eval 92     4.1       3.9
Switchboard  Eval 2000   20.5      20.0
TED-LIUM     Test        18.1      17.9
LibriSpeech  Test Clean  6.6       6.6
LibriSpeech  Test Other  22.9      22.5
Cross-entropy training

\hat{\theta}_{ML} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta)

(equivalent to minimizing KL(\hat{P} \,\|\, P_{\theta}))
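As a minimal numeric illustration of the ML criterion, the objective is just the sum of per-utterance log-likelihoods of the observations given their reference transcripts. The likelihood values here are invented for the example.

```python
# Toy version of the ML/cross-entropy objective: sum of log-likelihoods
# of each observation under its reference label sequence.
import math

# Hypothetical per-utterance likelihoods P(O_t | W_t; theta).
likelihoods = [0.9, 0.7, 0.8]
log_lik = sum(math.log(p) for p in likelihoods)
print(round(log_lik, 4))
```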
Sequence training

\hat{\theta}_{MMI} = \arg\max_{\theta} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')} \;\propto\; I(W \wedge O; \theta)
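The MMI criterion above can be computed by hand for a toy utterance with a handful of competing hypotheses. All probabilities below are invented for illustration: the numerator is the acoustic likelihood of the reference, and the denominator marginalizes over all hypotheses weighted by the language model.

```python
# Toy MMI score for one utterance (made-up numbers).
import math

# Acoustic likelihoods P(O | W; theta) and LM priors P(W) for 3 hypotheses.
acoustic = {"ref": 0.6, "alt1": 0.3, "alt2": 0.1}
lm_prior = {"ref": 0.5, "alt1": 0.3, "alt2": 0.2}

num = acoustic["ref"]  # likelihood of the reference transcript
den = sum(acoustic[w] * lm_prior[w] for w in acoustic)  # marginal P(O)
mmi = math.log(num / den)
print(round(mmi, 4))
```

Raising the reference likelihood or lowering competitors' likelihoods both increase the objective, which is what distinguishes sequence training from per-frame cross-entropy.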
Semi-Supervised Sequence Training
1 T P (Ot Wt ; θ ) 1 T P (Ot Wt ; θ )
I (W ∧O ; θ ) = ∑ log = ∑ log
T t=1 P (Ot ; θ ) T t=1 (
∑ P Ot W # ; θ ) P (W #)
W#
1 T
I (W ∧O ; θ ) = H (W ) − H (W O ; θ ) = H (W ) − ∑ H (W Ot ; θ )
T t=1
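The identity I(W∧O) = H(W) − H(W|O) used above can be verified numerically on a tiny joint distribution. The joint probabilities here are made up purely to check the algebra.

```python
# Numeric check of I(W;O) = H(W) - H(W|O) on a made-up 2x2 joint distribution.
import math

# joint[w][o] = P(W=w, O=o); values are hypothetical and sum to 1.
joint = {"a": {"x": 0.4, "y": 0.1}, "b": {"x": 0.1, "y": 0.4}}

p_w = {w: sum(row.values()) for w, row in joint.items()}
p_o = {}
for row in joint.values():
    for o, p in row.items():
        p_o[o] = p_o.get(o, 0.0) + p

h_w = -sum(p * math.log(p) for p in p_w.values())
h_w_given_o = -sum(p * math.log(p / p_o[o])
                   for w, row in joint.items() for o, p in row.items())
mi_direct = sum(p * math.log(p / (p_w[w] * p_o[o]))
                for w, row in joint.items() for o, p in row.items())

# The direct mutual information matches the entropy difference.
assert abs(mi_direct - (h_w - h_w_given_o)) < 1e-12
print(round(mi_direct, 4))
```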
• Even after that, language-specific input and output layers work best

Shared DNN layers (except 2)   51.2%  45.0%  25.4%
+ Optimized Language Model     50.8%  44.8%  25.4%
+ Duration Modeling            50.4%  44.3%  24.8%
Advanced Methods:
CTC, Explained … in Pictures

Figure from Graves et al., ICML 2006

[Figure: the phone sequence “dh ax s aw n d” with CTC blank symbols (β) interleaved: β dh ax s aw n d β]
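The picture above illustrates CTC's many-to-one collapsing map B (after Graves et al.): merge repeated labels, then remove blanks. A minimal sketch, using the slide's phone string and β as the blank symbol:

```python
# CTC collapsing function B: merge repeats, then drop blanks ("β" here).
def ctc_collapse(path, blank="β"):
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# One possible frame-level path for the phones of "the sound".
path = ["β", "dh", "dh", "β", "ax", "s", "s", "β", "aw", "n", "d", "β"]
print(ctc_collapse(path))  # → ['dh', 'ax', 's', 'aw', 'n', 'd']
```

Note that a blank between two identical labels keeps them distinct, which is why CTC needs the blank symbol at all.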
DNN versus CTC: STT Performance
Figures and Tables from Sak et al, ICASSP 2015
\hat{\theta}_{ML} = \arg\max_{\theta} \sum_{t=1}^{T} \log P(O_t \mid W_t; \theta) \quad (\text{minimizes } KL(\hat{P} \,\|\, P_{\theta}))

\hat{\theta}_{MMI} = \arg\max_{\theta} \sum_{t=1}^{T} \log \frac{P(O_t \mid W_t; \theta)}{\sum_{W'} P(O_t \mid W'; \theta)\, P(W')} \;\propto\; I(W \wedge O; \theta)
Lattice-Free MMI Training
• Denominator (phone) graph creation
  – Use a phone 4-gram language model, L
  – Compose H, C and L to obtain denominator graph
    • This FSA is the same for all utterances; suits GPU training
    • Use (heuristic) sentence-specific initial probabilities
• Numerator graph creation
  – Generate a phone graph using transcripts
    • This FSA encodes frame-by-frame alignment of HMM states
  – Permit some alignment “slack” for each frame/label
  – Intersect slackened FSA with the denominator FSA
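The alignment "slack" idea above can be sketched simply: each frame is allowed to take any label that the baseline alignment assigns within a small window around it. This is only an illustration of the concept; Kaldi realizes it as an FSA that is then intersected with the denominator graph, and the window size here is an assumed parameter.

```python
# Hedged sketch of per-frame alignment slack: frame t may take any label
# seen within +/- slack frames of t in the baseline forced alignment.
def slackened_labels(alignment, slack=1):
    n = len(alignment)
    return [sorted(set(alignment[max(0, t - slack):min(n, t + slack + 1)]))
            for t in range(n)]

# Toy forced alignment (phone label per frame).
align = ["sil", "sil", "dh", "dh", "ax", "ax"]
print(slackened_labels(align, slack=1))
```

Frames near a phone boundary end up with more than one permitted label, so the numerator graph no longer pins the boundary to a single frame.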
Lattice-Free MMI Training (cont’d)
• LSTM-RNNs trained with this MMI training procedure are highly susceptible to over-fitting
• Essential to regularize the NN training process
  – A second output layer for CE training
  – Output L2 regularization
  – Use a leaky HMM
Regularization — Hub-5 ’00 Word Error Rate

Cross Entropy  L2 Norm  Leaky HMM  Total   SWBD
N              N        N          16.8%   11.1%
Y              N        N          15.9%   10.5%
N              Y        N          15.9%   10.4%
N              N        Y          16.4%   10.9%
Y              Y        N          15.7%   10.3%
Y              N        Y          15.7%   10.3%
N              Y        Y          15.8%   10.4%
Y              Y        Y          15.6%   10.4%
STT Results for Chain Models
300 hours of SWBD Training Speech; Hub-5 ’00 Evaluation Set

Training Objective  Model (Size)    Total WER  SWBD WER
Cross-Entropy       TDNN A (16.6M)  18.2%      12.5%
CE + sMBR           TDNN A (16.6M)  16.9%      11.4%