The Kaldi Speech Recognition Toolkit
I. INTRODUCTION
Kaldi1 is an open-source toolkit for speech recognition
written in C++ and licensed under the Apache License v2.0.
The goal of Kaldi is to have modern and flexible code that is
easy to understand, modify and extend. Kaldi is available on
SourceForge (see http://kaldi.sf.net/). The tools compile on the
commonly used Unix-like systems and on Microsoft Windows.
Researchers on automatic speech recognition (ASR) have
several potential choices of open-source toolkits for building a
recognition system. Notable among these are: HTK [1], Julius
[2] (both written in C), Sphinx-4 [3] (written in Java), and the
RWTH ASR toolkit [4] (written in C++). Yet, our specific
requirements (a finite-state transducer (FST) based framework, extensive linear algebra support, and a non-restrictive
license) led to the development of Kaldi. Important features
of Kaldi include:
Integration with Finite State Transducers: We compile
against the OpenFst toolkit [5] (using it as a library).
Extensive linear algebra support: We include a matrix
library that wraps standard BLAS and LAPACK routines (a brief illustration follows below).
Extensible design: We attempt to provide our algorithms
in the most generic form possible. For instance, our decoders
work with an interface that provides a score for a particular
frame and FST input symbol. Thus the decoder could work
from any suitable source of scores.
Open license: The code is licensed under Apache v2.0,
which is one of the least restrictive licenses available.
1 According to legend, Kaldi was the Ethiopian goatherd who discovered
the coffee plant.
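As a small illustration of this wrapping (SimpleMatrix and AddMatMat below are illustrative names, not Kaldi's actual Matrix API), a row-major matrix multiply can simply forward to the standard CBLAS sgemm routine:

#include <cblas.h>
#include <vector>

// Illustrative matrix type with row-major storage.
struct SimpleMatrix {
  int rows, cols;
  std::vector<float> data;
  SimpleMatrix(int r, int c) : rows(r), cols(c), data(r * c, 0.0f) {}
};

// C = alpha * A * B + beta * C, delegated to BLAS.
void AddMatMat(float alpha, const SimpleMatrix &A, const SimpleMatrix &B,
               float beta, SimpleMatrix *C) {
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              A.rows, B.cols, A.cols,        // M, N, K
              alpha, &A.data[0], A.cols,     // A and its row stride
              &B.data[0], B.cols,            // B and its row stride
              beta, &C->data[0], C->cols);   // C and its row stride
}

Higher-level code thus sees a typed matrix object, while the numerical work is done by whichever optimized BLAS implementation the toolkit is linked against.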
[Figure: overview of the Kaldi libraries, showing the external libraries (BLAS/LAPACK) and the GMM, SGMM, Transforms, Utils, LM, Tree, FST ext, HMM, Decodable, and Decoder modules.]
The acoustic model class AmDiagGmm represents a collection of DiagGmm objects, indexed by pdf-ids that correspond
to context-dependent HMM states. This class does not represent any HMM structure, but just a collection of densities (i.e.
GMMs). There are separate classes that represent the HMM
structure, principally the topology and transition-modeling
code and the code responsible for compiling decoding graphs,
which provide a mapping between the HMM states and the
pdf index of the acoustic model class. Speaker adaptation
and other linear transforms like maximum likelihood linear
transform (MLLT) [6] or semi-tied covariance (STC) [7] are
implemented by separate classes.
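To make the "collection of densities" idea concrete, the sketch below shows the essential structure with simplified names and storage; it is illustrative only, not the actual Kaldi API. Each pdf is a diagonal-covariance GMM, and the acoustic model is nothing more than a vector of such GMMs addressed by pdf-id:

#include <cassert>
#include <cmath>
#include <vector>

// Sketch of a single diagonal-covariance GMM (stands in for DiagGmm).
struct DiagGmmSketch {
  std::vector<float> weights;               // mixture weights, summing to 1
  std::vector<std::vector<float> > means;   // per-component mean vectors
  std::vector<std::vector<float> > vars;    // per-component diagonal variances

  float LogLikelihood(const std::vector<float> &x) const {
    float lik = 0.0f;
    for (size_t m = 0; m < weights.size(); ++m) {
      float log_gauss = 0.0f;               // log N(x; mean_m, diag var_m)
      for (size_t d = 0; d < x.size(); ++d) {
        float diff = x[d] - means[m][d];
        log_gauss -= 0.5f * (std::log(2.0f * static_cast<float>(M_PI) * vars[m][d])
                             + diff * diff / vars[m][d]);
      }
      lik += weights[m] * std::exp(log_gauss);
    }
    return std::log(lik);
  }
};

// Sketch of the acoustic model: just pdf-id -> GMM, with no HMM structure.
class AmSketch {
 public:
  int NumPdfs() const { return static_cast<int>(pdfs_.size()); }
  float LogLikelihood(int pdf_id, const std::vector<float> &frame) const {
    assert(pdf_id >= 0 && pdf_id < NumPdfs());
    return pdfs_[pdf_id].LogLikelihood(frame);  // mapping from HMM states to
                                                // pdf-ids lives elsewhere
  }
 private:
  std::vector<DiagGmmSketch> pdfs_;  // one GMM per context-dependent HMM state
};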
C. HMM Topology
It is possible in Kaldi to separately specify the HMM
topology for each context-independent phone. The topology
format allows nonemitting states, and allows the user to prespecify tying of the p.d.f.s in different HMM states.
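For illustration, a three-state left-to-right topology entry written in the style of Kaldi's text topology format might look roughly as follows (the markup is shown in simplified form and should be checked against the toolkit's documentation; the phone numbers are arbitrary):

<Topology>
<TopologyEntry>
<ForPhones> 1 2 3 4 5 6 7 8 </ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.5 <Transition> 1 0.5 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.5 <Transition> 2 0.5 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.5 <Transition> 3 0.5 </State>
<State> 3 </State>
</TopologyEntry>
</Topology>

Here state 3 is a final, nonemitting state (it has no <PdfClass>), and tying could be pre-specified by giving two emitting states the same <PdfClass> value.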
D. Speaker adaptation
We support both model-space adaptation using maximum
likelihood linear regression (MLLR) [8] and feature-space
adaptation using feature-space MLLR (fMLLR), also known
as constrained MLLR [9]. For both MLLR and fMLLR,
multiple transforms can be estimated using a regression tree
[10]. When a single fMLLR transform is needed, it can be
used as an additional processing step in the feature pipeline.
The toolkit also supports speaker normalization using a linear
approximation to VTLN, similar to [11], or conventional
feature-level VTLN, or a more generic approach for gender
normalization which we call the exponential transform [12].
Both fMLLR and VTLN can be used for speaker adaptive
training (SAT) of the acoustic models.
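Concretely, a single estimated fMLLR (constrained MLLR) transform is just an affine map x' = A x + b applied to every feature vector, which is why it can be inserted as one more step in the feature pipeline. The function below is a minimal sketch of that application step (generic linear algebra, not the Kaldi implementation):

#include <vector>

// Apply an fMLLR / constrained-MLLR transform W = [A b] to one feature
// vector: returns x' = A x + b. Assumes b.size() == A.size() and
// A[i].size() == x.size().
std::vector<float> ApplyFmllr(const std::vector<std::vector<float> > &A,
                              const std::vector<float> &b,
                              const std::vector<float> &x) {
  std::vector<float> y(b);                 // start from the bias term b
  for (size_t i = 0; i < A.size(); ++i)
    for (size_t j = 0; j < x.size(); ++j)
      y[i] += A[i][j] * x[j];              // accumulate A x
  return y;
}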
E. Subspace Gaussian Mixture Models
For subspace Gaussian mixture models (SGMMs), the
toolkit provides an implementation of the approach described
in [13]. There is a single class AmSgmm that represents a whole
collection of pdfs; unlike the GMM case there is no class that
represents a single pdf of the SGMM. Similar to the GMM
case, however, separate classes handle model estimation and
speaker adaptation using fMLLR.
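To make the structural difference from the GMM case concrete: in the basic SGMM (ignoring sub-states and speaker vectors; see [13] for the full model), the parameters of pdf j are not stored directly but are derived from a low-dimensional state vector v_j and globally shared projections, roughly

  \mu_{ji} = M_i v_j, \qquad
  w_{ji} = \frac{\exp(w_i^\top v_j)}{\sum_{i'} \exp(w_{i'}^\top v_j)},

with full covariances \Sigma_i shared across states. Because the M_i, w_i and \Sigma_i are global, it is natural for a single AmSgmm object to hold the entire collection of pdfs rather than having one object per pdf.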
V. PHONETIC DECISION TREES
Our goals in building the phonetic decision tree code were
to make it efficient for arbitrary context sizes (i.e. we avoided
enumerating contexts), and also to make it general enough
to support a wide range of approaches. The conventional
approach is, in each HMM-state of each monophone, to have
a decision tree that asks questions about, say, the left and
right phones. In our framework, the decision-tree roots can
be shared among the phones and among the states of the
phones, and questions can be asked about any phone in the
context window, and about the HMM state. Phonetic questions
can be supplied based on linguistic knowledge, but in our recipes the questions are generated automatically by clustering the phones.
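To illustrate what such a lookup involves, the sketch below (simplified, and not the actual Kaldi tree classes) shows a decision-tree node that can ask about any position in the phonetic context window or about the HMM state, returning a pdf-id at its leaves:

#include <memory>
#include <set>
#include <vector>

struct TreeNode {
  // Which element of the context to test: positions 0..N-1 index the phones
  // in the context window; -1 (a convention used only in this sketch) means
  // the HMM-state index.
  int key = 0;
  std::set<int> yes_set;              // question: "is the tested value in this set?"
  std::unique_ptr<TreeNode> yes, no;  // children; both null at a leaf
  int pdf_id = -1;                    // answer stored at a leaf

  int Lookup(const std::vector<int> &phone_window, int hmm_state) const {
    if (!yes && !no) return pdf_id;   // reached a leaf
    int value = (key == -1) ? hmm_state : phone_window[key];
    const TreeNode *child = yes_set.count(value) ? yes.get() : no.get();
    return child->Lookup(phone_window, hmm_state);
  }
};

Sharing of decision-tree roots then simply means that several (phone, HMM-state) combinations begin their lookup at the same root node.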
TABLE I
BASIC TRIPHONE SYSTEM ON RESOURCE MANAGEMENT: %WERs

               Test set
           Feb89  Oct89  Feb91  Sep92   Avg
  HTK       2.77   4.02   3.30   6.29   4.10
  Kaldi     3.20   4.21   3.50   5.86   4.06
VIII. DECODERS
from: http://hlt.fbk.eu/en/irstlm
from: http://www.speech.sri.com/projects/srilm/
Our decoders work with an interface that provides a score for a particular frame and FST input symbol:

class DecodableInterface {
public:
  // Acoustic score (log-likelihood) for the given frame and input symbol.
  virtual float LogLikelihood(int frame, int index) = 0;
  // True if 'frame' is the last frame of the utterance.
  virtual bool IsLastFrame(int frame) = 0;
  // Number of distinct input-symbol indices that can be scored.
  virtual int NumIndices() = 0;
  // Virtual destructor: this is an abstract base class.
  virtual ~DecodableInterface() {}
};
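As a concrete example of "any suitable source of scores": a Decodable object could simply serve scores from a precomputed matrix of log-likelihoods, with one row per frame and one column per input symbol. The class below is an illustrative sketch, not part of Kaldi (it assumes zero-based index values for simplicity):

#include <vector>

class PrecomputedDecodable : public DecodableInterface {
public:
  explicit PrecomputedDecodable(const std::vector<std::vector<float> > &loglikes)
      : loglikes_(loglikes) {}

  // Look the score up in the precomputed table.
  float LogLikelihood(int frame, int index) { return loglikes_[frame][index]; }

  bool IsLastFrame(int frame) {
    return frame == static_cast<int>(loglikes_.size()) - 1;
  }

  int NumIndices() {
    return loglikes_.empty() ? 0 : static_cast<int>(loglikes_[0].size());
  }

private:
  std::vector<std::vector<float> > loglikes_;  // [frame][index] -> log-likelihood
};

Anything that can produce such a score on demand, whether a GMM, an SGMM or some other model, could implement the same interface, which is what keeps the decoders independent of the acoustic model.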
TABLE II
BASIC TRIPHONE SYSTEM, WSJ, 20K OPEN VOCABULARY, BIGRAM LM, SI-284 TRAIN: %WERs

                  Test set
                Nov92  Nov93
  Bell           11.9   15.4
  HTK (+GD)      11.1   14.5
  Kaldi          11.8   15.0
TABLE III
RESULTS ON RM AND ON WSJ, 20K OPEN VOCABULARY, BIGRAM LM, TRAINED ON HALF OF SI-84: %WERs

                            RM (Avg)  WSJ Nov92  WSJ Nov93
  Triphone                    3.97       12.5       18.3
  + fMLLR                     3.59       11.4       15.5
  + LVTLN                     3.30       11.1       16.4
  Splice-9 + LDA + MLLT       3.88       12.2       17.7
  + SAT (fMLLR)               2.70        9.6       13.7
  + SGMM + spk-vecs           2.45       10.0       13.4
  + fMLLR                     2.31        9.8       12.9
  + ET                        2.15        9.0       12.3
B. Other experiments
Here we report some more results on both the WSJ test sets
(Nov92 and Nov93) using systems trained on just the SI-84
part of the training data, which demonstrate different features
supported by Kaldi. We also report results on the RM task,
averaged over 6 test sets: the 4 mentioned in Table I together
with Mar87 and Oct87. The best result for a conventional
GMM system is achieved by a SAT system that splices 9
frames (4 on each side of the current frame) and uses LDA
to project down to 40 dimensions, together with MLLT. We
achieve better performance on average, with an SGMM system
trained on the same features, with speaker vectors and fMLLR
adaptation. The last line, with the best results, includes the
exponential transform [12] in the features.
X. CONCLUSIONS
We described the design of Kaldi, a free and open-source
speech recognition toolkit. The toolkit currently supports modeling of context-dependent phones of arbitrary context lengths,
and all commonly used techniques that can be estimated using
maximum likelihood. It also supports the recently proposed
SGMMs. Development of Kaldi is continuing, and we are working on using large language models in the FST framework, on lattice generation, and on discriminative training.
ACKNOWLEDGMENTS
We would like to acknowledge participants and collaborators in the 2009
Johns Hopkins University Workshop, including Mohit Agarwal, Pinar Akyazi,
Martin Karafiát, Feng Kai, Ariya Rastrow, Richard C. Rose and Samuel
Thomas; Patrick Nguyen, for introducing the participants in that workshop
and for help with WSJ recipes, and faculty and staff at JHU for their help
during that workshop, including Sanjeev Khudanpur, Desiree Cleves, and the
late Fred Jelinek.
We would like to acknowledge the support of Geoffrey Zweig and Alex Acero of Microsoft Research;
Henrique (Rico) Malvar of Microsoft Research for allowing the use of his
FFT code; and Patrick Nguyen for help with WSJ recipes. We would like
to acknowledge the help with coding and documentation from Sandeep Boda
and Sandeep Reddy (sponsored by Go-Vivace Inc.) and Haihua Xu. We thank
Pavel Matějka (and Phonexia s.r.o.) for allowing the use of feature processing
code.
During the development of Kaldi, Arnab Ghoshal was supported by
the European Community's Seventh Framework Programme under grant
agreement no. 213850 (SCALE); the BUT researchers were supported by the
Technology Agency of the Czech Republic under project No. TA01011328,
and partially by Czech MPO project No. FR-TI1/034.
The JHU 2009 workshop was supported by National Science Foundation
Grant Number IIS-0833652, with supplemental funding from Google Research, DARPA's GALE program, and the Johns Hopkins University Human
Language Technology Center of Excellence.
REFERENCES
[1] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4). Cambridge University Engineering Department, 2009.
[2] A. Lee, T. Kawahara, and K. Shikano, "Julius: an open source real-time large vocabulary recognition engine," in EUROSPEECH, 2001, pp. 1691–1694.
[3] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, "Sphinx-4: A flexible open source framework for speech recognition," Sun Microsystems Inc., Technical Report SMLI TR-2004-0811, 2004.
[4] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, "The RWTH Aachen University Open Source Speech Recognition System," in INTERSPEECH, 2009, pp. 2111–2114.
[5] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: a general and efficient weighted finite-state transducer library," in Proc. CIAA, 2007.
[6] R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. IEEE ICASSP, vol. 2, 1998, pp. 661–664.
[7] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. Speech and Audio Proc., vol. 7, no. 3, pp. 272–281, May 1999.
[8] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, no. 2, pp. 171–185, 1995.
[9] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75–98, April 1998.
[10] M. J. F. Gales, "The generation and use of regression class trees for MLLR adaptation," Cambridge University Engineering Department, Technical Report CUED/F-INFENG/TR.263, August 1996.
[11] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland, "Using VTLN for broadcast news transcription," in Proc. ICSLP, 2004, pp. 1953–1956.
[12] D. Povey, G. Zweig, and A. Acero, "The Exponential Transform as a generic substitute for VTLN," in IEEE ASRU, 2011.
[13] D. Povey, L. Burget et al., "The subspace Gaussian mixture model: A structured model for speech recognition," Computer Speech & Language, vol. 25, no. 2, pp. 404–439, April 2011.
[14] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech and Language, vol. 20, no. 1, pp. 69–88, 2002.
[15] D. Povey and P. C. Woodland, "Frame discrimination training for HMMs for large vocabulary speech recognition," in Proc. IEEE ICASSP, vol. 1, 1999, pp. 333–336.
[16] W. Reichl and W. Chou, "Robust decision tree state tying for continuous speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 555–566, September 2000.
[17] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, "Large vocabulary continuous speech recognition using HTK," in Proc. IEEE ICASSP, vol. 2, 1994, pp. II/125–II/128.