
APPROACHES TO AUTOMATIC LEXICON LEARNING WITH LIMITED TRAINING EXAMPLES

Nagendra Goel (1), Samuel Thomas (2), Mohit Agarwal (3), Pinar Akyazi (4), Lukáš Burget (5), Kai Feng (6), Arnab Ghoshal (7), Ondřej Glembek (5), Martin Karafiát (5), Daniel Povey (8), Ariya Rastrow (2), Richard C. Rose (9), Petr Schwarz (5)

(1) Go-Vivace Inc., Virginia, USA, [email protected]; (2) Johns Hopkins University, MD, [email protected]; (3) IIIT Allahabad, India; (4) Boğaziçi University, Turkey; (5) Brno University of Technology, Czech Republic; (6) Hong Kong UST; (7) Saarland University, Germany; (8) Microsoft Research, Redmond, WA; (9) McGill University, Canada

ABSTRACT

Preparation of a lexicon for speech recognition systems can be a significant effort in languages where the written form is not exactly phonetic. On the other hand, in languages where the written form is quite phonetic, some common words are often mispronounced. In this paper, we use a combination of lexicon learning techniques to explore whether a lexicon can be learned when only a small lexicon is available for boot-strapping. We discover that for a phonetic language such as Spanish, it is possible to do that better than what is possible from generic rules or hand-crafted pronunciations. For a more complex language such as English, we find that it is still possible but with some loss of accuracy.

Index Terms— Lexicon Learning, LVCSR

1. INTRODUCTION

This paper describes work done during the Johns Hopkins University 2009 summer workshop by the group titled "Low Development Cost, High Quality Speech Recognition for New Languages and Domains". For other work done by the same team, see [1], which describes work on UBM models, [2], which describes in more detail our work as it relates to cross-language acoustic model training, and [3], which provides more details on issues of speaker adaptation in this framework.

Traditionally, pronunciation dictionaries or lexicons are hand-crafted using a predefined phone-set. For building ASR systems in a new language, having a hand-crafted dictionary covering the entire vocabulary of the recognizer can be an expensive option. Linguistically trained human resources may be scarce and prone to errors. Therefore it is desirable to have automated methods that can leverage a limited amount of acoustic training data and a small pronunciation dictionary to generate a much larger lexicon for the recognizer.

In this paper, we explore some approaches for automatically generating pronunciations for words using limited hand-crafted training examples. To address the issues of using these dictionaries in different acoustic conditions, or to determine a phone-set inventory, other approaches have been proposed [4, 5]. Use of multiple pronunciations when a much larger amount of acoustic data is available for those words is explored in [6].

In order to cover the words that are not seen in the acoustic training data, it is necessary to have a grapheme-to-phoneme (G2P) system that uses the word orthography to guess the pronunciation of the word. Our main approach is to iteratively refine this G2P system by adding more pronunciations to the training pool if they can be reliably estimated from the acoustics.

We find that for a language like English, the G2P models trained on a small startup lexicon can be very inaccurate. It is necessary to iteratively refine the pronunciations generated by the G2P for each word, while constraining the pronunciation search space to the top N pronunciations. On the other hand, if the language is very graphemic in pronunciation, such as Spanish, G2P models may be very accurate, but miss a number of common alternate pronunciations. Therefore, to add more alternates, it helps to use free phonetic speech recognition and align it with the transcripts.

The rest of the paper is organized as follows. In Section 2 we describe the approaches we use to estimate pronunciations. In Section 3 we discuss how we use these approaches in experiments on two languages, English and Spanish, and present the results. We conclude with a discussion of the results in Section 4.

This work was conducted at the Johns Hopkins University Summer Workshop, which was supported by National Science Foundation Grant Number IIS-0833652, with supplemental funding from Google Research, DARPA's GALE program and the Johns Hopkins University Human Language Technology Center of Excellence. BUT researchers were partially supported by Czech MPO project No. FR-TI1/034. Thanks to CLSP staff and faculty, to Tomas Kašpárek for system support, to Patrick Nguyen for introducing the participants, to Mark Gales for advice and HTK help, and to Jan Černocký for proofreading and useful comments.

2. PRONUNCIATION ESTIMATION

Theoretically, the problem of lexicon estimation for words can be defined as

    \hat{Prn} = \arg\max_{Prn} P(Prn | W, X),    (1)

where P(Prn | W, X) is the likelihood of the pronunciation given the word sequence and acoustic data. If optimized in an unconstrained manner (for the words for which acoustic data is available), each instance of a word could potentially have a different optimal pronunciation. It has been found in practice that doing such an optimization without additional constraints does not improve the system's performance. Also, this approach is not applicable to words that have not been seen in the acoustic training data. For these words it is necessary to have a well trained G2P system.
2.1. Deriving pronunciations from graphemes

We use the joint-multigram approach for grapheme-to-phoneme conversion proposed in [7, 8] to learn these pronunciation rules in a data-driven fashion. Using a G2P engine gives us one additional advantage. Due to the statistical nature of the engine that we use, it is possible to estimate not only the most likely pronunciation of a word but also a list of other, less likely pronunciations. This lets us split the pronunciation search into two parts. In the first part, we find a set of N possible pronunciations for each word W by training a G2P with a bootstrap lexicon. We then use the acoustic data X to choose the pronunciation \hat{Prn} that maximizes the likelihood of the data.

Using a set of graphoneme (pair of grapheme and phoneme sequence) probabilities, the pronunciation generation models learn the best rules to align graphemes to phonemes. The trained models are used to derive the most probable pronunciation \hat{Prn} for each word W, such that

    \hat{Prn} = \arg\max_{Prn} P(W, Prn),    (2)

where P(W, Prn) is the joint probability of the word and its possible pronunciations. Trained acoustic models are then used to derive the most probable pronunciation \hat{Prn} for each word W in the acoustic data. Using the acoustic data X, we approximate Eqn. (1) as

    \hat{Prn} = \arg\max_{Prn} P(X | Prn) P(Prn | W).    (3)

Limiting the number of alternate pronunciations for each word to the top N pronunciations of the word and assuming P(Prn | W) to be a constant for each word, Eqn. (3) reduces to

    \hat{Prn} = \arg\max_{Prn \in \text{top-}N \text{ pron. of } W} P(X | Prn).    (4)

The trained G2P models are used to generate pronunciations for the remaining words, i.e., words in the training corpus and in the recognition language model of the ASR system that are not present in the initial pronunciation dictionary.
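As an illustration of the two-stage search in Eqs. (2)-(4), the following is a minimal Python sketch, not the workshop's actual tooling: g2p_nbest (returning n-best candidate pronunciations with their joint probabilities) and align_loglik (returning the acoustic log-likelihood of force-aligning one utterance with a fixed pronunciation) are hypothetical stand-ins for a G2P engine and an acoustic aligner.

    def pick_pronunciation(word, utterances, g2p_nbest, align_loglik, n=5):
        """Eq. (4): among the top-N G2P candidates for `word`, return the
        pronunciation that maximizes the total acoustic log-likelihood of
        the utterances containing the word."""
        candidates = g2p_nbest(word, n)   # [(phone_seq, P(W, Prn)), ...], per Eq. (2)
        best_prn, best_score = None, float("-inf")
        for prn, _joint in candidates:    # P(Prn | W) treated as constant: Eq. (3) -> (4)
            score = sum(align_loglik(utt, word, prn) for utt in utterances)
            if score > best_score:
                best_prn, best_score = prn, score
        return best_prn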
2.2. Refining pronunciations

We start the iterative process of building a lexicon using an initial pronunciation dictionary containing a few hand-crafted pronunciations. We use this dictionary as a bootstrap lexicon for training G2P models as described in the previous section. Since we do not have any trained acoustic models yet, we use the G2P models to generate pronunciations for all the remaining words in the recognizer's vocabulary. Our first acoustic models are then trained using this dictionary.

We now use this initial acoustic model to search for the best pronunciations of words as described earlier. In Eq. (4), which is essentially a forced alignment step involving a Viterbi search through the word lattices, pronunciations that increase the likelihood of the training data are picked up. We use the set of pronunciations derived from this process to create a new pronunciation dictionary. This new pronunciation dictionary, along with the initial pronunciation dictionary of hand-crafted pronunciations, is used to re-train the G2P models and subsequently new acoustic models. Using these acoustic models to force align the data, we recreate the pronunciation dictionary with the new pronunciations to build new G2P models. This procedure is repeated iteratively until the best performing acoustic models are obtained. We do not retain multiple pronunciations in the dictionary for each word, as we did not find this to be helpful. Instead, we pick the pronunciation with the maximum number of aligned instances for the word. Before using the resulting dictionary to train the G2P, we also discard words where the chosen pronunciation had only one aligned instance in the data.
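The selection rule at the end of each iteration reduces to a count-and-filter step. A minimal sketch follows, assuming the forced-alignment pass has been collapsed to a list of (word, phone sequence) pairs, one per aligned word instance; this input format is an assumption, not the output of any particular aligner.

    from collections import Counter

    def select_pronunciations(alignments):
        """Keep, for each word, the pronunciation aligned in the largest
        number of instances; drop words whose winning pronunciation was
        aligned only once."""
        counts = {}                                  # word -> Counter over phone tuples
        for word, phones in alignments:
            counts.setdefault(word, Counter())[tuple(phones)] += 1
        lexicon = {}
        for word, ctr in counts.items():
            best_prn, n_instances = ctr.most_common(1)[0]
            if n_instances > 1:                      # discard single-instance evidence
                lexicon[word] = list(best_prn)
        return lexicon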
2.2.1. Approach for phonetic languages such as Spanish

In the case of Spanish, since letter-to-sound (LTS) rules are very simple, the G2P system does not generate sufficient alternates for the dictionary learning described above. We therefore use an unsupervised approach to generate an optimized pronunciation dictionary using the acoustic training data. Using an ASR system built with the initial pronunciation dictionary, we decode the training data both phonetically and at the word level. We use the time stamps on these recognized outputs to pick a set of reliable, phonetically annotated words. The selection procedure is illustrated with an example below. Table 1 shows an illustration of the phonetic and word level recognition of a hypothetical sentence. The sentence is transcribed at the word level into the sequence of words "w1 w2 w3 ... w6" and at the phonetic level into the phonemes "p1 p2 p3 ... p30".

Table 1. Illustration of aligning phonetic and word level transcriptions

    Start frame   Word  |  Start frame   Phoneme
    10            w1    |   8            p1
                        |  11            p2
    ...           ...   |  ...           ...
    21            w2    |  21            p9
                        |  24            p10
                        |  26            p11
    28            w3    |  28            p12
    ...           ...   |  ...           ...
    47            w6    |  43            p25
                        |  44            p26
                        |  ...           ...
                        |  48            p30

In this example, we pick the phoneme sequence "p9 p10 p11" as the pronunciation of the word w2, as their phonetic and word alignments match. In this unsupervised approach, by indirectly using the likelihoods of the acoustic data, we rely on the acoustic data to pick reliable pronunciations.
2.3. Adding more pronunciations to the dictionary using untranscribed audio data

Using the best acoustic models trained in the previous step, new pronunciations are added to the pronunciation dictionary in this step. We use the best acoustic model to decode in-domain speech from different databases. The decoded output is augmented with a confidence score representative of how reliable the recognized output is. The recognized output is also used as a reference transcript to force align the acoustic data to phonetic labels. For this forced alignment step we use a reference dictionary with the top N pronunciations (for example, N = 5) from the best G2P model. Using a threshold on the confidence score, reliable words and their phonetic labels are selected. Table 2 shows an illustration of a decoded sentence along with confidence scores for each word. The sentence is decoded into a sequence of words "w1 w2 w3 ... w8" with confidence scores "c1 c2 c3 ... c8". Using the decoded sequence of words, the sentence is also force-aligned into phonemes "p1 p2 p3 ... p48".

Table 2. Illustration of a decoded sentence along with confidence scores and aligned phonetic labels

    Start frame   Word   Confidence score   Phoneme
    4             w1     c1 = 0.15          p1
                                            p2
    ...           ...    ...                ...
    11            w3     c3 = 0.93          p10
                                            p11
                                            p12
    15            w4     c4 = 0.84          p13
    ...           ...    ...                ...
    42            w7     c7 = 0.96          p32
                                            p33
                                            ...
                                            p48

In our case, we set a confidence score threshold of 0.9 and select words like w3, with its phonetic transcription "p10 p11 p12". We also remove pronunciations that are not clear winners against other competing pronunciations of the same word instance. We train G2P models after adding new words and their pronunciations derived using this unsupervised technique.
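A minimal sketch of this selection, assuming the decoder output has been collapsed to (word, confidence, aligned phone sequence) triples (a hypothetical format); the clear-winner check can then reuse the instance counting shown in Section 2.2.

    def select_by_confidence(decoded, threshold=0.9):
        """Keep the aligned phone sequence only for word instances whose
        confidence score clears the threshold (0.9 in our experiments)."""
        return [(word, phones)
                for word, conf, phones in decoded
                if conf >= threshold]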
3. EXPERIMENTS AND RESULTS

For our experiments in English, we built an LVCSR system using the Callhome English corpus [9]. The conversational nature of the speech database, along with high out-of-vocabulary rates, use of foreign words and telephone channel distortions, makes the task of speech recognition on this database challenging. The conversational telephone speech (CTS) database consists of 120 spontaneous telephone conversations between native English speakers. Eighty conversations, corresponding to about 15 hours of speech, form the training set. The vocabulary size of this training set is 5K words. Instead of using a pronunciation dictionary that covers the entire 5K words, we use a dictionary that contains only the 1K most frequently occurring words. The pronunciations for these words are taken from the PRONLEX dictionary.

Two sets of 20 conversations, each containing roughly 1.8 hours of speech, form the test and development sets. With the selected set of 1K words, the OOV rate is close to 12%. We build a 62K trigram language model (LM) with an OOV rate of 0.4%. The language model is interpolated from individual models created using the English Callhome corpus, the Switchboard corpus, the Gigaword corpus and some web data. The web data is obtained by crawling the web for sentences containing high frequency bigrams and trigrams occurring in the training text of the Callhome corpus. We use the SRILM tools to build the LM. We use 39-dimensional PLP features to build a single pass HTK [10] based recognizer with 1920 tied states and 18 mixtures per state, along with this LM.

In our experiments, our goal is to improve the pronunciation dictionary such that it effectively covers the pronunciations of unseen words in the training and test sets. Figure 1 illustrates the iterative process we use to improve this limited pronunciation dictionary for English. We start the training process with a pronunciation dictionary of the 1K most frequently occurring words. This pronunciation dictionary is used to train G2P models, which generate pronunciations for the remaining unseen words in the train and test sets of the ASR system. As described in Section 2, we use the trained acoustic models to subsequently refine pronunciations. The forced alignment step picks pronunciations that increase the likelihood of the training data from a set of the 5 most likely pronunciations predicted by the model.

[Fig. 1. Schematic of lexicon learning with limited training examples. Loop: create an initial dictionary with limited training examples; train grapheme-to-phoneme models; use the models to generate pronunciations to train and test the LVCSR system; force align the training data with the new models; pick new pronunciations for words and update the pronunciation dictionary; repeat while the acoustic models improve. Once they stop improving, use the best models to decode in-domain speech from different databases, pick reliable pronunciations, update the dictionary, train new grapheme-to-phoneme models, and generate pronunciations to test the LVCSR system.]
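The loop of Fig. 1 can be summarized as below. This is a minimal sketch with the heavy stages abstracted as callables; train_g2p, train_acoustic_model and force_align are hypothetical stand-ins for the actual training and alignment tools, and select_pronunciations is the vote-and-filter sketch from Section 2.2.

    def learn_lexicon(seed_lexicon, vocab, train_data,
                      train_g2p, train_acoustic_model, force_align):
        """Iterate G2P training, lexicon generation, acoustic model training
        and forced alignment until recognition accuracy stops improving."""
        lexicon, best_acc, best = dict(seed_lexicon), float("-inf"), None
        while True:
            g2p = train_g2p(lexicon)          # retrain G2P on the current pool
            full_lex = {w: lexicon.get(w) or g2p(w) for w in vocab}
            am, acc = train_acoustic_model(full_lex, train_data)
            if acc <= best_acc:               # acoustic models no longer improve
                return best
            best_acc, best = acc, (full_lex, am)
            alignments = force_align(am, full_lex, train_data)
            # majority vote plus singleton filter (Section 2.2); hand-crafted
            # seed pronunciations always take precedence
            lexicon = {**select_pronunciations(alignments), **seed_lexicon}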
We select close to 3.5K words and their pronunciations from this forced alignment step, after throwing out singletons and words that do not have a clear preferred pronunciation. These new pronunciations, along with the initial training set, are then used in the next iteration. We continue this iterative process as long as the performance of the recognizer increases.

We start with models trained using only 1K graphonemes (word-pronunciation pairs). For each subsequent iteration, pronunciations from forced alignments are used to train new grapheme-to-phoneme models. Table 3 shows the word accuracies we obtain for different iterations of lexicon training. We obtain the best performance in Iteration 4. We use G2P models of order 4 in this experiment.

Table 3. Word recognition accuracies (%) using different iterations of training for English

    Iteration 1                                 41.38
    Iteration 2                                 42.00
    Iteration 3                                 41.45
    Iteration 4                                 42.93
    Iteration 5                                 42.77
    Iteration 6                                 42.37
    Iteration 4 + new pronunciations
      from untranscribed Switchboard data       43.25
    Full training dictionary                    44.35

To add new words and their pronunciations to the dictionary, we decoded 300 hours of Switchboard data using the best acoustic models obtained in Iteration 4. The decoded outputs were then used as labels to force align the acoustic data. Using the approach outlined in Section 2.3, we use a confidence based measure to select about 2.5K new pronunciations. These pronunciations are appended to the pronunciation dictionary used in Iteration 4. We added the pronunciations with a precedence order to ensure that words in the pronunciation dictionary have the most reliable pronunciations: limited hand-crafted pronunciations first, followed by pronunciations from forced alignment with the best acoustic models, and finally pronunciations from unsupervised learning, while allowing only one pronunciation per word. New grapheme-to-phoneme models are trained using this dictionary. Without retraining the acoustic models, we used the new grapheme-to-phoneme models to generate a new pronunciation dictionary. This new dictionary is then used to decode the test set. Adding words and pronunciations using this unsupervised technique improves the performance further, from 42.93% to 43.25%. To verify the effectiveness of our technique, we use the complete PRONLEX dictionary to train the ASR system. Compared to the best performance possible with the current training set, the iterative process brings us within 1% WER of the full ASR system. We use G2P models of order 8 while training with the complete dictionary.
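The precedence rule amounts to an ordered merge with one pronunciation per word. A minimal sketch, assuming each source is a dict mapping a word to its single phone sequence:

    def merge_with_precedence(hand_crafted, forced_aligned, unsupervised):
        """Later updates overwrite earlier ones, so unsupervised entries lose
        to forced-alignment entries, which lose to hand-crafted entries."""
        merged = dict(unsupervised)       # lowest precedence first
        merged.update(forced_aligned)
        merged.update(hand_crafted)       # hand-crafted pronunciations always win
        return merged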
In the second scenario, Spanish, the written form is phonetic and simple LTS rules are usually used for creating lexicons. For our experiments, we build an LVCSR system using the Callhome Spanish corpus. We attempt to improve the pronunciation dictionary for this language by creating an optimized initial pronunciation dictionary using the acoustic training data. Similar to the English database, the Spanish database consists of 120 spontaneous telephone conversations between native speakers. We use 16 hours of Spanish to train an ASR system as described before. We use an automatically generated pronunciation dictionary from Callhome as the initial pronunciation dictionary. After training an ASR system using this dictionary, we decode the training data both phonetically and at the word level. As described in Section 2.2, we derive a set of reliable pronunciations by aligning these transcripts. We use this new dictionary to train grapheme-to-phoneme models for Spanish. Similar to the English lexicon experiments, we train new acoustic models and grapheme-to-phoneme models using reliable pronunciations from a forced alignment step. Table 4 shows the results of our experiments with the Spanish data. Using the improved dictionary improves the performance of the system by over 1%.

Table 4. Word recognition accuracies (%) using different initial pronunciation dictionaries for Spanish

    Using automatically generated LDC pronunciations   30.45
    Using optimized pronunciation dictionary           31.65

4. CONCLUSIONS

We have proposed and explored several approaches to improve pronunciation dictionaries created with only a few hand-crafted samples. The techniques provide improvements for ASR systems in two different languages using only a few training examples. However, the selection of the right techniques depends on the nature of the language. Although we explored unsupervised learning of the lexicon for English, we did not combine it with unsupervised learning of acoustic models. We plan to do so, and hope that this will make a powerful learning technique for resource-poor languages.

5. REFERENCES

[1] D. Povey et al., "Subspace Gaussian mixture models for speech recognition," submitted to ICASSP, 2010.
[2] L. Burget et al., "Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models," submitted to ICASSP, 2010.
[3] A. Ghoshal et al., "A novel estimation of feature-space MLLR for full-covariance models," submitted to ICASSP, 2010.
[4] T. Slobada and A. Waibel, "Dictionary learning for spontaneous speech recognition," in Proc. ICSLP, 1996.
[5] R. Singh, B. Raj, and R. M. Stern, "Automatic generation of phone sets and lexical transcriptions," in Proc. IEEE ICASSP, 2000, pp. 1691-1694.
[6] C. Wooters and A. Stolcke, "Multiple-pronunciation lexical modeling in a speaker independent speech understanding system," in Proc. ICSLP, 1994.
[7] S. Deligne and F. Bimbot, "Inference of variable-length linguistic and acoustic units by multigrams," Speech Communication, vol. 23, no. 3, pp. 223-241, 1997.
[8] M. Bisani and H. Ney, "Joint sequence models for grapheme-to-phoneme conversion," Speech Communication, vol. 50, no. 5, pp. 434-451, 2008.
[9] A. Canavan, D. Graff, and G. Zipperlen, "CALLHOME American English Speech," Linguistic Data Consortium, 1997.
[10] S. Young et al., "The HTK Book," Cambridge University Engineering Department, 2009.
