
Issues in Building General Letter to Sound Rules

Alan W Black, Kevin Lenzo, Vincent Pagel

[email protected], CSTR, University of Edinburgh
[email protected], Carnegie Mellon University
[email protected], Faculté Polytechnique de Mons

ABSTRACT

In general text-to-speech systems, it is not possible to guarantee that a lexicon will contain all words found in a text, so some system for predicting pronunciation from the word itself is necessary.

Here we present a general framework for building letter to sound (LTS) rules from a word list in a language. The technique can be fully automatic, though a small amount of hand seeding can give better results. We have applied this technique to English (UK and US), French and German. The generated models achieve 75%, 58%, 93% and 89% words correct, respectively, for held out data from the word lists.

To test our models on more typical data we also analyzed general text to find which words do not appear in our lexicon. These unknown words were used as a more realistic test corpus for our models. We also discuss the distribution and type of such unknown words.

1. INTRODUCTION

Given that lexicons are closed by their nature and that the input text for general text to speech (TTS) systems is open, there will always be words in the text which are not contained within even the largest lexicon. Even when a large lexicon can be constructed to cover the whole vocabulary, it would still be useful to have a principled method for reducing the size of the lexicon (which we discuss more fully in [11]).

In many languages the orthographic system has some relationship to the pronunciation. Depending on the language, the relationship may be trivial (as in Spanish), relatively difficult (as in English), or harder still (as in Japanese with full kanji). Humans can (often) pronounce words reasonably even when they have never seen them before. It is that ability we wish to capture automatically in an LTS rule system.

Here we present a method for taking large lists of words and pronunciations and building generalized rule systems that not only produce reasonable pronunciations for unseen words but also allow us to remove the regular examples from the list, so that much smaller lexicons are adequate for the same coverage.

2. LETTER-PHONE ALIGNMENT

To make the building of models easier we wish to have a standardized alignment between the letters in an entry and the phones in its pronunciation.

The number of letters in a word and the number of phones in its pronunciation are in general not a one to one match. For the languages we have investigated, letters can map to zero, one, two or, very exceptionally, three phones. Even when there are the same number of letters and phones, the "correct" alignment may not be the most simple. In general there tend to be fewer phones than letters.

The cases where a letter goes to more than one phone are fairly restricted (e.g. x to /k s/, o to /w uh/ as in "one"). Almost all letters can in some context correspond to no phone, which we will call epsilon.

A more complex model mapping multi-letter clusters to zero or more phones is also possible, though it introduces complexities in the model learning and alignment process that we preferred to avoid.

Ideally we would like a purely automatic method for finding the best single-letter alignments, but so far we have achieved better results from a hand-seeded method.

The hand-seeded method requires the explicit listing of which phones (or multi-phones) each letter in the alphabet may correspond to, irrespective of context. This is relatively easy to do and can be done as an interactive process over the training set, with new correspondences added to the allowables list as they are found. For example the letter "c" may be realized as any one of

    epsilon k ch s sh t-s

Vowel letters typically have a much longer list of potential phones.

The hand-seeded algorithm takes the list of allowables and finds all possible alignments between each entry's letters and phones. A count is taken of which correspondences are used in each alignment, and a table of probabilities of a phone (epsilon or multi-phone) given a letter is estimated, again irrespective of context. The entries are then re-aligned, each possible alignment is scored with the generated probabilities, and the best alignment is selected. The alignments generated by this algorithm are close to what would be produced by hand, and it is very rare to find alignments that would be considered unacceptable.
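As a concrete illustration, here is a minimal sketch of that search in Python. The allowables table is a toy fragment and the helper names (ALLOWABLES, alignments, estimate_probs, best_alignment) are ours, not the paper's; entries that admit no alignment simply yield nothing, mirroring the alignment failures discussed below.

    from collections import defaultdict
    from math import log

    # Toy allowables: for each letter, the phone sequences it may map to,
    # irrespective of context; the empty tuple () is epsilon.
    ALLOWABLES = {
        "c": [(), ("k",), ("ch",), ("s",), ("sh",), ("t", "s")],
        "a": [(), ("ae",), ("ey",), ("ah",)],
        "s": [(), ("s",), ("sh",), ("z",)],
        "h": [(), ("hh",)],
    }

    def alignments(letters, phones):
        """Yield every alignment that assigns each letter one of its
        allowable phone sequences such that they concatenate to the
        entry's full phone string."""
        if not letters:
            if not phones:
                yield []
            return
        for sub in ALLOWABLES[letters[0]]:
            if tuple(phones[:len(sub)]) == sub:
                for rest in alignments(letters[1:], phones[len(sub):]):
                    yield [(letters[0], sub)] + rest

    def estimate_probs(entries):
        """Count correspondences over all alignments of all entries,
        then normalize into P(phone sequence | letter)."""
        counts = defaultdict(lambda: defaultdict(float))
        for word, phones in entries:
            for ali in alignments(word, phones):
                for letter, sub in ali:
                    counts[letter][sub] += 1.0
        return {l: {s: c / sum(subs.values()) for s, c in subs.items()}
                for l, subs in counts.items()}

    def best_alignment(word, phones, probs):
        """Re-align an entry, scoring each candidate by the (log)
        product of its correspondence probabilities."""
        return max(alignments(word, phones),
                   key=lambda a: sum(log(probs[l][s]) for l, s in a))

    entries = [("cash", ("k", "ae", "sh"))]
    probs = estimate_probs(entries)
    print(best_alignment("cash", ("k", "ae", "sh"), probs))
    # [('c', ('k',)), ('a', ('ae',)), ('s', ('sh',)), ('h', ())]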
The building of the allowables table is simple and quick, though it does require some skill; it can, however, be done even without an in-depth knowledge of the language the lexicon is for. A few words do not produce alignments (which would require new entries in the allowables table); these typically represent classes for which the relationship between the letter form and the phones is too opaque. They are typically abbreviations, such as "dept" as /d ih p aa r t m ah n t/; words with very unusual pronunciations, e.g. "lieutenant" (British English); foreign words (e.g. "Lvov"); and what could be considered mistakes in the lexicon, e.g. "cannibalistic" with two /l/ phones. Typically the number of entries that fail to align is well under 1%.

The second alignment method is an application of the expectation maximization (EM) algorithm [7], which we call the "epsilon scattering method". The idea is to estimate the probabilities for one letter to match one phone, and to use DTW to introduce epsilons at the positions maximizing the probability of the word's alignment path. Once the dictionary is aligned, the association probabilities can be computed again, and so on until convergence; for example, five iterations are necessary on the CMU lexicon.

Algorithm:

    /* initialize prob(L,P) */
    1 foreach word in training_set
          count with DTW all possible L/P
          associations for all possible epsilon
          positions in the phonetic transcription
    /* EM loop */
    2 foreach word in training_set
          compute new_p(L,P) on alignment_path
    3 if (prob != new_p) goto 2
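A rough Python rendering of this loop follows, under one simplifying assumption we make for brevity: all epsilon placements are enumerated directly rather than found with DTW as in the paper. The names (scatter, normalize, em_align) are ours.

    from itertools import combinations
    from collections import defaultdict
    from math import log

    def scatter(word, phones):
        """Yield every length-matched alignment made by inserting
        epsilons (None) into the phone string; assumes there are no
        more phones than letters."""
        for eps in combinations(range(len(word)), len(word) - len(phones)):
            it = iter(phones)
            yield [(ch, None if i in eps else next(it))
                   for i, ch in enumerate(word)]

    def normalize(counts):
        """Raw letter/phone counts -> P(phone | letter)."""
        return {l: {p: c / sum(ps.values()) for p, c in ps.items()}
                for l, ps in counts.items()}

    def em_align(entries, max_iter=10):
        """Step 1: count every epsilon placement equally (the
        'scattering').  Steps 2-3: take each word's most probable
        alignment path, re-estimate the table, and repeat until the
        paths stop changing (about five iterations on CMUDICT,
        according to the paper)."""
        counts = defaultdict(lambda: defaultdict(float))
        for word, phones in entries:
            for ali in scatter(word, phones):
                for l, p in ali:
                    counts[l][p] += 1.0
        prob, prev = normalize(counts), None
        for _ in range(max_iter):
            counts = defaultdict(lambda: defaultdict(float))
            paths = []
            for word, phones in entries:
                best = max(scatter(word, phones),
                           key=lambda a: sum(log(prob[l].get(p, 1e-9))
                                             for l, p in a))
                paths.append(best)
                for l, p in best:
                    counts[l][p] += 1.0
            prob = normalize(counts)
            if paths == prev:   # converged: alignments unchanged
                break
            prev = paths
        return prob, paths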
This differs from [6] in that the probabilities are distributed equally ("scattered") among each of the possible alternatives, rather than assigning an arbitrary weight to each shift.

When we build models from the results of alignment using each of the above algorithms on the OALD, we get the following results:

    Method              Letters correct   Words correct
    Epsilon scattering  90.69%            63.97%
    Hand-seeded         93.97%            78.13%

"Letters correct" is the percentage of letter-phone pairs which are correctly predicted with respect to the test set. "Words correct" is the percentage of complete words where the entire predicted phone string (minus epsilons, but including stress markers) is correct with respect to the test set.
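For concreteness, a small sketch of how these two scores can be computed, assuming one predicted label per letter with "epsilon" marking no phone (a data layout of our choosing, not one prescribed by the paper):

    def lts_scores(pred_words, gold_words):
        """pred_words/gold_words: one list of per-letter phone labels
        per word, 'epsilon' where a letter yields no phone."""
        letter_hits = letters = word_hits = 0
        for pred, gold in zip(pred_words, gold_words):
            letter_hits += sum(p == g for p, g in zip(pred, gold))
            letters += len(gold)
            strip = lambda seq: [p for p in seq if p != "epsilon"]
            word_hits += strip(pred) == strip(gold)  # whole string match
        return (100.0 * letter_hits / letters,
                100.0 * word_hits / len(gold_words))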
So we can see clearly that the hand-seeded method is better. However, we still feel that the hand seeding is a simple task, and that we have not yet fully investigated methods for improving the automatic approach to the level of the hand-seeded one.

3. BUILDING RULES

Once an alignment is found we can train a phone prediction model. In our work we have used decision tree technology [3], as we feel this is simple and produces compact models. We also feel that other learning techniques would not produce significantly better results.

For each letter in the alphabet of the language we trained a CART tree, given the letter context (three letters either side), to predict epsilon, a phone or a double phone from the aligned data. One can build a single tree without any significant difference in accuracy, but building separate trees is faster and allows for parallelization.
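A minimal sketch of this training setup, using scikit-learn's DecisionTreeClassifier as a stand-in for the authors' CART tools (an assumption on our part; Festival ships its own tree builder) and the aligned data format from the sketches above. Here 'stop' stands in for the CART stop value discussed below, mapped to the minimum number of examples allowed in a leaf:

    from sklearn.tree import DecisionTreeClassifier

    PAD = "#"  # padding symbol beyond the word boundaries

    def context(word, i, width=3):
        """Features for letter i: the three letters either side,
        padded at the edges and encoded numerically for sklearn."""
        padded = PAD * width + word + PAD * width
        return [ord(c) for c in padded[i:i + 2 * width + 1]]

    def train_letter_trees(aligned, stop=1):
        """One tree per letter, predicting epsilon, a phone or a
        multi-phone for that letter in context."""
        data = {}
        for word, alignment in aligned:  # alignment: [(letter, phones)]
            for i, (letter, phones) in enumerate(alignment):
                X, y = data.setdefault(letter, ([], []))
                X.append(context(word, i))
                y.append("_".join(phones) if phones else "epsilon")
        return {letter:
                DecisionTreeClassifier(min_samples_leaf=stop).fit(X, y)
                for letter, (X, y) in data.items()}

    def predict(trees, word):
        """Apply the per-letter trees left to right, dropping epsilons;
        assumes every letter of the word was seen in training."""
        out = []
        for i, letter in enumerate(word):
            label = trees[letter].predict([context(word, i)])[0]
            if label != "epsilon":
                out.extend(label.split("_"))
        return out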
We split the data into train and test sets by removing every tenth word from the lexicon. This means that the data set contains only one occurrence of each word, and hence word frequency is ignored. Another factor is that, as these lexicons usually contain many morphological variations, it is likely there will be a similar word or words in the training set.

We removed short words (under four letters) from the training and test sets, as these are typically function words, which in general may have non-standard pronunciations, or abbreviations (e.g. "aaa" as /t r ih p ah l ey/), which have little or no relationship with their pronunciation. Also, where part of speech information was available, we removed all non-content words. The reasoning is that unknown words are typically not the most common words, and in general unknown words will have more standard pronunciations rather than idiosyncratic ones.
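In code, the split described above amounts to no more than the following (a trivial sketch; the function name is ours):

    def split_lexicon(entries):
        """Drop words under four letters, then hold out every tenth
        remaining entry as test data."""
        entries = [(w, p) for w, p in entries if len(w) >= 4]
        train = [e for i, e in enumerate(entries) if i % 10 != 0]
        test = [e for i, e in enumerate(entries) if i % 10 == 0]
        return train, test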
We have so far tried this technique on four lexicons: the Oxford Advanced Learner's Dictionary of Contemporary English (OALD, British English) [10], CMUDICT (US English) [4], BRULEX (French) [5] and the German CELEX lexicon [1].

    Lexicon    Letters correct   Words correct
    OALD       95.80%            74.56%
    CMUDICT    91.99%            57.80%
    BRULEX     99.00%            93.03%
    DE-CELEX   98.79%            89.38%

CMUDICT, although also English, does not get results as good as OALD's, as it contains many more "foreign" words, particularly names, which are much harder to predict without any higher level information (such as ethnic origin).

The above are the best results achieved after testing various parameters in the CART building process. In particular we varied the "stop" value, which specifies the minimum number of examples necessary in the training set before a question is hypothesized to distinguish the group. Normally, the smaller the stop value, the more over-trained the models may become. However, the following table shows the results for OALD, tested on held out data, while changing the stop value:

    Stop   Letters correct   Words correct   Size
    8      92.89%            59.63%           9884
    6      93.41%            61.65%          12782
    5      93.70%            63.15%          14968
    4      94.06%            65.17%          17948
    3      94.36%            67.19%          22912
    2      94.86%            69.36%          30368
    1      95.80%            74.56%          39500

As the stop value is reduced, the size of the model increases (the model size is the total number of questions and leaf nodes in the generated CART trees). However, it appears that more finely tuned data is always better, such that even with a stop value of 1 the model is not over-trained.

Note that comparisons with other LTS training techniques are not easy: when the train/test sets differ, and when the domains differ, no direct comparison is possible. For example, if we remove proper names from the OALD and train and test on the remainder, our word correct score goes up to 80%. Nevertheless, the above results compare favorably with other systems using similar data sets (e.g. [8]).

4. STRESS ASSIGNMENT

The importance and realization of lexical stress vary between languages, but producing a reasonable pronunciation from a string of letters often takes more than producing a string of phones: lexical stress markings are also required. In English, lexical stress may differ depending on syntactic class, and it may even move under some morphological derivations. Therefore predicting lexical stress for each vowel in the predicted string cannot, in general, be done from the letter context alone. However, results in [12] suggest that combining phone and stress prediction in a single model gives better results.

We tested this on the OALD data set. We first built letter to phone models where lexical stress information was removed from the phones, and we trained a separate stress prediction model on the same data, using features such as syllable position in the word, vowel length, vowel height, number of syllables from the end of the word, and part of speech. On held out data from the OALD the per syllable results are:

                 Predicted
    Actual       unstressed   stressed   %
    unstressed   7390         378        95.1%
    stressed     512          8207       94.1%
    total correct: 15597/16487 (94.6%)

This model was combined with the output of the letter to phone model (LTP+S).

The second model introduced two types of vowel phone, stressed and unstressed versions. The standard LTS model building technique was applied, so the CART trees themselves produced phone and stress information directly (LTPS).

              LTP+S    LTPS
    LNS       96.36%   96.27%
    Letter    n/a      95.80%
    WNS       76.92%   74.69%
    Word      63.68%   74.56%
    (LNS = letters correct ignoring stress; WNS = words correct ignoring stress)

A score for "letters correct" for the separate-model approach is not available, as the stress prediction model does not preserve the alignment.

Thus it can be clearly seen that although higher per-word scores are possible when stress is ignored, a separate model applied afterwards gives significantly lower results than having the phones and stress levels predicted by a single model.
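The LTPS encoding amounts to folding stress into the phone symbols themselves, so the per-letter trees predict phone and stress at once. A toy sketch (the vowel set is a fragment and the stress digits follow CMUDICT-style conventions, our choice of notation):

    VOWELS = {"aa", "ae", "ah", "eh", "ey", "ih", "iy", "ow", "uh", "uw"}

    def fold_stress(phones, stresses):
        """Make stressed and unstressed vowels distinct prediction
        targets, e.g. ('ow', 1) -> 'ow1'."""
        return [p + str(s) if p in VOWELS else p
                for p, s in zip(phones, stresses)]

    print(fold_stress(["f", "ow", "t", "ah", "g", "r", "ae", "f"],
                      [0, 1, 0, 0, 0, 0, 2, 0]))
    # ['f', 'ow1', 't', 'ah0', 'g', 'r', 'ae2', 'f']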
We also discovered that including part of speech information in the phone prediction models themselves improved the accuracy of the model. Without POS information the combined model gives 95.32% letters correct and 71.28% words correct. Thus part of speech obviously helps, and it is readily available in a TTS system with a standard POS tagger, even for unknown words.

Ultimately, stress cannot be predicted from local context alone, as there are a number of examples in English where local context is insufficient (cf. photograph/photography). Ideally, morphological decomposition is required for such prediction, but we have not yet investigated this area.

5. DOES IT REALLY WORK

To get a more realistic assessment of these models' treatment of unknown words, we processed the first section of the WSJ Penn Treebank [9]. This consists of a total of 39923 words of news text. Using our standard OALD lexicon we find that a total of 1775 words (4.6%) are not found in the lexicon, 943 of which are unique. Among those unknown words we find the following distribution:

                        Occurs   %
    names               1360     76.6
    unknown             351      19.8
    American spelling   57       3.2
    typos               7        0.4

American spellings (e.g. "honor", "center") are distinguished here as they are so systematic. As OALD is a British English lexicon it does not contain such spellings, though for TTS use it obviously should. As the WSJ is more carefully edited than other text such as email, the issue of typos is almost negligible. We have done a similar analysis of unknown words from Time magazine articles, finding a very similar distribution and ratio of unknowns; thus we feel the above is typical of news story text.

We listened to each of the 1775 words as pronounced by a number of the models discussed above, and a yes/no decision was made about acceptability. Note that a number of these words have multiple acceptable pronunciations; if any of those was predicted, the word was deemed acceptable. For example, the pronunciations of "Reagan" as /r ey g ah n/ and as /r iy g ah n/ were both considered acceptable.

The best results, shown above for OALD, were obtained by building the deepest possible trees. But when those models were applied to these unknown words, the results showed that although the models were not over-trained for the unseen test set extracted from the lexicon itself, they were over-trained for these unknown words. The following shows the results after varying the stop value for CART building:

    Stop   Lexicon test set   Unknown test set   Size
    1      74.56%             62.14%             39500
    4      65.17%             67.66%             17948
    5      63.15%             70.65%             14968
    6      61.65%             67.49%             12782

Thus the best model for unknown words is not the best model for the held out lexical entries. What is more, the best model for unknown words is less than 40% of the size of the best model for the lexical test set. These figures reflect both the fact that the held out data in the lexical test set (every tenth entry) is often just a morphological variation of the entries around it, and the fact that the lexical test set does not take into account the word frequency of unknown words.
Looking at the words that are pronounced wrongly, we find some mistakes are still recognizable (e.g. Chrysler as /k r ih s l ah er/), but many are unacceptable and unrecognizable, showing there is still work to be done. Further analysis of these words shows:

                        Occurs   %
    names               413      79
    unknown             94       18
    American spelling   7        0
    typos               2        0

One would expect proper names to be the hardest to pronounce (especially those of foreign origin), but although it appears they are slightly harder, our model seems to do as well on them as on other non-names.

Further analysis of the types of names that are still unpronounceable shows a larger proportion of non-Anglo-Saxon origin than among those that are correctly pronounced. As many of the languages these names originate from have a more standardized pronunciation than English (e.g. Polish, Italian, or Japanese in its romanized form), knowing the origin of an unknown word might allow more specific rules to be applied, but we have not yet investigated this area.

6. SUMMARY

We have presented automatic (and near automatic) processes for building letter to sound rule systems from lists of entries and their pronunciations. We have successfully built LTS models for four different languages and feel confident this process will work for many other languages. As well as quoting results on held out data from the word lists used for training, we also present results of applying one model to unknown words from news text.

This method is fully implemented and documented, and is distributed with the Festival Speech Synthesis System [2]; a Perl implementation is also available from http://www.cs.cmu.edu/~lenzo/t2p.

ACKNOWLEDGEMENTS

We gratefully acknowledge the support of the UK Engineering and Physical Science Research Council (EPSRC grant GR/K54229), the US National Science Foundation graduate research fellowship scheme, and the Oregon Graduate Institute for providing access to the German CELEX lexicon.

7. REFERENCES

1. R. Baayen, R. Piepenbrock, and L. Gulikers. The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, 1995.
2. A. Black, P. Taylor, and R. Caley. The Festival speech synthesis system. http://www.cstr.ed.ac.uk/projects/festival.html, 1998.
3. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA, 1984.
4. CMU. Carnegie Mellon Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 1998.
5. A. Content, P. Mousty, and M. Radeau. Une base de données lexicales informatisée pour le français écrit et parlé. L'Année Psychologique, 90:551–566, 1990.
6. W. Daelemans and A. van den Bosch. Language-independent data-oriented grapheme-to-phoneme conversion. In J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, editors, Progress in Speech Synthesis, pages 77–90. Springer Verlag, 1996.
7. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (Series B):1–38, 1977.
8. R. Luk and R. Damper. Stochastic phonographic transduction for English. Computer Speech and Language, 10:133–153, 1996.
9. M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330, 1993.
10. R. Mitten. Computer-usable version of Oxford Advanced Learner's Dictionary of Current English. Oxford Text Archive, 1992.
11. V. Pagel, K. Lenzo, and A. Black. Letter to sound rules for accented lexicon compression. In ICSLP98, Sydney, Australia, 1998.
12. A. van den Bosch, T. Weijters, and W. Daelemans. Modularity in inductively learned word pronunciation systems. In Proc. NeMLaP3/CoNLL98, pages 185–194, Sydney, 1998.
