[Figure: token coverage (%), from 50% to 80%, plotted against dictionary size, from 1,000 to 1,000,000 entries]

age. However, attaining complete or near-complete token coverage can require very many new types: 100% coverage of surname tokens would require the addition of more than 5,000,000 new entries.

So the number of new dictionary entries that are required to achieve complete coverage is huge, much too large to be added by hand. Automatic methods must therefore be sought which can provide high-quality pronunciation predictions for names.

1.2. A Hierarchical Approach

Liberman and Church [6] recognised that the pronunciation dictionary can be viewed as just the first in a series of filters for predicting the pronunciation of a word. In their approach, if a word is not found in the pronunciation dictionary, then attempts to predict the pronunciation are made with a sequence of linguistically-motivated filters – these include the addition of stress-neutral suffixes, rhyming and morphological decomposition. The first filter that fires produces the pronunciation. What all these filters have in common is that they generally do not produce output for every input – it is only the last link in the chain which must be able to do that.

With such a hierarchical approach in mind, it makes sense to look for new filters which can make sensible predictions for names which are not in the pronunciation dictionary. A new filter does not have to have a very high firing rate. All that is required for it to be useful is that, when it does fire, it produces predictions of a higher accuracy than the links in the chain below it. In the literature, the quality of predictions of automatically trained pronunciation rules is in the region of 70-75% [4, 7], and the best results of other techniques seem to be lower [5]. Therefore any filters with a higher success rate than this have potential for improving overall performance.

1.3. LTS is a many-to-one Mapping

The current work was motivated by the observation that, within a medium-sized surnames dictionary for RP English, roughly 10% of ways of pronouncing a name have more than one spelling. This is illustrated in Table 1 which shows, for each domain dictionary, the numbers of unique orthographic and phonetic entries.

  domain        orthographic   phonetic   >1 spelling (%)
  forenames     14962          13479      1747 (13.0)
  surnames      23746          21487      2641 (12.3)
  streetnames   16211          15267      1358 (8.9)
  placenames     3668           3680       153 (4.2)

Table 1. For each domain dictionary, the number of unique orthographic entries (headwords), the number of distinct pronunciations, and the number (percentage in brackets) of pronunciations which have more than one spelling.

Thus given a list of names which are not in a particular dictionary, we hypothesize that about 10% of these names do already have a valid pronunciation in the dictionary. The LTS problem for these names is then the task of finding the mapping from OOV to in-vocabulary. In other words, the task is to try to find a homophone entry in the existing dictionary.

This problem is closely related to one in the field of "name retrieval", in which database queries are made more useful by allowing fuzziness in name matching. In name retrieval, the nearest matches to a search key (i.e. a name) are returned as "hits". These hits are found using a variety of methods (reviewed in [8, 9]) which typically involve the calculation of a distance between the key and each name in the database.

The oldest of these techniques, Soundex and Phonix, perform the distance measure implicitly by attempting to map each word to a representation shared by its "soundalikes". Soundex correctly identifies the names "Reynold" and "Reynauld" as soundalikes, but it also pairs "Catherine" and "Cotroneo" [8]. Explicit string edit distances have also been used in name retrieval, primarily for the identification of typing errors [9].

Further developments have seen the combination of explicit string edit distances with phonetically-motivated substring transformations. The link with phonetics was made explicit in Zobel and Dart's [8] phonometric approach.
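To make the Soundex behaviour described above concrete, the following is a minimal sketch of the classic Soundex coding scheme (not code from the paper, and simplified in its treatment of 'h' and 'w'); it reproduces both the correct Reynold/Reynauld match and the spurious Catherine/Cotroneo match:

```python
def soundex(name: str) -> str:
    """Classic 4-character Soundex code: the first letter, plus up to
    three digits encoding the consonant classes that follow."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    name = name.lower()
    first = name[0].upper()
    # Encode every letter; vowels (and here, simplistically, h/w/y)
    # map to '' so they separate repeated consonant codes.
    digits = [codes.get(ch, "") for ch in name]
    out = []
    prev = digits[0]          # a second letter repeating the first
    for d in digits[1:]:      # letter's code is skipped
        if d and d != prev:
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]

# Soundex collapses "soundalikes" to the same code...
print(soundex("Reynold"), soundex("Reynauld"))    # both R543
# ...but also conflates names that merely share a consonant skeleton.
print(soundex("Catherine"), soundex("Cotroneo"))  # both C365
```

Because the code retains only the first letter and three consonant classes, any two names sharing that skeleton collide – exactly the over-generation the name-retrieval literature criticises.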
The technique presented here derives string rewrite rules which are pronunciation-neutral in an existing pronunciation dictionary. Given an OOV name, the algorithm tries to find a string rewrite rule which rewrites the name to an in-vocabulary spelling. If it succeeds, then it has found a homophone for the OOV word, and the pronunciation can simply be looked up in the dictionary.

The algorithm will now be described in detail, first by showing how the model for spelling variation is trained from an existing dictionary, and then by discussing how the model is used to make pronunciation predictions for words which are OOV.

2.1. Training

The starting point for training is a dictionary which gives partial coverage of the domain in question. We favour using a domain-specific dictionary for this rather than a general-purpose dictionary, since we suspect that the nature of spelling variation is domain-dependent.

The first stage is to create a reverse dictionary, which maps pronunciations to orthography. All entries in the reverse dictionary which map one pronunciation to just one spelling are then removed. For the remainder, each pair of spellings which share a pronunciation is used to generate a sequence of rewrite rules. Each rewrite rule is of the form A -> B / L _ R, where the pattern A with L as left context and R as right context is replaced with the string B.

Consider an example: the pronunciation /l i1 n . z ii2/ is shared by the spellings linsey and lynsey (linsey=lynsey). Table 2 shows the postulated rewrite rules. The first rewrite rule is obtained by identifying, then removing, the common prefix and suffix between the two spellings of the original word pair linsey=lynsey.

Each of the rules is evaluated on the rest of the dictionary. For each entry in the dictionary, a particular rule will do one of four things:

MISS The pattern doesn't match (e.g. bilton)^2

OOV The pattern matches, but the resulting mapping is not in the dictionary (e.g. linton -> lynton, but lynton is OOV)

DIFF The pattern matches, the resulting mapping is in the dictionary, but the pronunciations are different (e.g. tin -> tyn, but /t i1 n/ != /t ii1 n/)

GOOD The pattern matches, the resulting mapping is in the dictionary, and the pronunciations are the same (e.g. linne -> lynne, and both are pronounced /l i1 n/)

Counting over the whole dictionary, each rule is assigned four scores: n_MISS, n_OOV, n_DIFF, and n_GOOD. Collectively these scores reflect how useful the rule is – how often it can be expected to fire, how often it will map into the dictionary, and how often it makes a pronunciation-neutral mapping.

Of the candidate rules, just one rule is chosen for inclusion in the rule set. Currently, the heuristic for choosing the best rule from each set is simply to choose the shortest rule which is always pronunciation-neutral when its pattern matches and it maps into the dictionary (n_DIFF = 0). In future it may be advantageous to add sophistication to this part of the technique.

The above process is repeated for all other spelling pairs, to yield a list of substitution rules.

^2 All examples in this list apply to rule BC in Table 2.
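As a rough illustration of the training procedure just described, the sketch below derives candidate rewrite rules from one spelling pair, scores each rule as MISS/OOV/DIFF/GOOD against a toy dictionary, and keeps the shortest rule that is always pronunciation-neutral (n_DIFF = 0). The dictionary contents, the context-widening scheme, and all function names are illustrative assumptions, not the paper's exact formulation:

```python
from itertools import product

# Toy pronunciation dictionary (spelling -> phone string); contents
# are invented to mirror the examples in the text.
DICT = {
    "linsey": "l i1 n . z ii2",
    "lynsey": "l i1 n . z ii2",
    "linne":  "l i1 n",
    "lynne":  "l i1 n",
    "linton": "l i1 n . t a0 n",
    "tin":    "t i1 n",
    "tyn":    "t ii1 n",
    "bilton": "b i1 l . t a0 n",
}

def candidate_rules(s1, s2):
    """Strip the common prefix/suffix of a spelling pair, then emit
    rules (A, B, L, R) meaning A -> B / L _ R, with progressively
    wider left/right contexts drawn from the stripped material."""
    p = 0
    while p < min(len(s1), len(s2)) and s1[p] == s2[p]:
        p += 1
    q = 0
    while q < min(len(s1), len(s2)) - p and s1[-1 - q] == s2[-1 - q]:
        q += 1
    a, b = s1[p:len(s1) - q], s2[p:len(s2) - q]
    prefix, suffix = s1[:p], s1[len(s1) - q:]
    return [(a, b, prefix[len(prefix) - i:], suffix[:j])
            for i, j in product(range(len(prefix) + 1),
                                range(len(suffix) + 1))]

def apply_rule(rule, word):
    """Rewrite the first occurrence of L+A+R in word, or None (MISS)."""
    a, b, l, r = rule
    pattern = l + a + r
    return word.replace(pattern, l + b + r, 1) if pattern in word else None

def score(rule):
    """Count the four outcomes of a rule over the whole dictionary."""
    n = {"MISS": 0, "OOV": 0, "DIFF": 0, "GOOD": 0}
    for word, pron in DICT.items():
        mapped = apply_rule(rule, word)
        if mapped is None:
            n["MISS"] += 1
        elif mapped not in DICT:
            n["OOV"] += 1
        elif DICT[mapped] != pron:
            n["DIFF"] += 1
        else:
            n["GOOD"] += 1
    return n

# Selection heuristic: the shortest rule that is always
# pronunciation-neutral when it fires into the dictionary
# (n_DIFF == 0), and that fires usefully at least once.
cands = candidate_rules("linsey", "lynsey")
neutral = [c for c in cands
           if score(c)["DIFF"] == 0 and score(c)["GOOD"] > 0]
best = min(neutral, key=lambda c: sum(len(x) for x in c))
a, b, l, r = best
print(f"chosen rule: {a} -> {b} / {l} _ {r}")
```

On this toy data the context-free rule i -> y is rejected because it maps tin to tyn, which has a different pronunciation (a DIFF), while the rule with one letter of left context remains pronunciation-neutral everywhere it fires.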