AN EXAMINATION OF UNDETECTED TYPING ERRORS

FRED J. DAMERAU and ERIC MAYS

Copyright © 1989 Pergamon Press plc. Printed in Great Britain.
Abstract-We examine the effect of increasing word list size on the error rate of spelling correctors. An experiment on a large body of text shows that an increase in the word list size decreases the error rate.
1. INTRODUCTION
A typical spelling correction program detects spelling errors by searching through a (virtual)* list of correctly spelled words. A word is determined to be spelled correctly only when it occurs in the list. Since some simple misspellings of words result in correct words, this technique will fail to recognize certain incorrectly spelled words. In the sentence, The Mets beta the Red Sox to win the 1986 World Series, the simple misspelling of beat as beta will go undetected. Petersen [1] made a study of the number of possible typing errors that would go undetected as the size of the list of correctly spelled words increases. As more words are added to the list, fewer correctly spelled words will be flagged as misspellings, but more incorrectly spelled words will pass as correct. If a word list did not contain beta, the misspelling of beat would be detected, but any correct uses of beta would be flagged as misspelled. Since the possible undetected typing errors as a fraction of all possible typing errors range from 10 percent for a 50,000 word list to 15 percent for a 350,000 word list, there is a potential problem in using large word lists for spelling correction [1]. In this paper, we report the results of an investigation into the actual occurrence of these possible errors in a large body of text.
We hypothesize that the number of possible undetected errors is significantly higher than their actual occurrence. That is, the kinds of errors that are actually made by people are governed by some principle that differs significantly from random behavior. We performed a two-phase experiment on a sample of over 20 million words of text. The first phase listed a sample of sentences that contained a word of rank** greater than 5000. The second phase listed all sentences that contained a word whose rank was between 50,000 and 60,000 and had a misspelling with rank below 50,000. A slightly modified second phase listed all sentences containing a word whose rank was between 50,000 and 60,000. (In our text sample, the words below rank 5000 accounted for 88 percent of the words in the text and the words below rank 60,000 accounted for 99.99 percent of the words in the text.) These sentence lists were then manually examined to find actual misspellings. We believe the hypothesis was verified.
For the purposes of this study (and Petersen's study [1]), a misspelling is more precisely a mistyping as defined by Damerau [2]. A word may be mistyped as a result of exactly one of the following mistakes:
1. one extra letter is inserted;
2. one letter is omitted;
3. one letter is replaced by another letter;
4. two adjacent letters are transposed.
*Some implementations may do morphological analysis or employ compact encodings, but the actual data
structure is irrelevant.
**The rank of a word is its position in a list of words sorted by descending frequency.
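These four single-error operations (addition, deletion, replacement, and transposition of adjacent letters, as defined by Damerau [2]) can be sketched as a generator of all possible mistypings of a word. The function name and the lowercase alphabet are our own illustrative choices, not from the paper:

```python
import string

def mistypings(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by exactly one of the four
    single-error operations of Damerau [2]."""
    results = set()
    # one extra letter inserted
    for i in range(len(word) + 1):
        for c in alphabet:
            results.add(word[:i] + c + word[i:])
    # one letter omitted
    for i in range(len(word)):
        results.add(word[:i] + word[i+1:])
    # one letter replaced by another
    for i in range(len(word)):
        for c in alphabet:
            if c != word[i]:
                results.add(word[:i] + c + word[i+1:])
    # two adjacent (distinct) letters transposed
    for i in range(len(word) - 1):
        if word[i] != word[i+1]:
            results.add(word[:i] + word[i+1] + word[i] + word[i+2:])
    return results

# The introduction's example: "beat" mistyped as the real word "beta"
print("beta" in mistypings("beat"))   # True
```

A spell checker that only tests list membership cannot catch such a mistyping, since the result is itself a listed word.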
Errors of this kind typically account for over 80 percent of all errors [2], [1], although special circumstances may show different statistics. A recent paper by Mitton [3] presents data from English school compositions showing quite different percentages. The data come from students, many of whom appear to have only a sketchy idea of correct spelling. It is clear from that paper that standard spelling correction algorithms would be of little use to this population. We assume that commercially important applications do not have this characteristic.
2. SOURCES
The following sources were used to build our list of correctly spelled words.
From these lists we eliminated possessives (words ending in ‘s or s’), and proper nouns
(words containing capital letters). The resulting merged word list contained 257,528 words,
all presumed correctly spelled.
The following sources were used both to determine word rank and as a source of
potential misspellings.
Table 1 lists for each text source the number of words (Words heading), the number
of unique words (Unique heading), and the number of unique words that appeared in the
merged wordlist (Unique verified heading). The number of words in the IBM PC and
Arpanet AI samples varied among the phases in our experiment. The actual number of
words in each phase will be noted accordingly. Note that we had no control over the amount of proofreading or automatic spell checking that was applied to any of the sources. However, our informal queries of people who submit to the bulletin boards leave us with the impression that very little proofreading or spell checking occurs.
The combined sources contained 61,519 unique words occurring in the list of verified words. We collected word frequency data and assigned a word rank for each source independently. Since some sources contained words that ranked high only within that source, a method for removing that bias was needed to obtain a reasonable combined ranking. If each source's rank differed from the mean rank by no more than the mean raised to the 0.85 power, the mean was used. Otherwise, if a single rank caused the mean to fall outside the 90 percent confidence level under a Student's t distribution, that rank was not used in computing the mean.
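The combining rule is terse as stated; one possible reading can be sketched in Python. The leave-one-out test and the hardcoded t critical values are our assumptions about details the paper does not spell out:

```python
from statistics import mean, stdev

# Two-sided critical values of Student's t at the 90 percent level,
# indexed by degrees of freedom (an assumption: the paper does not
# state the exact form of the test it used).
T_90 = {1: 6.314, 2: 2.920, 3: 2.353}

def combined_rank(ranks):
    """Combine a list of per-source ranks for one word into a single
    rank, damping source-specific bias as described in the text."""
    m = mean(ranks)
    # If every source agrees with the mean to within mean**0.85,
    # simply use the mean.
    if all(abs(r - m) <= m ** 0.85 for r in ranks):
        return m
    # Otherwise look for a single outlying rank: one that falls
    # outside the 90 percent confidence interval of the remaining
    # ranks, and exclude it from the mean.
    for i, r in enumerate(ranks):
        rest = ranks[:i] + ranks[i+1:]
        if len(rest) < 2:
            continue
        half_width = T_90[len(rest) - 1] * stdev(rest) / len(rest) ** 0.5
        if abs(r - mean(rest)) > half_width:
            return mean(rest)
    return m
```

On agreeing ranks such as [100, 110, 105] this returns the plain mean; a single wildly biased rank, as in [100, 110, 5000], is excluded before averaging.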
3. EXPERIMENT
In the first phase of the experiment we sampled sentences that contained a word with a rank greater than 5000. This sample consisted of 14,917 sentences from office correspondence, 17,074 sentences from the Arpanet, 11,567 sentences from the Canadian Parliament, and 22,585 sentences from the PC bulletin board. Based on a manual examination of the data, we compiled a list of the incorrect spellings. All of the sentences containing these misspelled words were then exhaustively listed from the sources. For this phase the AI source contained 2,418,279 words and the PC source contained 9,010,623 words.
The results of the examination of these sentences for incorrect spellings are found in Tables 2 and 3. The tables list the misspelling (Typo heading), the intended word (Word heading), and, for each source, the number of correct uses (C heading) and total uses (T heading). An extra space in a word in the Typo column is indicated by an equal sign.

Table 3. Misspellings with no correct uses of the misspelled word

Typo           Word          C T pairs (sources: Office, AI, Canada, PC)
aam            A.M.          0 2
adz            ads           0 1
ait            it            0 1
cant           can't         0 1    0 8    0 68
charied        chaired       0 2
col=lections   collections   0 2    0 4
dont           don't         0 18   0 156
fip            flip          0 1
firs           first         0 1    0 1
havent         haven't       0 1    0 19
i=ll           I'll          0 1
lier           liar          0 1    0 5
momo           memo          1* 2
od             do            0 1    0 1
pease          please        0 2
quires         requires      0 1
sill           still         0 1
stree          street        0 1
te             the           0 1    0 2    0 4    0 3
tempi          tempo         0 1
sixteen=th     sixteenth     0 1
th             the           0 8    0 2    0 9    0 44
wi=th          with          0 6    0 1
thats          that's        0 9
untill         until         0 12
whar           what          0 1
whats          what's        0 1

Table 2. Misspellings with correct uses of the misspelled word

Typo           Word          C T pairs (sources: Office, AI, Canada, PC)
origin=ally    originally    0 1    1 1
ant            and           6 6    0 1    1 3
coping         copying       9 9    7 7    6 10
dependant      dependent     0 3    1 2    0 52
discreet       discrete      1 2    2 4
equip=ment     equipment     3 5    4 4    13 13   2 2
lets           let           0 1
lets           let's         0 12   33 44  1 8     299 368
wont           won't         1 3    2 2    1 35
Table 3 lists misspellings for which no correct uses of the misspelled word occurred.
It is thus questionable whether these misspelled words should be in the word list. Many are
archaic uses. A few suffixes were in our master word list, and would not appear in the
word list of a spell checker. Note the large number of errors due to omission of the apostrophe in contractions. The one "correct" use of momo was due to a reference in correspondence to the previous misspelling. (It ain't a momo, it's a memo.)
Table 2 consists of misspellings for which there were occurrences of correct uses.
Again a large number of misspellings are due to omission of apostrophe in contractions.
Several errors are due to word division. Note that for most words a significant number of
correct uses occurred. Thus there would be a real penalty in omitting these words from a
word list. For example, if coping were omitted from the word list, the 4 misspellings of
copying would be detected. However, the 22 correct uses of coping would be flagged as
misspelled.
In the second phase we listed all sentences that contained a word whose rank was between 50,000 and 60,000 and had a possible misspelling resulting in a word with rank less than 50,000. By doing this, we hoped to determine the incremental effect of adding 10,000 words, since words having no possible misspellings with rank less than 50,000 have no effect. Before eliminating words without potential misspellings, this would result in the consideration of 10,000 words. However, there was a very large grouping of words of identical rank just less than 50,000. Thus, only 881 words were considered, due to the grouping and misspelling restriction. In this phase the AI source contained 1,637,234 words, and the PC source contained 15,842,241 words.
As a slight modification we listed all sentences that contained a word whose rank was
between 50,000 and 60,000. This time, however, we corrected the oversight of not including
the group of words just less than rank 50,000, thus we considered 9806 words. In this phase
the AI source contained 2,569,975 words, and the PC source contained 15,661,973 words.
Since our restricted form of misspelling is symmetric (if word A can be misspelled as word
B, then B can be misspelled as A), this allows us to determine the effect of adding the next
10,000 words to a 50,000 word list.
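This symmetry can be checked mechanically: insertion and deletion are inverses of each other, and replacement and transposition are each their own inverse. A small sketch (the function is ours, not the paper's):

```python
def within_one_error(a, b):
    """True if b is reachable from a by exactly one insertion,
    deletion, replacement, or adjacent transposition."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    # advance past the common prefix
    i = 0
    while i < min(la, lb) and a[i] == b[i]:
        i += 1
    if la == lb:
        # replacement: the tails after the mismatch must agree
        if a[i+1:] == b[i+1:]:
            return True
        # transposition of the two letters at the mismatch
        return (i + 1 < la and a[i] == b[i+1] and a[i+1] == b[i]
                and a[i+2:] == b[i+2:])
    # insertion/deletion: skip one letter of the longer string
    longer, shorter = (a, b) if la > lb else (b, a)
    return longer[:i] + longer[i+1:] == shorter

# Symmetry: A one error away from B implies B one error away from A.
print(within_one_error("beat", "beta"), within_one_error("beta", "beat"))
```

The relation holding in both directions for every pair in Tables 2 and 3 (e.g. coping/copying, lets/let's) is what licenses reading the phase-two counts as the effect of growing the list from 50,000 to 60,000 words.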
The data appear in Tables 4 and 5. In the modified phase, we found 19 errors in a 22,663,427 word sample, and 3240 correct uses of the 9806 additional words. That is, adding the next 10,000 words to a 50,000 word dictionary eliminates 3240 incorrect misspelling flags, at the cost of missing 19 actual cases of misspelling. This is a ratio of
approximately 150 to 1. The data in Table 4 show 23 errors in a 21,910,954 word sample, and 1348 correct uses. To detect these 23 errors, an additional 1348 words would be flagged as incorrect, a ratio of approximately 50 to 1, even with the elimination of words without potential misspellings.
4. CONCLUSION
On the one hand, the number of words that are misspelled as real words (and thus not flagged by the spelling corrector) increases as the size of the word list grows. On the other hand, correct words that are incorrectly flagged as misspelled increase as the word list size shrinks. Petersen's [1] conclusions suggest that "word lists used in spelling programs should be kept small; a large word list is not necessarily a better word list." Our study challenges this conclusion. In a spelling corrector that employed a 50,000 word list, we found the resulting differential error rate was in one case 50, and in a second case 150, times greater than when a 60,000 word list was used.
Until content-based spelling correctors are available, a useful technique might be to flag infrequent words as potentially misspelled. These infrequent words could appear (appropriately noted) as possible corrections. For example, comportment would be flagged as potentially misspelled. The list of possible corrections would be comportment (low frequency) and compartment.
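The flagging scheme suggested here can be sketched directly. The function name, the 50,000 threshold, and the example ranks below are our illustrative assumptions, not the paper's data:

```python
def check(word, rank, rank_threshold=50000):
    """Classify a word for a frequency-aware spelling corrector.

    `rank` maps each correctly spelled word to its frequency rank;
    the threshold mirrors the word-list sizes studied above but is
    otherwise arbitrary.
    """
    if word not in rank:
        return "misspelled"
    if rank[word] > rank_threshold:
        return "possibly misspelled (low frequency)"
    return "ok"

# Illustrative ranks only: comportment is rare, compartment common.
rank = {"compartment": 9000, "comportment": 120000}
print(check("compartment", rank))   # ok
print(check("comportment", rank))   # possibly misspelled (low frequency)
print(check("comprtment", rank))    # misspelled
```

The middle case is the point of the suggestion: a rare word stays in the list, so its correct uses are not rejected outright, but it is surfaced to the user alongside any high-frequency near neighbors.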
Acknowledgements: James Griesmer and Warren Plath provided useful comments on an earlier draft of this paper.
REFERENCES
1. Petersen, J.L. A note on undetected typing errors. Communications of the ACM 29(7): 633-637, July 1986.
2. Damerau, F.J. A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3): 171-176, March 1964.
3. Mitton, R. Spelling checkers, spelling correctors, and the misspellings of poor spellers. Information Processing and Management 23(5): 495-505, 1987.
4. Heidorn, G.E., Jensen, K., Miller, L.A., Byrd, R.J., and Chodorow, M.S. The Epistle text-critiquing system. IBM Systems Journal 21(3): 305-326, 1982.
APPENDIX

In order to give a rough characterization of our word list, and also to reproduce part of Petersen's results, we computed the number of possible misspellings of words in our word list that result in another real word. The 257,528 words in our list have 490,656 possible misspellings as another real word. There are 141,618 words that can be misspelled as another word. Of the 490,656 misspellings, 85,747 are due to addition, 85,747 are due to deletion, 315,524 are due to replacement, and 3638 are due to transposition. Additionally, 41,000 of the misspellings can be accounted for by simply adding or deleting an "s" at the end of a word.
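These counts can be reproduced in outline for any word list. The sketch below classifies real-word collisions by operation; the six-word example list is ours, not the paper's 257,528-word list:

```python
import string
from collections import Counter

def real_word_misspellings(words, alphabet=string.ascii_lowercase):
    """Count, by error type, the single-error mistypings of each word
    in the list that collide with another word in the list."""
    vocab = set(words)
    counts = Counter()
    for w in vocab:
        adds = {w[:i] + c + w[i:] for i in range(len(w) + 1) for c in alphabet}
        dels = {w[:i] + w[i+1:] for i in range(len(w))}
        reps = {w[:i] + c + w[i+1:] for i in range(len(w)) for c in alphabet
                if c != w[i]}
        trans = {w[:i] + w[i+1] + w[i] + w[i+2:] for i in range(len(w) - 1)
                 if w[i] != w[i+1]}
        for name, cand in (("addition", adds), ("deletion", dels),
                           ("replacement", reps), ("transposition", trans)):
            counts[name] += len(cand & vocab - {w})
    return counts

# Tiny illustrative list, not the paper's data.
counts = real_word_misspellings(["beat", "beta", "bet", "bat",
                                 "coping", "copying"])
print(dict(counts))
```

Note that the addition and deletion counts necessarily come out equal, since every addition collision w -> x corresponds to a deletion collision x -> w; the paper's figures show the same symmetry (85,747 each).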
Table 6 is a statistical summary of the misspellings that were not due to misspelling as another real word, based on additional data collected during the first phase of our experiment. The statistics of the category "word division" should be treated with caution. The input text files were processed by a number of programs that may have treated line ends and hyphenation differently. Moreover, there is considerable disagreement as to whether certain frozen expressions should be run together, hyphenated, or left as individual words, e.g., "abovementioned."
Table 6.

                     PC BBS   AI BBS   Office Correspondence
Additional letter      34       15        7
Wrong letter           45       13        5
Missing letter         66       43        6
Transposition          30       12        9
Multiple error         16       13        8
Word division          86       40       54