
Information Processing & Management Vol. 25, No. 6, pp. 659-664, 1989
0306-4573/89 $3.00 + .00
Printed in Great Britain. Copyright © 1989 Pergamon Press plc

AN EXAMINATION OF UNDETECTED TYPING ERRORS

FRED J. DAMERAU and ERIC MAYS


Mathematical Sciences Department, IBM Thomas J. Watson Research Center,
P.O. Box 218, Yorktown Heights, NY 10598

(Received 10 August 1988; accepted in final form 10 October 1988)

Abstract-We examine the effect of increasing word list size on the error rate of spelling correctors. An experiment on a large body of text shows that an increase in the word list size decreases the error rate.

1. INTRODUCTION

A typical spelling correction program detects spelling errors by searching through a (virtual)* list of correctly spelled words. A word is determined to be spelled correctly only when it occurs in the list. Since some simple misspellings of words result in correct words, this technique will fail to recognize certain incorrectly spelled words. In the sentence, The Mets beta the Red Sox to win the 1986 World Series, the simple misspelling of beat as beta will go undetected. Peterson [1] studied the number of possible typing errors that would go undetected as the size of the list of correctly spelled words increases. As more words are added to the list, fewer correctly spelled words will be flagged as misspellings, but more incorrectly spelled words will pass as correct words. If a word list did not contain beta, the misspelling of beat would be detected, but any correct uses of beta would be flagged as misspelled. Since the possible undetected typing errors as a fraction of the possible typing errors range from 10 percent for a 50,000 word list to 15 percent for a 350,000 word list, there is a potential problem in using large word lists for spelling correction [1]. In this paper, we report the results of an investigation into the actual occurrence of these possible errors in a large body of text.
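The lookup scheme just described amounts to a set-membership test. The following sketch is our illustration, not code from the study; the tiny word list is invented. It shows how the real-word error beta slips through:

```python
# Minimal word-list spell check: a word is accepted iff it appears
# in the list, so a real-word error like "beta" for "beat" passes.
word_list = {"the", "mets", "beat", "beta", "red", "sox", "to",
             "win", "1986", "world", "series"}

def flagged(sentence, words=word_list):
    """Return the tokens of `sentence` not found in the word list."""
    return [t for t in sentence.lower().split() if t not in words]

print(flagged("The Mets beta the Red Sox to win the 1986 World Series"))
# -> [] : the misspelling "beta" goes undetected
print(flagged("Teh Mets beat the Red Sox"))
# -> ['teh'] : a non-word error is caught
```

The same mechanism explains the trade-off in the paragraph above: removing beta from the list would catch the error, at the price of flagging every legitimate use of beta.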
We hypothesize that the number of possible undetected errors is significantly higher than their actual occurrence; that is, the kinds of errors actually made by people are governed by some principle that differs significantly from random behavior. We performed a two-phase experiment on a sample of over 20 million words of text. The first phase listed a sample of sentences that contained a word of rank** greater than 5000. The second phase listed all sentences that contained a word whose rank was between 50,000 and 60,000 and had a misspelling with rank below 50,000. A slightly modified second phase listed all sentences containing a word whose rank was between 50,000 and 60,000. (In our text sample, the words below rank 5000 accounted for 88 percent of the words in the text, and the words below rank 60,000 accounted for 99.99 percent.) These sentence lists were then manually examined to find actual misspellings. We believe the hypothesis was verified.
For the purposes of this study (and Peterson's study [1]), a misspelling is more precisely a mistyping as defined by Damerau [2]. A word may be mistyped as a result of exactly one of the following mistakes:

• Adding an extra letter, e.g., mistyping the as thea
• Deleting a letter, e.g., the as th
• Replacing a letter, e.g., the as ahe
• Transposing two adjacent letters, e.g., the as hte

*Some implementations may do morphological analysis or employ compact encodings, but the actual data
structure is irrelevant.
**The rank of a word is its position in a list of words sorted by descending frequency.
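The four mistake classes can be enumerated mechanically. As an illustration (ours, not the paper's), the following sketch generates every string exactly one such operation away from a given word, assuming a lowercase alphabet:

```python
import string

def single_error_variants(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by exactly one operation:
    insertion, deletion, replacement, or adjacent transposition."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {a + c + b for a, b in splits for c in alphabet}
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b
                for c in alphabet if c != b[0]}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return inserts | deletes | replaces | transposes

variants = single_error_variants("the")
# The four examples above: thea (insert), th (delete),
# ahe (replace), hte (transpose).
print("thea" in variants, "th" in variants,
      "ahe" in variants, "hte" in variants)  # -> True True True True
```

Intersecting such a variant set with a word list yields exactly the "undetectable" real-word errors the paper is concerned with.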


Errors of this kind typically account for over 80 percent of all errors [2], [1], although special circumstances may show different statistics. A recent paper by Mitton [3] presents data from English school compositions showing quite different percentages. The data come from students, many of whom appear to have only a sketchy idea of correct spelling. It is clear from that paper that standard spelling correction algorithms would be of little use to this population. We assume that commercially important applications do not have this characteristic.

2. SOURCES

The following sources were used to build our list of correctly spelled words.

• IBM's Proof spelling corrector
• Webster's 7th New Collegiate Dictionary
• Webster's 2nd International Dictionary
• An English-French translation dictionary
• An English-German translation dictionary
• The Epistle [4] dictionary
• Longman's Dictionary of Contemporary English

From these lists we eliminated possessives (words ending in ‘s or s’), and proper nouns
(words containing capital letters). The resulting merged word list contained 257,528 words,
all presumed correctly spelled.
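The filtering step described above can be expressed as a small predicate; this is our sketch, and the sample entries are invented:

```python
def keep(entry):
    """Drop possessives (entries ending in 's or s') and proper nouns
    (entries containing a capital letter), as in the merge step."""
    if entry.endswith("'s") or entry.endswith("s'"):
        return False
    if any(ch.isupper() for ch in entry):
        return False
    return True

sample = ["cat", "cat's", "cats'", "Boston", "coping"]
print([w for w in sample if keep(w)])  # -> ['cat', 'coping']
```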
The following sources were used both to determine word rank and as a source of
potential misspellings.

• Transcripts of the proceedings of the Canadian Parliament. (Canada in tables.)
• Texts of novels and magazines. (Novels in tables.) This source was used only to determine word rank, since it had been extensively proofread.
• An IBM internal bulletin board about the IBM PC. (PC in tables.)
• Arpanet bulletin boards about Artificial Intelligence, Natural Language, and Prolog. (AI in tables.)
• IBM business office correspondence. (Office in tables.)

Table 1 lists for each text source the number of words (Words heading), the number of unique words (Unique heading), and the number of unique words that appeared in the merged word list (Unique verified heading). The number of words in the IBM PC and Arpanet AI samples varied among the phases of our experiment; the actual number of words in each phase is noted accordingly. Note that we had no control over the amount of proofreading or automatic spell checking that was applied to any of the sources. However, our informal queries of people who submit to the bulletin boards leave us with the impression that very little proofreading or spell checking occurs.
Table 1. Text sources

Source     Words       Unique   Unique verified

Novels     7,687,288   63,290   53,141
Office     1,178,617   14,389   12,636
AI         2,032,240   30,524   21,947
Canada     3,252,862   25,666   23,507
PC         6,528,181   43,109   26,513

The combined sources contained 61,519 unique words occurring in the list of verified words. We collected word frequency data and assigned a word rank for each source independently. Since some sources contained words ranking high only within that source, a method for removing that bias was needed to obtain a reasonable combined ranking. If each of the ranks differed from the mean by no more than the mean raised to the power 0.85, the mean was used. Otherwise, if a single rank caused the mean to fall outside the 90 percent confidence level using a Student's t distribution, that rank was not used in computing the mean.
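The combining rule is stated tersely; the sketch below is our reading of it, with hardcoded two-sided t critical values at the 90 percent level and a simplified leave-one-out outlier test. Treat it as illustrative, not as a reconstruction of the authors' code; the sample ranks are invented.

```python
from statistics import mean, stdev

# Two-sided Student's t critical values at the 90% confidence level,
# keyed by degrees of freedom (enough for up to five text sources).
T90 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132}

def combined_rank(ranks):
    """Combine per-source ranks (a sketch of our reading of the rule):
    use the plain mean when every rank lies within mean**0.85 of it;
    otherwise drop a single rank if, without it, the full mean falls
    outside the 90% confidence interval of the remaining ranks."""
    m = mean(ranks)
    if all(abs(r - m) <= m ** 0.85 for r in ranks):
        return m
    for i in range(len(ranks)):
        rest = ranks[:i] + ranks[i + 1:]
        if len(rest) < 2:
            continue
        half = T90[len(rest) - 1] * stdev(rest) / len(rest) ** 0.5
        if abs(m - mean(rest)) > half:   # this rank alone pulls the
            return mean(rest)            # mean outside the interval
    return m

print(combined_rank([5100, 5300, 5200, 5250]))   # agreeing ranks: mean
print(combined_rank([5100, 5300, 5200, 55000]))  # outlier dropped
```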

3. EXPERIMENT

In the first phase of the experiment we sampled sentences that contained a word with a rank greater than 5000. This sample consisted of 14,917 sentences from office correspondence, 17,074 sentences from the Arpanet, 11,567 sentences from the Canadian Parliament, and 22,585 sentences from the PC bulletin board. Based on a manual examination of the data, we compiled a list of the incorrect spellings. All of the sentences containing these misspelled words were then exhaustively listed from the sources. For this phase the AI source contained 2,418,279 words and the PC source contained 9,010,623 words.
The result of the examination of these sentences for incorrect spellings is found in Tables 2 and 3. The tables list the misspelling (Typo heading), the intended word (Word heading), and, by each source, the number of correct uses (C heading) and total uses (T heading). An extra space in a word in the Typo column is indicated by an equal sign.

Table 2. First phase data: No correct uses

Office AI Canada PC

Typo Word C T C T C T C T

aam A.M. 0 2
adz ads 0 1
ait it 0 1
cant can't 0 1 0 8 0 68
charied chaired 0 2
col=lections collections 0 2 0 4
dont don't 0 18 0 156
fip flip 0 1
firs first 0 1 0 1
havent haven't 0 1 0 19
i=ll I'll 0 1
lier liar 0 1 0 5
momo memo 1* 2
od do 0 1 0 1
Pease please 0 2
quires requires 0 1
sill still 0 1
stree street 0 1
te the 0 1 0 2 0 4 0 3
tempi tempo 0 1
sixteen=th sixteenth 0 1
th the 0 8 0 2 0 9 0 44
wi=th with 0 6 0 1
thats that's 0 9 ?
untill until 0 12
whar what 0 1
whats what's 0 1 ?

Table 3. First phase data: Some correct uses

Office AI Canada PC

Typo Word C T C T C T C T

origin=ally originally 0 1 1 1
ant and 6 6 0 1 1 3
coping copying 9 9 7 7 6 10
dependant dependent 0 3 1 2 0 52
discrete discreet 1 2 2 4
equip=ment equipment 3 5 4 4 13 13 2 2
lets let 0 1
lets let's 0 12 33 44 1 8 299 368
wont won't 1 3 2 2 1 35
Table 2 lists misspellings for which no correct uses of the misspelled word occurred. It is thus questionable whether these misspelled words should be in the word list. Many are archaic uses. A few suffixes were in our master word list, and would not appear in the word list of a spell checker. Note the large number of errors due to omission of the apostrophe in contractions. The one "correct" use of momo was due to a reference in correspondence to the previous misspelling. (It ain't a momo, it's a memo.)

Table 3 consists of misspellings for which there were occurrences of correct uses. Again a large number of misspellings are due to omission of the apostrophe in contractions. Several errors are due to word division. Note that for most words a significant number of correct uses occurred. Thus there would be a real penalty in omitting these words from a word list. For example, if coping were omitted from the word list, the 4 misspellings of copying would be detected. However, the 22 correct uses of coping would be flagged as misspelled.
In the second phase we listed all sentences that contained a word whose rank was between 50,000 and 60,000 and had a possible misspelling resulting in a word with rank less than 50,000. By doing this, we hoped to determine the incremental effect of adding 10,000 words, since words having no possible misspellings with rank less than 50,000 have no effect. Before eliminating words without potential misspellings, this would result in the consideration of 10,000 words. However, there was a very large grouping of words of identical rank just less than 50,000; thus, only 881 words were considered, due to the grouping and misspelling restriction. In this phase the AI source contained 1,637,234 words, and the PC source contained 15,842,241 words.

As a slight modification we listed all sentences that contained a word whose rank was between 50,000 and 60,000. This time, however, we corrected the oversight of not including the group of words just less than rank 50,000; thus we considered 9806 words. In this phase the AI source contained 2,569,975 words, and the PC source contained 15,661,973 words. Since our restricted form of misspelling is symmetric (if word A can be misspelled as word B, then B can be misspelled as A), this allows us to determine the effect of adding the next 10,000 words to a 50,000 word list.
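The selection step can be made concrete: under the single-error relation, finding the rank 50,000-60,000 words that collide with a commoner word is a straightforward scan. This sketch is ours, with an invented toy ranking standing in for the corpus ranking:

```python
import string

def variants(word, alphabet=string.ascii_lowercase):
    """Strings one insertion, deletion, replacement, or adjacent
    transposition away from `word` (the symmetric relation above)."""
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return ({a + c + b for a, b in s for c in alphabet}
            | {a + b[1:] for a, b in s if b}
            | {a + c + b[1:] for a, b in s if b
               for c in alphabet if c != b[0]}
            | {a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1})

def risky_additions(rank, low=50_000, high=60_000):
    """Words ranked in [low, high) that have a single-error variant
    ranked below `low`, i.e. candidates for real-word errors."""
    return [w for w, r in rank.items()
            if low <= r < high and any(rank.get(v, high) < low
                                       for v in variants(w))]

# Hypothetical ranks: "yor" sits in the 50,000-60,000 band and is one
# insertion away from the common word "your"; "zyx" has no such neighbor.
toy_rank = {"your": 120, "the": 1, "yor": 51_000, "zyx": 52_000}
print(risky_additions(toy_rank))  # -> ['yor']
```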
The data appear in Tables 4 and 5. In the modified phase, we found 19 errors in a 22,663,427 word sample, and 3240 correct uses of the 9806 additional words. That is, adding the next 10,000 words to a 50,000 word dictionary eliminates 3240 incorrect misspelling flags, at the cost of missing 19 actual cases of misspelling, a ratio of approximately 150 to 1. The data in Table 4 show 23 errors in a 21,910,954 word sample, and 1348 correct uses. To detect these 23 errors, an additional 1348 words would be flagged as incorrect, a ratio of approximately 50 to 1, even with the elimination of words without potential misspellings.

Table 4. Second phase data

Office (202 words occurred):
yor -> your
AI (240 words occurred):
stong -> strong
Canada (715 words occurred):
mantal -> mental
merk -> mark
neigher -> neither
nutritions -> nutritious
parfy -> party
repot -> report
resue -> rescue
speel -> spell
substraction -> subtraction
teel -> feel
throuch -> through
trike -> strike
PC (214 words occurred):
cert=ain -> certain
comportment -> compartment
mutch -> much
noy -> not (2)
stong -> strong (2)
yor -> your (2)

Table 5. Second phase data, modified

Office (361 words occurred):
yor -> your
AI (1168 words occurred):
stong -> strong
Canada (655 words occurred):
PC (1075 words occurred):
bur -> but
hows -> how's (2)
lown -> loan
mutch -> much
sone -> some (4)
stong -> strong (2)
strang -> strange (2)
tapa -> tape
toted -> touted
yam -> am
yor -> your

4. CONCLUSION

On the one hand, the number of words that are misspelled as real words (and thus not flagged by the spelling corrector) increases as the size of the word list grows. On the other hand, the number of correct words that are incorrectly flagged as misspelled increases as the word list shrinks. In Peterson's [1] conclusions it is suggested that "word lists used in spelling programs should be kept small; a large word list is not necessarily a better word list." Our study challenges this conclusion: compared with a 60,000 word list, a spelling corrector employing a 50,000 word list produced a differential error rate 50 times greater in one case and 150 times greater in a second.
Until content based spelling correctors are available, a useful technique might be to
flag infrequent words as potentially misspelled. These infrequent words could appear (ap-
propriately noted) as possible corrections. For example, comportment would be flagged
as potentially misspelled. The list of possible corrections would be comportment (low fre-
quency) and compartment.
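The proposed heuristic might look like the following sketch (ours; the toy ranks and the threshold value are illustrative, not from the paper):

```python
def check(token, rank, threshold=50_000):
    """Sketch of the proposed heuristic: accept common words outright,
    but flag rare ones as *potentially* misspelled instead of simply
    accepting them, so they surface among the candidate corrections."""
    r = rank.get(token)
    if r is None:
        return "not in word list"
    if r > threshold:
        return "potentially misspelled (low frequency)"
    return "ok"

toy_rank = {"compartment": 12_000, "comportment": 88_000}
print(check("comportment", toy_rank))
# -> potentially misspelled (low frequency)
```

A corrector using this check would then offer both comportment (annotated as low frequency) and compartment as candidates, as described above.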

Acknowledgements: James Griesmer and Warren Plath provided useful comments on an earlier draft of this paper.

REFERENCES

1. Peterson, J.L. A note on undetected typing errors. Communications of the ACM 29(7): 633-637, July 1986.
2. Damerau, F.J. A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3): 171-176, March 1964.
3. Mitton, R. Spelling checkers, spelling correctors, and the misspellings of poor spellers. Information Processing and Management 23(5): 495-505, 1987.
4. Heidorn, G.E., Jensen, K., Miller, L.A., Byrd, R.J., and Chodorow, M.S. The Epistle text-critiquing system. IBM Systems Journal 21(3): 305-326, 1982.

APPENDIX A: CHARACTERIZATION OF POSSIBLE MISSPELLINGS

In order to give a rough characterization of our word list, and also to reproduce part of Peterson's results, we computed the number of possible misspellings of words in our word list that resulted in another real word. The 257,528 words in our list have 490,656 possible misspellings as another real word. There are 141,618 words that can be misspelled as another word. Of the 490,656 misspellings, 85,747 are due to addition, 85,747 are due to deletion, 315,524 are due to replacement, and 3638 are due to transposition. Additionally, 41,000 of the misspellings can be accounted for by simply adding or deleting an "s" at the end of a word.
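A count of this kind can be reproduced on any word list. The sketch below is ours, run on an invented five-word list rather than the 257,528-word master list; it tallies distinct single-error collisions by operation type. Note that the addition and deletion counts necessarily come out equal, since each (word, misspelling) insertion pair is a deletion pair read the other way round, which mirrors the equal 85,747 figures above.

```python
import string

def collision_counts(words, alphabet=string.ascii_lowercase):
    """Count, per operation type, the distinct single-error misspellings
    of each list word that land on another word in the list."""
    vocab = set(words)
    counts = {"add": 0, "delete": 0, "replace": 0, "transpose": 0}
    for w in vocab:
        s = [(w[:i], w[i:]) for i in range(len(w) + 1)]
        adds = {a + c + b for a, b in s for c in alphabet}
        dels = {a + b[1:] for a, b in s if b}
        reps = {a + c + b[1:] for a, b in s if b
                for c in alphabet if c != b[0]}
        trans = {a + b[1] + b[0] + b[2:] for a, b in s
                 if len(b) > 1 and b[0] != b[1]}
        counts["add"] += len(adds & vocab)
        counts["delete"] += len(dels & vocab)
        counts["replace"] += len(reps & vocab)
        counts["transpose"] += len(trans & vocab)
    return counts

print(collision_counts(["beat", "beta", "bet", "beats", "bets"]))
# -> {'add': 5, 'delete': 5, 'replace': 2, 'transpose': 2}
```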

APPENDIX B: MISSPELLING DATA

Table 6 is a statistical summary of the misspellings that were not due to misspelling as another real word, based on additional data collected during the first phase of our experiment. The statistics of the category "word division" should be treated with caution. The input text files were processed by a number of programs that may have treated line ends and hyphenation differently. Moreover, there is considerable disagreement as to whether certain frozen expressions should be run together, hyphenated, or left as individual words, e.g., "abovementioned."

Table 6. Errors in words not in master word list

                    PC BBS   AI BBS   Office correspondence

Additional letter     34       15       7
Wrong letter          45       13       5
Missing letter        66       43       6
Transposition         30       12       9
Multiple error        16       13       8
Word division         86       40      54
