Keywords: Natural language; Census; Edit-distance; Kullback–Leibler divergence; Phonetic Strategy; Spelling correction

Abstract

In order to assist companies dealing with data preparation problems, an approach is developed to handle dirty data. Cleaning customer records and producing the desired results require a set of effective tools and sequences, such as the near miss strategy, phonetic structure, and edit distance, to provide a suggestion table. The selection of the best match is verified and validated by the frequency of presence in the 20th century Census Bureau statistics. Although the conducted experiments resulted in better correction rates than the well-known ASPELL, JSpell HTML, and Ajax spell checkers, a remaining challenge is to introduce an estimate of a quality factor for our Personal Name Recognizing Strategy model (PNRS) to distinguish between submitted original names and the name suggestions produced by PNRS. Here, we implement a statistical distance metric as a quality measure by computing the Kullback–Leibler (K–L) distance. The K–L distance measures the distance between the probability density function of the original names and the probability density function of the names suggested by PNRS, to assess and validate to what degree our edit distance strategy has been successful in correcting names. All names submitted as inputs to the PNRS model were taken within a maximum edit distance of 2 with respect to the original name. The Kullback–Leibler distance thus serves as an indicator of name recognizing quality.

© 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.11.112
6308 C. Varol, C. Bayrak / Expert Systems with Applications 38 (2011) 6307–6312
is another challenge that needs to be considered. For these reasons, the Personal Name Recognizing Strategy (PNRS) is being developed and proposed to provide the closest match for misspelled names.

In this paper we look into quality model practices for the correction of misspelled names. The ultimate goal is to define similarity functions that match human perception; how humans judge the similarity among submitted uncorrected names, original names, and suggested names will be a topic of ongoing research.

2. Literature review

There are a number of requirements to be met by a quality model in order to build enough confidence that the model correctly captures quality requirements and reflects how well those requirements have been met (Khaddaj & Horgan, 2004). Quality is a multidimensional construct reflected in a model, where each parameter in the model defines a separate quality dimension. Many of the early quality models followed a hierarchical approach in which a set of factors that affect quality are defined with little scope for expansion (Boehm et al., 1973). Although an improvement, difficulties arise when comparing quality across projects, due to their tailored nature. QoS has been widely discussed in the areas of real-time applications (Clark, Shenker, & Zhang, 1992) and networking (Cruz, 1995; Georgiadis, Guerin, & Sivarajan, 1996). However, only a few research teams have made serious efforts to explore the field of workflow or business process automation. Crossflow (Klingemann, Wasch, & Anfaberer, 1999) and Meteor (Miller et al., 1999) are the leading projects in the field. Not only is the time dimension considered; cost-associated metrics were also designed in these projects. However, these applications are restricted to the workflow generation part of the automation. The projects did not consider the impact of spelling errors on the outcome of the workflow process automation. Therefore, the QoS attribute for this part of the system varies and supports enhanced business process automation.

The customer's expectation for misspelled data is hard to identify, as ill-defined data has different types and definitions. The main source of errors, known as the isolated-word error, is a spelling error that can be captured simply because it is mistyped or misspelled (Nerbonne, Heeringa, & Kleiweg, 1999). As the name suggests, isolated-word errors are invalid strings, properly identified and isolated as incorrect representations of a valid word (Becchetti & Ricotti, 1999). Typographic errors, also known as "fat fingering", are made by the accidental keying of a letter in place of another (Kukich, 1992). Cognitive errors are caused by a lack of knowledge on the part of the writer or typist (Kukich, 1992). Phonetic errors can be seen as a subset of cognitive errors; they are made when the writer substitutes letters they believe sound correct in a word, which in fact leads to a misspelling (Jurzik, 2006).

There are many isolated-word error correction applications, and these techniques treat the problem as three sub-problems handled in sequence: detection of an error, generation of candidate corrections, and ranking of candidate corrections (Kukich, 1992). In most cases there is only one correct spelling for a particular word. However, there are often several valid possible name combinations for a particular one, such as 'Aaron' and 'Erin'. Also, the use of nicknames in daily life, for instance 'Bob' rather than 'Robert', makes matching of personal names more challenging compared to general text. Many variations for approximate string matching have been developed (Gong & Chan, 2006; Pfreifer, Poersch, & Fuhr, 1996; Zobel & Dart, 1996). Although most of the techniques discussed in this paper are particularly designed for general text, some of them are used as name spelling correction algorithms as well. Two main categories are defined for the techniques used for isolated-word error correction below:

2.1. Pattern matching techniques

Pattern matching techniques are commonly used in approximate string matching (Hall & Dowling, 1980; Jokinen, Tarhio, & Ukkonen, 1996; Navarro, 2001), which is used for data linkage (Christen & Goiser, 2006; Winkler, 2006), duplicate detection (Cohen, Ravikumar, & Stephen, 2003), information retrieval (Gong & Chan, 2006) and correction of spelling errors (Kukich, 1992). The most common pattern matching technique, the edit distance, is defined as the smallest number of insertions, deletions, and substitutions required for changing one string into another (Levenshtein, 1965); it is calculated as the number of operations that need to be carried out to transform one string into the other. Minimum edit distance techniques have been applied to virtually all spelling correction tasks, including text editing and natural language interfaces. The spelling correction accuracy varies with applications and algorithms. Damerau (1990) reports a 95% correction rate for single-error misspellings for a test set of 964 misspellings of medium and long words (length 5 or more characters) while using a lexicon of 1593 words. However, his overall correction rate was 84% when multi-error misspellings were counted. On the other hand, Durhaiw, Lamb, and Nad Sax (1983) report an overall 27% correction rate for a very simple, fast, and plain single-error correction algorithm accessing a keyword lexicon of about 100 entries. Although the rates seem low, the authors report a high degree of user satisfaction for this command language interface application due to their algorithm's unobtrusiveness.

In recent work, Brill and Moore (2002) report experiments with modeling more powerful edit operations, allowing generic string-to-string edits. Moreover, additional heuristics are also used to complement techniques based on edit distance. For instance, in the case of typographic errors the keyboard layout is very important: it is much more common to accidentally substitute a key for another if they are placed near each other on the keyboard. Similar measures are used to compute a distance between DNA sequences (strings over {A, C, G, T}) or proteins. These measures are used to find genes or proteins that may have shared functions or properties and to infer family relationships and evolutionary trees over different organisms (Needleman & Wunsch, 1970). Other pattern matching algorithms are classified below.

Rule-based techniques attempt to use the knowledge gained from spelling error patterns and write heuristics that take advantage of this knowledge. For example, if it is known that many errors occur from the letters "ie" being typed "ei", then we may write a rule that represents this (Yannakoudakis & Fawthrop, 1983).

The character n-gram-based technique coincides with the character n-gram analysis in non-word detection. However, instead of observing certain bi-grams and trigrams of letters that never or rarely occur, this technique can calculate the likelihood of one character following another and use this information to find possible correct word candidates (Ullman, 1977).

The Longest Common Sub-String algorithm (Friedman & Sideli, 1992) finds the longest common sub-string in the two strings being compared. For example, the two name strings 'Martha' and 'Marhtas' have a longest common sub-string 'Marta'; the total length of the common sub-strings is 5 out of 6 and 7 characters. A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average length of the two original strings. As shown with the example above, this algorithm
is suitable for compound names that have words swapped. The time complexity of the algorithm, which is based on a dynamic programming approach [x], is O(|s1| |s2|) time using O(min(|s1|, |s2|)) space.

Jaro (Yancey, 2005) is an algorithm commonly used in data linkage systems. Recently this algorithm has also been used to measure distances between words or names. The Jaro distance metric states that given two strings s1 and s2, their distance dj is:

d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) \quad (1)

where m is the number of matching characters and t is the number of transpositions.

The Winkler (Yancey, 2005) algorithm improves over the Jaro algorithm by applying ideas based on empirical studies which found that fewer errors typically occur at the beginning of names. The Winkler algorithm therefore increases the Jaro similarity measure for agreeing initial characters (up to four).

Naturally, n-grams can be used to calculate probabilities, and this has led to the probabilistic techniques demonstrated by Lee (1999). In particular, transition probabilities can be trained using n-grams from a large corpus, and these n-grams can then represent the likelihood of one character following another. Confusion probabilities state the likelihood of one character being mistaken for another. Neural net techniques have emerged as likely candidates for spelling correctors due to their ability to do associative recall based on incomplete and noisy data. This means that they are trained on the spelling errors themselves and carry the ability to adapt to the specific spelling error patterns that they are trained upon (Trenkle & Vogt, 1994).

2.2. Phonetic techniques

All phonetic encoding techniques attempt to convert a name string into a code according to the way a name is pronounced. Therefore, this process is language dependent. Most of the designed techniques have been developed based on English phonetic structure. However, several techniques have been designed for other languages as well (Christen, 2006).

SOUNDEX (Philips, 1990), which is used to correct phonetic spellings, maps a string into a key consisting of its first letter followed by a sequence of digits. All vowels, 'h', 'w' and 'y' are removed from the sequences to get the Soundex code of a word. It then takes the remaining English word and produces a four-digit representation (shorter codes are extended with zeros), which is a primitive way to preserve the salient features of the phonetic pronunciation of the word. A major problem with Soundex is that it keeps the first letter, thus any error at the beginning of a name will result in a different Soundex code.

Phonex (Lait & Randell, 1993) tries to improve the encoding quality by pre-processing names according to their English pronunciation before the encoding. All trailing 's' are removed and various rules are applied to the leading part of a name (for example 'kn' is replaced with 'n', and 'wr' with 'r'). Like Soundex, the leading letter of the transformed name string is kept and the remainder is encoded with numbers (1 letter, 3 digits).

The Phonix algorithm is an improvement over Phonex and applies more than one hundred transformation rules on groups of letters (Gadd, 1990). Some of these rules are limited to the beginning of a name, some to the end, others to the middle, and some are applied anywhere.

The Metaphone algorithm is also a system for transforming words into codes based on phonetic properties (Philips, 1990). Unlike Soundex, which operates on a letter-by-letter scheme, Metaphone analyzes both single consonants and groups of letters called diphthongs according to a set of rules for grouping consonants, and then maps groups to Metaphone codes.

The closest to our approach, syllable alignment distance (Gong & Chan, 2006), is based on the idea of matching two names syllable by syllable, rather than character by character. It uses Phonix transformations as a preprocessing step and then applies rules to find the beginning of syllables. The distance between two strings of syllables is then calculated using an edit distance based approach.

3. Methodology

3.1. Personal Name Recognizing Strategy

The Personal Name Recognizing Strategy (PNRS) (Varol, Bayrak, Wagner, & Goff, 2010) is based on the results of a number of strategies that are combined in order to provide the closest match (Fig. 1). The near miss strategy and the phonetic strategy are used to provide suggestions at the same time and with identical weight; since the input data heavily involves names of possibly international scope, it is difficult to standardize the phonetic equivalents of certain letters. Once we have a list of suggestions, an edit-distance algorithm is used to rank the results in the pool. In order to provide meaningful suggestions, the threshold value t is defined as 2. In cases where the first and last characters of a word did not match, we modified our approach to include an extra edit distance. The main idea behind this approach is that people generally get the first and last characters correct when trying to spell a word. Another decision mechanism is needed to suggest the best possible solution for individual names. The rationale behind this idea is that at the final stage there may be several candidate names for a mistyped word that are one or two edit distances from the original word. Relying on edit distance alone often does not provide the desired result. The US Census Bureau study, which has compiled a list of popular first and last names scored by the frequency of those names within the United States (Census, 1990), is added as the decision making mechanism. This allows the tool to choose a "best fit" suggestion and eliminates the need for user interaction. The strategies are applied to a custom dictionary which is designed particularly for the workflow automation.

Fig. 1. PNRS Strategy.

3.1.1. Correction rates of PNRS

The experimental data is a real-life sample which involves personal and associated company names, and addresses including zip
code, city, state, and phone numbers of individuals. Dirty data is present in a total of 4000 records, including misspelled names and some non-ASCII characters. 2,648 records were correctly fixed by PNRS, 119 records were identified as valid names, while 1233 records were corrected but produced different names, as reflected in Fig. 2 and Table 1.

Fixed | Matched → Exact correction of the misspelled name.
Fixed | No Match → Corrections that provide no match with the original name.
Match | No Match → Either the input is accepted as a valid name or the system failed to provide any suggestions.

Table 2
Minimum and maximum correction rates of the algorithms.

Strategy                    Minimum and maximum fixed | Matched percentage for test case
Soundex                     49.9–52.2%
Phonex                      48.7–51.3%
Phonix                      48.8–49.1%
DMetaphone                  51.9–52.6%
Levenshtein edit-distance   41.2–55.1%
LCS                         57.7–64.3%
Jaro–Winkler                64.4–65.6%
2-Grams                     57.2–61.5%
3-Grams                     52.2–58.3%
PNRS                        66.2%
Fig. 3. The K–L distance role in name retrieval system architecture for estimation of quality of service.
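The quality measure that Fig. 3 attaches to the name retrieval pipeline is the discrete Kullback–Leibler divergence. A minimal sketch of that computation follows; the two toy distributions are invented for illustration and are not the paper's data.

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D(p || q) in nats.
    Terms with p_i == 0 contribute nothing; each q_i must be > 0
    wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented example: probability mass at edit distances 1 and 2 for the
# original query names (p) versus the PNRS suggestions (q).
p = [0.7, 0.3]
q = [0.6, 0.4]
print(round(kl_divergence(p, q), 4))  # 0.0216
```

Note that D(p ‖ q) is not symmetric; Eq. (4) fixes the direction as D(p^(q) ‖ p^(PNRS)), and the divergence over independent components (edit distance and Census frequency) is the sum of the component divergences.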
D\left(p^{(q)} \,\middle\|\, p^{(PNRS)}\right) = \sum_{i=1}^{2} \sum_{k=1}^{1233} p^{q}_{ki} \log \frac{p^{q}_{ki}}{p^{PNRS}_{ki}} = \sum_{i=1}^{2}\left(\sum_{k=1}^{1233} p^{q}_{ki} \log p^{q}_{ki} - \sum_{k=1}^{1233} p^{q}_{ki} \log p^{PNRS}_{ki}\right) \quad (4)

[Fig. 4 panels: (a) histograms of Census frequency for the 1233 original names and the PNRS suggestions; (b) histograms of edit distance (1 and 2) for the suggested and original names.]
Fig. 4. (a) The histogram of the PNRS and original records, (b) histogram of edit distances of the PNRS and original records, (c) probability density functions of the PNRS and original records.

For example, "Jason" comes from the query model p(X; θ_q), which generates all query names x = (x1, x2, . . . , xL) ∈ X from Census data. The introduced K–L model is developed from the probability distributions of edit distance and frequency of usage of the names. We also introduce the PNRS Distance Metric (PNRSDM) to calculate
the similarity between two names based on the character difference.

PNRSDM = \frac{1}{n}\sum_{i=1}^{2}\left\{\frac{LCS}{|S_i|}\right\}, \quad n = 2, \text{ if } t_i = 0 \quad (5)

PNRSDM = \frac{1}{n}\sum_{i=1}^{2}\left\{\frac{LCS}{|S_i|} + \frac{|S_i| - t_i}{|S_i|}\right\}, \quad n = 4, \text{ if } t_i > 0 \quad (6)

where LCS is the length of the longest string (or strings) that is a substring (or are substrings) of two or more strings, |S_i| is the number of characters in the ith element, and t_i is the number of transpositions in the ith element. PNRSDM is a normalized similarity measure between 1.0 (strings are the same) and 0.0 (strings are totally different).

4. Discussions and conclusion

In order to evaluate the PNRS and the statistical distance metrics, and to understand how close we are to the "Fixed" names from our No Match names list, the 4,000 misspelled names mentioned above are used. PNRS corrected 66.2% of the names, performing better than the other techniques mentioned. On the other hand, the distance between the original names and the suggested names is defined as a joint KLD. We found that the joint KLD is simply the sum of the edit distance and Census PNRS frequency KLDs acquired from each independent data set. We obtained KL = 4.9195 under the histogram of Census PNRS frequency (Fig. 4(a)) and the histogram of edit distances for suggested names against the original query names (Fig. 4(b)). Moreover, probability density functions for each vector are presented in Fig. 4(c). The PNRS Distance Metric averaged 0.483927 over the 1233 records.

Acknowledgments

The authors extend recognition to Dr. Rustu Murat Demirer for being instrumental during the implementation stage.

References

Becchetti, C., & Ricotti, L. P. (1999). Speech recognition: Theory and C++ implementation. John Wiley & Sons.
Bilgi, B. (2003). Using Kullback–Leibler distance for text categorization. In Proceedings of the ECIR-2003. Lecture notes in computer science (Vol. 2633, pp. 305–319). Springer-Verlag.
Boehm, B. W., Brown, J. R., Kaspar, H., Lipow, M., Macleod, G., & Merritt, M. J. (1973). Characteristics of software quality. TRW Software Series TRW-SS-73-09.
Brill, E., & Moore, R. C. (2002). An improved error model for noisy channel spelling correction. In Proceedings of ACL-2000, the 38th annual meeting of the Association for Computational Linguistics (pp. 286–293).
Census Bureau Home Page (1990). <www.cencus.gov>.
Christen, P. (2006). A comparison of personal name matching: Techniques and practical issues. In ICDM Workshops 2006 (pp. 290–294).
Christen, P., Churches, T., & Hegland, M. (2004). Febrl – A parallel open source data linkage system. In PAKDD, Springer LNAI 3056, Sydney (pp. 638–647).
Christen, P., & Goiser, K. (2006). Quality and complexity measures for data linkage and deduplication. In F. Guillet & H. Hamilton (Eds.), Quality measures in data mining. Studies in computational intelligence. Springer.
Clark, D., Shenker, S., & Zhang, L. (1992). Supporting real-time applications in an integrated services packet network: Architecture and mechanism. In Proceedings of ACM SIGCOMM (pp. 14–26).
Cohen, W. W., Ravikumar, P., & Stephen, E. F. (2003). A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-03 workshop on information integration on the web, Acapulco (pp. 73–78).
Cruz, R. L. (1995). Quality of service guarantees in virtual circuit switched networks. IEEE Journal of Selected Areas in Communications, 13(6), 1048–1056.
Damerau, F. J. (1990). Evaluating computer generated domain-oriented vocabularies. Information Processing and Management, 26, 791–801.
Durhaiw, I., Lamb, D. A., & Nad Sax, J. B. (1983). Spelling correction in user interfaces. ACM, 26, 764–773.
Friedman, C., & Sideli, R. (1992). Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25, 486–509.
Gadd, T. (1990). PHONIX: The algorithm. Program: Automated Library and Information Systems, 24(4), 363–366.
Georgiadis, L. R., Guerin, V. P., & Sivarajan, K. (1996). Efficient network QoS provisioning based on per node traffic shaping. IEEE/ACM Transactions on Networking, 4(4), 482–501.
Gong, R., & Chan, T. K. (2006). Syllable alignment: A novel model for phonetic string search. IEICE Transactions on Information and Systems, E89-D(1), 332–339.
Hall, A., & Dowling, G. R. (1980). Approximate string matching. ACM Computing Surveys, 12(4), 381–402.
Jokinen, P., Tarhio, J., & Ukkonen, A. (1996). A comparison of approximate string matching algorithms. Software – Practice and Experience, 26(12), 1439–1458.
Jurzik, H. (2006). The Ispell and Aspell command line spellcheckers. Linux Magazine, issue 85, 63–66.
Khaddaj, S., & Horgan, G. (2004). The evaluation of software quality factors in very large information systems. Electronic Journal of Information Systems Evaluation, 7(1), 43–48.
Klingemann, J., Wasch, J., & Anfaberer, K. (1999). Deriving service models in cross-organizational workflows. In Proceedings of RIDE – Information technology for virtual enterprises (RIDE-VE '99), Sydney, Australia (pp. 100–107).
Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4).
Lait, A., & Randell, B. (1993). An assessment of name matching algorithms. Technical report, Department of Computer Science, University of Newcastle upon Tyne.
Lee, L. (1999). Measures of distributional similarity. In Proceedings of the 37th annual meeting of the ACL.
Levenshtein, V. I. (1965). Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR, 163, 845–848. Also (1966) Soviet Physics Doklady, 10, 707–710.
Miller, J. A., Fan, M., Wu, S., Arpinar, I. B., Sheth, A. P., & Kochut, K. J. (1999). Security for the Meteor workflow management system. UGA-CS-LDIS Technical Report, University of Georgia.
Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33(1), 31–88.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.
Nerbonne, J., Heeringa, W., & Kleiweg, P. (1999). Edit distance and dialect proximity. In D. Sanko & J. Kruskal (Eds.), Time warps, string edits and macromolecules: The theory and practice of sequence comparison (2nd ed., pp. v–xv). Stanford: CSLI.
Pfreifer, U., Poersch, T., & Fuhr, N. (1996). Retrieval effectiveness of proper name search methods. Information Processing and Management, 32(6), 667–679.
Philips, L. (1990). Hanging on the metaphone. Computer Language, 7(12), 39–43.
Trenkle, J. M., & Vogt, R. C. (1994). Disambiguation and spelling correction for a neural network based character recognition system. In Proceedings of SPIE (Vol. 2181, pp. 322–333).
Ullman, J. R. (1977). A binary n-gram technique for automatic correction of substitution, deletion, insertion, and reversal errors in words. Computer Journal, 20(2), 141–147.
Varol, C., & Bayrak, C. (2005). Applied software engineering education. In ITHET 2005, July 6–9, Santo Domingo, Dominican Republic.
Varol, C., Bayrak, C., & Ludwing, R. (2005). Application of software engineering fundamentals: A hands-on experience. In The 2005 international conference on software engineering research and practice, June 27–30, Las Vegas, Nevada, USA.
Varol, C., Bayrak, C., Wagner, R., & Goff, D. (2010). Application of near miss strategy and edit distance to handle dirty data. In Data engineering: Mining, information, and intelligence. Springer.
Winkler, W. E. (2006). Overview of record linkage and current research directions. Technical Report RR2006/02, US Bureau of the Census.
Yancey, W. E. (2005). Evaluating string comparator performance for record linkage. Technical Report RR2005/05, US Bureau of the Census.
Yannakoudakis, E. J., & Fawthrop, D. (1983). The rules of spelling errors. Information Processing and Management, 19(2), 87–99.
Zobel, J., & Dart, P. (1996). Phonetic string matching: Lessons from information retrieval. In Proceedings of ACM SIGIR, Zurich, Switzerland (pp. 166–172).