
Expert Systems with Applications 38 (2011) 6307–6312

Contents lists available at ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

Estimation of quality of service in spelling correction using Kullback–Leibler divergence

Cihan Varol a,*, Coskun Bayrak b

a Computer Science Department, Sam Houston State University, 1903 Ave. I, Huntsville, TX 77341, USA
b Computer Science Department, University of Arkansas at Little Rock, 2801 S. University Ave., Little Rock, AR 72212, USA

Keywords: Natural language; Census; Edit-distance; Kullback–Leibler divergence; Phonetic strategy; Spelling correction

Abstract

In order to assist companies dealing with data preparation problems, an approach is developed to handle dirty data. Cleaning the customer records and producing the desired results require a different set of effective tools and sequences, such as the near miss strategy, phonetic structure, and edit distance, to provide a suggestion table. The selection of the best match is verified and validated by the frequency of presence in the 20th century's Census Bureau statistics. Although the conducted experiments resulted in better correction rates over the well known ASPELL, JSpell HTML, and Ajax Spell Checkers, a remaining challenge is to introduce an estimation of quality factor for our Personal Name Recognizing Strategy (PNRS) model to distinguish between submitted original names and suggested name estimations from PNRS. Here, we implement a statistical distance metric as a quality measure by computing the Kullback–Leibler (K–L) distance. The K–L distance can be used to measure the distance between the probability density function of original names and the probability density function of suggested names estimated from PNRS, to assess and validate to what degree our edit distance strategy has been successful in correcting names. All names submitted as inputs to the PNRS model were taken at a maximum edit distance of 2 with respect to the original name. The Kullback–Leibler distance will be an indicator of name recognizing quality.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

In today's information age, processing customer information in a standardized and accurate manner is known to be a difficult task. Data collection methods vary from source to source by format, volume, and media type. Therefore, it is advantageous to deploy customized data hygiene techniques to standardize the data for meaningfulness and usefulness based on the organization. According to a study conducted at the University of Berkeley, 92% of new information, such as personal information, was stored on magnetic media or in digital form. Also, the World Wide Web contains about 170 petabytes of information, and a significant amount of that information is in text form. Any set of data may be considered to have a level of accuracy which directly impacts the usefulness of that data. Names are also important pieces of information when databases are duplicated and when data sets are linked or integrated and no unique entity identifiers are available (Christen, Churches, & Hegland, 2004). Therefore, the more accurate a piece of information, the more one may depend on it (Varol & Bayrak, 2005). When collecting data from surveys, advertising campaigns, or the like, knowledge of the accuracy of each individual piece of data, as well as the aggregate accuracy of the whole, can help both in making use of the data and in determining the effectiveness of various methods of data collection. Problems arise when a large amount of data is collected and each piece is subject to some kind of control mechanism. For a human being, carrying out this task could take days, months, or even years depending on the amount of data. Moreover, manual work is error-prone and may lead to different results for different investigators (Varol, Bayrak, & Ludwing, 2005). Automating the process as much as possible obviously minimizes the amount of time researchers must devote to the task.

The problem of devising algorithms and techniques for automatically correcting words in text has been a perennial research challenge since the 1960s. However, the use of a tool that has limited capabilities to correct mistyped information can often cause many problems. Moreover, most of these techniques are particularly implemented for the words used in daily conversations. Since personal names have different characteristics compared to general text, a composite algorithm which employs phonetic encoding, string matching, and statistical facts needs to be developed based on the potential sources of variations and errors that occur in a name. In addition, understanding the impact on customer satisfaction caused by ill-defined/dirty data (misspelled or mistyped data)

* Corresponding author. Tel.: +1 936 2943930; fax: +1 936 2944312.
E-mail address: [email protected] (C. Varol).

0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.11.112

is another challenge that needs to be considered. For these reasons, the Personal Name Recognizing Strategy (PNRS) is being developed and thus proposed to provide the closest match for misspelled names.

In this paper we will look into quality model practices on correction of misspelled names. The ultimate goal is to define similarity functions that match human perception, but how humans judge the similarity among submitted uncorrected names, original names, and suggested names will be a topic of ongoing research in the future.

2. Literature review

There are a number of requirements to be met by a quality model in order to build enough confidence and belief that the model correctly captures quality requirements and reflects how well those requirements have been met (Khaddaj & Horgan, 2004). Quality is a multidimensional construct reflected in a model, where each parameter in the model defines a separate quality dimension. Many of the early quality models have followed a hierarchical approach in which a set of factors that affect quality are defined with little scope for expansion (Boehm et al., 1973). Although an improvement, difficulties arise when comparing quality across projects, due to their tailored nature. QoS has been widely discussed in the areas of real-time applications (Clark, Shenker, & Zhang, 1992) and networking (Cruz, 1995; Georgiadis, Guerin, & Sivarajan, 1996). However, only a few research teams have made serious efforts to explore the field of workflow or business process automation. Crossflow (Klingemann, Wasch, & Anfaberer, 1999) and Meteor (Miller et al., 1999) are the leading projects in the field. Not only the time dimension was considered; cost-associated metrics were also designed in these projects. However, these applications are restricted only to the workflow generation part of the automation. The projects did not consider the impact of spelling errors on the outcome of the workflow process automation. Therefore, the QoS attribute for this part of the system varies and supports enhanced business process automation.

Customer expectations regarding misspelled data are hard to identify. Ill-defined data has different types and definitions. The main source of errors, known as the isolated-word error, is a spelling error that can be captured simply because it is mistyped or misspelled (Nerbonne, Heeringa, & Kleiweg, 1999). As the name suggests, isolated-word errors are invalid strings, properly identified and isolated as incorrect representations of a valid word (Becchetti & Ricotti, 1999). Typographic errors, also known as "fat fingering", are made by an accidental keying of a letter in place of another (Kukich, 1992). Cognitive errors refer to errors made through a lack of knowledge of the writer or typist (Kukich, 1992). Phonetic errors can be seen as a subset of cognitive errors. These errors are made when the writer substitutes letters they believe sound correct in a word, which in fact leads to a misspelling (Jurzik, 2006).

There are many isolated-word error correction applications, and these techniques separate the problem into three sub-problems, treated as a sequence of processes: detection of an error, generation of candidate corrections, and ranking of candidate corrections (Kukich, 1992). In most cases there is only one correct spelling for a particular word. However, there are often several valid possible name combinations for a particular one, such as 'Aaron' and 'Erin'. Also, the use of nicknames in daily life, for instance 'Bob' rather than 'Robert', makes matching of personal names more challenging compared to general text. Many variations for approximate string matching have been developed (Gong & Chan, 2006; Pfreifer, Poersch, & Fuhr, 1996; Zobel & Dart, 1996). Although most of the techniques that are going to be discussed in this paper are particularly designed for general text, some of them are used as name spelling correction algorithms as well. Two main categories are defined for the techniques used for isolated-word error correction below.

2.1. Pattern matching techniques

Pattern matching techniques are commonly used in approximate string matching (Hall & Dowling, 1980; Jokinen, Tarhio, & Ukkonen, 1996; Navarro, 2001), which is used for data linkage (Christen & Goiser, 2006; Winkler, 2006), duplicate detection (Cohen, Ravikumar, & Stephen, 2003), information retrieval (Gong & Chan, 2006) and correction of spelling errors (Kukich, 1992). The most common pattern matching technique, the edit distance, is defined as the smallest number of insertions, deletions, and substitutions required for changing one string into another (Levenshtein, 1965). The edit distance from one string to another is calculated by the number of operations (replacements, insertions or deletions) that need to be carried out to transform one string into the other (Levenshtein, 1965). Minimum edit distance techniques have been applied to virtually all spelling correction tasks, including text editing and natural language interfaces. Spelling correction accuracy varies with applications and algorithms. Damerau (1990) reports a 95% correction rate for single-error misspellings on a test set of 964 misspellings of medium and long words (length 5 or more characters) while using a lexicon of 1593 words. However, his overall correction rate was 84% when multi-error misspellings were counted. On the other hand, Durhaiw, Lamb, and Nad Sax (1983) report an overall 27% correction rate for a very simple, fast, and plain single-error correction algorithm accessing a keyword lexicon of about 100 entries. Although the rates seem low, the authors report a high degree of user satisfaction for this command language interface application due to their algorithm's unobtrusiveness.

In more recent work, Brill and Moore (2002) report experiments with modeling more powerful edit operations, allowing generic string-to-string edits. Moreover, additional heuristics are also used to complement techniques based on edit distance. For instance, in the case of typographic errors, the keyboard layout is very important: it is much more common to accidentally substitute a key for another if they are placed near each other on the keyboard. Similar measures are also used to compute a distance between DNA sequences (strings over {A, C, G, T}) or proteins. The measures are used to find genes or proteins that may have shared functions or properties and to infer family relationships and evolutionary trees over different organisms (Needleman & Wunsch, 1970). Other pattern matching algorithms are classified below.

Rule-based techniques attempt to use the knowledge gained from spelling error patterns and write heuristics that take advantage of this knowledge. For example, if it is known that many errors occur from the letters "ie" being typed "ei", then we may write a rule that represents this (Yannakoudakis & Fawthrop, 1983). The character n-gram-based technique coincides with the character n-gram analysis in non-word detection. However, instead of observing certain bi-grams and trigrams of letters that never or rarely occur, this technique can calculate the likelihood of one character following another and use this information to find possible correct word candidates (Ullman, 1977).

The Longest Common Sub-String algorithm (Friedman & Sideli, 1992) finds the longest common sub-strings in the two strings being compared. For example, the two name strings 'Martha' and 'Marhtas' have common sub-strings totaling 'Marta'. The total length of the common sub-strings is 5, out of string lengths 6 and 7. A similarity measure can be calculated by dividing the total length of the common sub-strings by the minimum, maximum or average lengths of the two original strings. As shown with the example above, this algorithm
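The minimum edit distance described in Section 2.1 can be sketched as follows; this is a standard dynamic-programming implementation for illustration, not the authors' code:

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of insertions, deletions, and substitutions
    turning string a into string b (Levenshtein, 1965)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[len(b)]
```

For example, levenshtein("Jaso", "Jason") is 1, which is why 'Jaso' falls within the threshold t = 2 used later by PNRS.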

is suitable for compound names that have words swapped. The time complexity of the algorithm, which is based on a dynamic programming approach [x], is O(|s1| × |s2|) using O(min(|s1|, |s2|)) space.

Jaro (Yancey, 2005) is an algorithm commonly used in data linkage systems. Recently this algorithm has also been used to measure the distances between words or names. The Jaro distance metric states that given two strings s1 and s2, their distance dj is:

dj = (1/3) (m/|s1| + m/|s2| + (m − t)/m)    (1)

where m is the number of matching characters and t is the number of transpositions.

The Winkler (Yancey, 2005) algorithm improves over the Jaro algorithm by applying ideas based on empirical studies which found that fewer errors typically occur at the beginning of names. The Winkler algorithm therefore increases the Jaro similarity measure for agreeing initial characters (up to four).

Naturally, n-grams can be used to calculate probabilities, and this has led to the probabilistic techniques demonstrated by Lee (1999). In particular, transition probabilities can be trained using n-grams from a large corpus, and these n-grams can then represent the likelihood of one character following another. Confusion probabilities state the likelihood of one character being mistaken for another. Neural net techniques have emerged as likely candidates for spelling correctors due to their ability to do associative recall based on incomplete and noisy data. This means that they are trained on the spelling errors themselves and carry the ability to adapt to the specific spelling error patterns that they are trained upon (Trenkle & Vogt, 1994).

2.2. Phonetic techniques

All phonetic encoding techniques attempt to convert a name string into a code according to the way a name is pronounced. Therefore, this process is language dependent. Most of the designed techniques have been mainly developed based on English phonetic structure. However, several other techniques have been designed for other languages as well (Christen, 2006).

SOUNDEX (Philips, 1990), which is used to correct phonetic spellings, maps a string into a key consisting of its first letter followed by a sequence of digits. All vowels, 'h', 'w' and 'y' are removed from the sequences to get the Soundex code of a word. It then takes the remaining English word and produces a four-digit representation (shorter codes are extended with zeros), which is a primitive way to preserve the salient features of the phonetic pronunciation of the word. A major problem with Soundex is that it keeps the first letter; thus any error at the beginning of a name will result in a different Soundex code.

Phonex (Lait & Randell, 1993) tries to improve the encoding quality by pre-processing names according to their English pronunciation before the encoding. All trailing 's' are removed and various rules are applied to the leading part of a name (for example 'kn' is replaced with 'n', and 'wr' with 'r'). Like Soundex, the leading letter of the transformed name string is kept and the remainder is encoded with numbers (1 letter, 3 digits).

The Phonix algorithm is an improvement on Phonex and applies more than one hundred transformation rules on groups of letters (Gadd, 1990). Some of these rules are limited to the beginning of a name, some to the end, others to the middle, and some will be applied anywhere.

The Metaphone algorithm is also a system for transforming words into codes based on phonetic properties (Philips, 1990). Unlike Soundex, which operates on a letter-by-letter scheme, Metaphone analyzes both single consonants and groups of letters called diphthongs according to a set of rules for grouping consonants, and then maps groups to metaphone codes.

The closest one to our approach, Syllable alignment distance (Gong & Chan, 2006), is based on the idea of matching two names syllable by syllable, rather than character by character. It uses Phonix transformations as a preprocessing step and then applies rules to find the beginning of syllables. The distance between two strings of syllables is then calculated using an edit distance based approach.

3. Methodology

3.1. Personal Name Recognizing Strategy

The Personal Name Recognizing Strategy (PNRS) (Varol, Bayrak, Wagner, & Goff, 2010) is based on the results of a number of strategies that are combined in order to provide the closest match (Fig. 1). The near miss strategy and phonetic strategy are used to provide the suggestions at the identical time and weight; since the input data heavily involves names with a possible international scope, it is difficult to standardize the phonetic equivalents of certain letters. Once we have a list of suggestions, an edit-distance algorithm is used to rank the results in the pool. In order to provide meaningful suggestions, the threshold value t is defined as 2. In cases where the first and last characters of a word did not match, we modified our approach to include an extra edit distance. The main idea behind this approach is that people generally get the first character and last character correct when trying to spell a word. Another decision mechanism is needed to suggest the best possible solution for individual names. The rationale behind this idea is that at the final stage there is a possibility of coming up with several possible candidate names for a mistyped word that are one or two edit distances from the original word. Relying on edit distance alone often does not provide the desired result. The US Census Bureau study, which compiled a list of popular first and last names scored by the frequency of those names within the United States (Census, 1990), is added as the decision making mechanism. This allows the tool to choose a "best fit" suggestion and eliminate the need for user interaction. The strategies are applied to a custom dictionary which is designed particularly for the workflow automation.

Fig. 1. PNRS Strategy.

3.1.1. Correction rates of PNRS

The experimental data is a real-life sample which involves personal and associated company names, addresses including zip
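Two of the surveyed measures can be made concrete with a short sketch. The Jaro function below follows Eq. (1) with the usual matching-window convention, the Winkler variant adds the standard prefix boost (up to four agreeing initial characters, with the common scaling factor p = 0.1), and the Soundex coder follows the widely used American rule set; all three are illustrative reconstructions, not the paper's implementations, and details such as the H/W handling in Soundex may differ from the variant used in the experiments:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro distance, Eq. (1): dj = (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    # Characters match if equal and within this sliding window of each other.
    window = max(len(s1), len(s2)) // 2 - 1
    flags2 = [False] * len(s2)
    matched1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not flags2[j] and s2[j] == c:
                flags2[j] = True
                matched1.append(c)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    matched2 = [c for c, f in zip(s2, flags2) if f]
    # t = half the number of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matched1, matched2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for up to four agreeing initial characters."""
    dj = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)

def soundex(name: str) -> str:
    """Four-character Soundex code: first letter kept, consonants mapped
    to digits, vowels and 'h', 'w', 'y' dropped, padded with zeros."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result = name[0]
    last = codes.get(name[0], "")
    for c in name[1:]:
        digit = codes.get(c, "")
        if digit and digit != last:
            result += digit
        if c not in "HW":  # 'h'/'w' between same-coded consonants: code once
            last = digit
        if len(result) == 4:
            break
    return result.ljust(4, "0")
```

For instance, jaro("MARTHA", "MARHTA") gives the classic 0.944 (m = 6, t = 1), and soundex("Robert") and soundex("Rupert") both yield "R163", illustrating why a pure phonetic key cannot distinguish such names.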

code, city, state, and phone numbers of individuals. Dirty data is present in a total of 4000 records, including misspelled names and some non-ASCII characters. 2,648 records were correctly fixed by PNRS, 119 records were identified as valid names, while 1233 records were corrected but produced different names, as reflected in Fig. 2 and Table 1.

- Fixed | Matched → Exact correction of the misspelled name.
- Fixed | No Match → Corrections that provide no match with the original name.
- Match | No Match → Either the input is accepted as a valid name or the system failed to provide any suggestions.

Fig. 2. PNRS correction rate for test case.

Table 1
Definition of PNRS result.

Misspelled name   Original name   PNRS suggestion   State of result
Siteffaannny      Stephanie       Siteffaannny      Match | No Match
Luiz              Luis            Luiz              Match | No Match
Rca               Rice            Ryan              Fixed | No Match
Jaso              Jason           Jason             Fixed | Matched

3.1.2. Correction rate comparison with other suggestion algorithms

In order to evaluate the effectiveness of the tool, experiments are conducted not only on the current correction algorithm, PNRS, but also on well known general text spelling correction tools, such as Soundex, Phonex, Phonix, DMetaphone, Levenshtein edit distance, LCS, Jaro–Winkler, and n-grams. PNRS averaged a 66% correction rate on the test. Since there is no concrete decision mechanism when more than one candidate has the same score with the Soundex, Phonex, Phonix, DMetaphone, Levenshtein edit distance, LCS, Jaro–Winkler, and n-gram algorithms, it is a challenge to claim which algorithm performs well. However, if we look at the minimum and maximum full correction rates (Table 2) among all these algorithms and PNRS, we would arguably claim the correction rate of PNRS is satisfactory.

Table 2
Minimum and maximum correction rates of the algorithms.

Strategy                    Minimum and maximum Fixed | Matched percentage for test case
Soundex                     49.9–52.2%
Phonex                      48.7–51.3%
Phonix                      48.8–49.1%
DMetaphone                  51.9–52.6%
Levenshtein edit-distance   41.2–55.1%
LCS                         57.7–64.3%
Jaro–Winkler                64.4–65.6%
2-Grams                     57.2–61.5%
3-Grams                     52.2–58.3%
PNRS                        66.2%

3.2. Similarity measurement

Finding good similarities between suggested names and original names is a challenging task. Finding the common properties is a measure of the shared information between these words/names: the greater the common properties, the more similar the two names. Since a good distance measure should be invariant under a change of base measure, we introduce a measure that assesses the opposite of the common property: the relative entropy, also known as the Kullback–Leibler distance. In this study, we will use the two main variables that we have used with the PNRS strategy: the probability density function of edit-distance scores and the probability density function of the frequency of usage of the names. The overall system architecture is shown in Fig. 3.

The Kullback–Leibler distance, in other words the relative entropy, is a non-commutative measure of the difference between two probability distributions p and q, where p represents the "true" distribution of the data and q represents the model or approximation of p (Bilgi, 2003). In this study, the Kullback–Leibler distance between the two PDFs p(x; θ_q) and p(x; θ_i), corresponding to the query original names and the suggested name estimations respectively (e.g. Rice and Ryan; see Table 1), leads us to define the retrieval rate in terms of understanding estimation quality. We denote the whole space of PNRS names from the Census database and original model names as θ ∈ Θ. A minimum K–L distance means a high estimation of the quality of service. We extract a feature vector P, which is in fact the Census PNRS frequency in percentage and the edit distance between original names and PNRS suggested names. We should remember that, given the submitted uncorrected names, extracting the features of PNRS suggested names has already been done by a Maximum Likelihood Estimator.

Since we have a total of 1233 not fully corrected names, we select the N = 1233 notches by making the similarity measurement.

D(p(X; θ_q) ‖ p(X; θ_i)) = ∫ p(x; θ_q) log [ p(x; θ_q) / p(x; θ_i) ] dx    (2)

To combine the K–L distance from names implementing multiple attributes, like X = Edit-Distance PNRS and Y = Census PNRS Frequency data set, we can use the chain rule, which states that the K–L distance between two joint PDFs p(X, Y) and q(X, Y) is:

D(p(X, Y) ‖ q(X, Y)) = D(p(X) ‖ q(X)) + D(p(Y|X) ‖ q(Y|X))    (3)

The data is considered to be independent across Edit-Distance PNRS and Census PNRS Frequency; thus the joint K–L distance becomes simply the sum of the two K–L distances from each attribute set instead of the above equation.

When searching for names similar to those in the query names (i.e. Jason and other names from the Census database) x_q with their attributes as edit distance and frequency p(q), we can introduce the K–L model in a discrete version as a distance between the query original name and each candidate PNRS suggested name, which can be used as a rank among the names in the Census database. This discrete version is written with one attribute (normalized edit distance) and can be easily expanded into two attributes by just adding another attribute (frequency) of PNRS names (i = 1, 2). q_i denotes the original name, where i denotes the index of each attribute; 1233 reflects the total number of correction attempts that produced a no match result. In our example we implement two attributes. The K–L distance D will give us a ranking status of how well the PNRS strategy works. As can be easily understood, we intend to see the queried names in the first rank of the K–L calculations according to the two attributes, which are assumed to be independent of each other statistically.
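The discrete form of Eq. (2) and its additivity over independent attributes via the chain rule of Eq. (3) can be sketched as follows; the histograms here are toy numbers for illustration, not the paper's 1233-record data:

```python
import math

def kl_divergence(p, q):
    """Discrete Kullback-Leibler distance D(p || q), the discrete form of
    Eq. (2). Assumes p and q are probability vectors over the same bins,
    with q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Two attributes, as in Section 3.2: a normalized edit-distance histogram
# and a Census name-frequency histogram (illustrative values only).
p_edit, q_edit = [0.7, 0.3], [0.5, 0.5]
p_freq, q_freq = [0.2, 0.5, 0.3], [0.25, 0.25, 0.5]

# With the two attributes independent, the chain rule of Eq. (3) reduces
# the joint distance to the sum of the per-attribute distances.
joint = kl_divergence(p_edit, q_edit) + kl_divergence(p_freq, q_freq)
```

Note that D(p ‖ q) is zero exactly when the two histograms agree, and it is non-commutative: kl_divergence(p, q) generally differs from kl_divergence(q, p), which is why the paper fixes p as the original-name distribution.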

Fig. 3. The K–L distance role in name retrieval system architecture for estimation of quality of service.

D(p^(q) ‖ p^(PNRS)) = Σ_{i=1}^{2} Σ_{k=1}^{1233} p_{k,i}^{(q)} log [ p_{k,i}^{(q)} / p_{k,i}^{(PNRS)} ]
                    = Σ_{i=1}^{2} [ Σ_{k=1}^{1233} p_{k,i}^{(q)} log p_{k,i}^{(q)} − Σ_{k=1}^{1233} p_{k,i}^{(q)} log p_{k,i}^{(PNRS)} ]    (4)

For example, "Jason" comes from the query model p(X; θ_q), which generates all query names x = (x1, x2, . . . , xL) ∈ X from the Census data. The introduced K–L model is developed from the probability distributions of edit distance and frequency of usage of the names. We also introduce the PNRS Distance Metric (PNRSDM) to calculate

Fig. 4. (a) The histogram of the PNRS and original records, (b) histogram of edit distances of the PNRS and original records, (c) probability density functions of the PNRS and original records.

the similarity between two names based on the character Cruz, R. L. (1995). Quality of service guarantees in virtual circuit switched networks.
IEEE Journal of Selected Areas Communications, 13(6), 1048–1056.
difference.
Damerau, F. J. (1990). Evaluating computer generated domain-oriented
( ) vocabularies. Information Processing and Management, 26, 791–801.
1Xi¼2
LCS Durhaiw, I., Lamb, D. A., & Nad Sax, J. B. (1983). Spelling correction in user
PNRSDM ¼ n ¼ 2; if ti ¼ 0 ð5Þ
n i¼1 jSi j interfaces. ACM, 26, 764–773.
Friedman, C., & Sideli, R. (1992). Tolerating spelling errors during patient validation.
( ) Computers and Biomedical Research, 25, 486–509.
i¼2  
1X LCS jSi j  ti Gadd, T. (1990). PHONIX: The algorithm. Program: Automated Library and
PNRSDM ¼ þ n ¼ 4; if ti > 0 ð6Þ Information Systems, 24(4), 363–366.
n i¼1 jSi j jSi j Georgiadis, L. R., Guerin, V. P., & Sivarajan, K. (1996). Efficient network QoS
provisioning based on per node traffic shaping. IEEE ACM Transactions on
where, LCS is the number of the longest string (or strings) that is a Networking, 4(4), 482–501.
substring (or are substrings) of two or more strings, Si is the number Gong, R., & Chan, T. K. (2006). Syllable alignment: A novel model for phonetic string
search. IEICE Transactions on Information and Systems, E89-D(1), 332–339.
of characters in the ith element, and ti is the number of transposi- Hall, A., & Dowling, G. R. (1980). Approximate string matching. ACM Computing
tions in the ith element. PNRSDM is a normalized similarity mea- Surveys, 12(4), 381–402.
sure between 1.0 (strings are the same) and 0.0 (strings are Jokinen, P., Tarhio, J., & Ukkonen, A. (1996). A comparison of approximate string
matching algorithms. Software – Practice and Experience, 26(12), 1439–1458.
totally different).

4. Discussions and conclusion

In order to evaluate the PNRS and the statistical distance metrics, and to understand how close the suggestions come to the ‘‘Fixed’’ names in our No Match names list, the 4,000 misspelled names mentioned earlier were used. PNRS corrected 66.2% of the names, performing better than the other techniques discussed. On the other hand, the distance between the original names and the suggested names is defined as the joint KLD. We found that the joint KLD is simply the sum of the edit-distance KLD and the Census PNRS Frequency KLD acquired from each independent data set. We obtained KL = 4.9195 from the histogram of Census PNRS Frequency (Fig. 4(a)) and the histogram of edit distances between the suggested names and the original query names (Fig. 4(b)). Moreover, the probability density functions for each vector are presented in Fig. 4(c). The PNRS Distance Metric averaged 0.483927 over 1233 records.
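The additivity used above — that the KL divergence of a joint distribution over independent features equals the sum of the per-feature KL divergences — can be sketched as follows. The histogram counts are illustrative placeholders, not the paper's Census or edit-distance data:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q), skipping empty bins of p."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Illustrative bin counts for a name-frequency feature and an
# edit-distance feature (hypothetical data).
freq_p, freq_q = [40, 30, 20, 10], [25, 25, 25, 25]
edit_p, edit_q = [50, 30, 15, 5], [40, 30, 20, 10]

kl_freq = kl_divergence(freq_p, freq_q)
kl_edit = kl_divergence(edit_p, edit_q)

# If the two features are independent, the joint distribution is the
# outer product of the marginals, and the joint KLD factorizes:
#   D(P1*P2 || Q1*Q2) = D(P1 || Q1) + D(P2 || Q2)
joint_p = np.outer(np.array(freq_p) / 100, np.array(edit_p) / 100).ravel()
joint_q = np.outer(np.array(freq_q) / 100, np.array(edit_q) / 100).ravel()
kl_joint = kl_divergence(joint_p, joint_q)

assert abs(kl_joint - (kl_freq + kl_edit)) < 1e-9
```

This is why the two KLDs measured on the independent data sets can simply be summed to obtain the joint KLD.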
Acknowledgments

The authors extend their recognition to Dr. Rustu Murat Demirer for being instrumental during the implementation stage.