Fuzzy Matching in 4th Dimension: Sources of Error
By David Adams
Technical Note 06-18
Overview
------------------------------------------------------------------------------------------------------------------------------------------------
Consider how easily one real-world name can end up in a database in several slightly different forms:
Andrew Wallace
Andrew Wallac
Andrew Walace
Andrw Wallace
Andrw Wallac
Andrw Walace
Sources of Error
Software is great at matching identical values but not usually so good at matching
similar values. Identical real-world items can end up represented in multiple, slightly
different, records very easily. The causes of error are commonplace and numerous,
including:
Mechanical data entry errors, such as typos and transcription mistakes.
Misspellings, a particular problem with proper names as data entry and lookup are
often based on what a name sounds like rather than how it is written.
Data which can be legitimately or plausibly spelled or abbreviated in more than one
way, such as the components of names and addresses.
Values which do not have a universally consistent format, such as dates.
Flawed data imported from another database or OCR (Optical Character
Recognition) system.
Flawed data entry from Web queries or data submissions. This is an increasingly
significant source of data quality problems since Web-based systems are potentially
available to a global audience with a wide range of language and typing skills.
Names or other values that have changed spellings over time, such as surnames
that have been modified over several generations or by transportation to a new
country.
Names or other values that vary in spelling over distance, such as the regional
variations in many words, names, and abbreviations.
Whatever the original source of the error, inconsistency, or ambiguity, matching
values or full records based on similarity is a challenge for many classes of software,
including databases, data warehouses, and spell-checkers. Fortunately for us as
4th Dimension developers, fuzzy matching is a problem that well-funded groups have
to deal with. Consequently, several decades of excellent computer science have gone
into developing, testing, and refining effective fuzzy matching algorithms. This
technical note explains two well-established approaches to fuzzy matching: phonetic
transcription and string distance measurements. The phonetic algorithms are biased
towards English while the distance algorithms should work well for comparison of
strings of any kind. This technical note provides detailed information on the
algorithms, their strengths, limitations, and appropriate uses.
Before looking into solutions in detail, let's review some scenarios that illustrate cases
when fuzzy matching is desirable or necessary.
A help desk in Bangalore takes support calls for registered users of a popular
software package. While customers are supposed to have a registration number
when they call, the support staff is trained not to refuse service if a valid customer
record can be found. Unfortunately, the support staff often fails to find valid records
because the exact spellings of last names and the formatting of addresses in the
database are inconsistent. When this happens, no service is provided, leaving
legitimate customers frustrated and upset with the company. In this case, phonetic
and distance-based matching increase the chances of matching a caller to an
existing registration.
A local regional government is undertaking a project to increase voter participation
in their area. The government has records of voter registrations, voter participation,
land titles, and waste disposal fees for the past twenty years. They need to link and
consolidate these sources of data to get a realistic baseline of the existing
population's voting behavior. In this case, multiple forms of fuzzy matching enable
the council to substantially automate the linkage process.
At a busy regional hospital emergency room, the intake staff try to match incoming
patients to existing records. It is important to find matches because, otherwise, any
existing medical history on file will not be accessible to the doctor assessing the
patient. This can make a life-or-death difference if, for example, the patient has a
drug allergy. Unfortunately, the atmosphere in the emergency room is always
rushed, so intake staff have about twenty seconds to match a patient before
giving up and creating a new "temporary" record. The staff and hospital
administrators are all aware that many existing patient records are not matched and
turn to fuzzy matching solutions.
After observation and analysis of the database and intake system, it turns out that
fuzzy matching can help at the system level and on the hospital floor. The central
database consolidates records from all of the clinics and hospitals in the area.
Because of data entry errors, inconsistencies in abbreviations, and changes in
patient addresses, many duplicate records exist. Fuzzy matching can help automate
reducing the level of duplication by comparing records for overall similarity based on
name, address, and other key fields. On the hospital floor, intake staff often search
for names based on what they hear rather than an exact spelling. Adding a phonetic
search to the system and showing the matches sorted by similarity to the search
term significantly increases the chances of finding an existing record without slowing
down the intake process.
While the scenarios above are imaginary and probably don't describe any of your
projects precisely, they are typical of how and when fuzzy matching is applied in the
real world. With these stories in mind, or any scenarios from your own work, let's look
at the tools provided with this technical note.
Included with this technical note are a source code database, the FuzzyTools
component, a sample database that uses the component, and a selection of sample
data sets in text files for experimentation. The sample database provided includes a
screen for experimenting with phonetic keys and distance measurements for a
collection of data. Two sample data sets are included with the demonstration, one with
about 15,000 surnames and another with about 5,000 place names. To get a better
sense of how the tools in the component work, create a fresh data file and import
some values from one of your systems. For the import, prepare a text file with two
columns:
[Sample]Word             Alpha 80
[Sample]Data_Set_Name    Alpha 20
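For example, a tab-delimited import file might contain lines like these (the words and the data set name are only placeholders):

Smith	Surnames
Smyth	Surnames
Schmidt	Surnames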
Proper names, in particular, are often spelled in several ways due to mistakes and historical or regional variation.
Several algorithms have been developed over the years to address this problem.
Soundex is the most universally known phonetic name translation algorithm for
English. Some version of Soundex is found, for example, in every major SQL
database. Soundex was originally developed to index US Census records from the 1880s onward, well
before the advent of computers. Several other, more modern, algorithms offer better
performance and pattern matching success than Soundex. Before reviewing some
options, let's review why sounds-like matching is needed at all.
English has five or six letters for representing vowels but far more vowel sounds.
Phonetically, vowels are sounds made by opening your mouth and letting the air out
while your vocal cords vibrate. (The tongue does not block the airflow in the production of vowels.)
Described differently, vowels are produced without closing the mouth. Consider the
letter o in the words below:
hot
don
poke
While the same vowel letter is used in each word, three very different sounds are
indicated: h-AHHH-t, d-AWW-n, p-OWW-ke, at least in my particular American
accent.
To make matters even more confusing, the exact number of vowel sounds (vowels
and diphthongs) depends on accent or dialect. For example, contemporary Australian
English accents generally have about 20 vowel sounds and contemporary American
varieties of English typically have 12-16 vowel sounds. Despite this, their spelling
systems are fundamentally the same. (The spelling differences that do exist almost never
involve letters that are pronounced.)
While we've quickly discussed vowels, the same points apply to consonant sounds. For
example, the sound f may be spelled in at least four ways in English:
frank
taffy
Philip
rough
Likewise, the same word may be written in more than one recognizable way:
night
nite
There are many well-known phonetic transcription algorithms for English, including
Soundex, Metaphone, Double-Metaphone, NYSIIS, Phonix, Caverphone, amongst
others. All of these algorithms perform the same task but with different letter-to-phoneme rules. The FuzzyTools system implements four variants of Soundex,
Metaphone, and a tool called Skeleton Key. In practice, Metaphone appears superior
to the other algorithms implemented at dealing with surnames. The other algorithms
are included for comparison purposes and, in the case of Soundex, because the
algorithm is so widely used. You can test your own data in the sample database to
determine how each approach performs. While reading this, you may find it useful to
open the demonstration database and look at some last names or words and see how
they are encoded by each algorithm. Just select Show Words from the Demo menu
and double-click on any word. If you want to try testing out some values of your own,
you can import them. Alternatively, use the Compare Strings demonstration to enter
one or two strings and see how they are encoded. Let's review the algorithms next.
Soundex
Soundex was adopted to index US Census records from the late 1800s onward, both to reduce errors and to
simplify access to the records in the future. Even at that time, the Census was aware that
names were changing and the Census data would be a significant historical
demographic and genealogical data source. Because Soundex was designed as a
manual system, it is quite simple. You can read a short history of Soundex here:
http://www.archives.gov/genealogy/census/soundex.html
The following brief summary of the Soundex encoding rules is adapted from the same
page. Every Soundex code consists of a letter and three numbers, such as W252. The
letter is always the first letter of the surname. The numbers are assigned to the
remaining letters of the surname according to the Soundex guide shown below.
Zeroes are added at the end if necessary to produce a four-character code. Additional
letters are disregarded. Examples:
Washington is coded W252 (W, 2 for the S, 5 for the N, 2 for the G, remaining
letters disregarded).
Lee is coded L000 (L, 000 added).
Number   Letters
1        B, F, P, V
2        C, G, J, K, Q, S, X, Z
3        D, T
4        L
5        M, N
6        R

The letters A, E, I, O, U, H, W, and Y are not assigned a number.

The FuzzyTools component implements four Soundex variants:

Method               Notes
Soundex_Knuth        Produces a four-character Soundex key using code based on
                     Knuth's The Art of Computer Programming.
Soundex_Miracode
Soundex_Simplified
Soundex_SQLServer
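As an illustration of the guide above, here is a minimal Soundex sketch written as a 4D project method. It is not the FuzzyTools implementation, and it ignores the refinements (such as the treatment of H and W, or of a second letter that shares the first letter's code) that distinguish the variants listed above:

  ` Minimal Soundex sketch, not the FuzzyTools implementation
  ` $1 = surname; $0 = four-character Soundex key
C_TEXT($0;$1;$word_t;$code_t;$char_t;$digit_t;$previous_t)
C_LONGINT($index_l)

$word_t:=Uppercase($1)
$code_t:=Substring($word_t;1;1)  ` the key always starts with the first letter
$previous_t:=""

For ($index_l;2;Length($word_t))
   $char_t:=$word_t[[$index_l]]
   Case of
      : (Position($char_t;"BFPV")>0)
         $digit_t:="1"
      : (Position($char_t;"CGJKQSXZ")>0)
         $digit_t:="2"
      : (Position($char_t;"DT")>0)
         $digit_t:="3"
      : ($char_t="L")
         $digit_t:="4"
      : (Position($char_t;"MN")>0)
         $digit_t:="5"
      : ($char_t="R")
         $digit_t:="6"
      Else
         $digit_t:=""  ` vowels and the letters H, W, and Y are not coded
   End case
   If (($digit_t#"") & ($digit_t#$previous_t))  ` adjacent letters with the same code are coded once
      $code_t:=$code_t+$digit_t
   End if
   $previous_t:=$digit_t
End for

$0:=Substring($code_t+"000";1;4)  ` pad with zeroes and truncate to four characters

Called with "Washington", this sketch returns "W252", matching the worked example above.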
Skeleton Key
The skeleton key of a word consists of its first letter, followed by the consonants in the
source word in order of appearance, followed by the vowels in the source word in
order of appearance. This key contains every letter from the original string at most
once. As an example, the word Washington is encoded as WSHNGTAIO. W is the
first letter, SHNGT are the remaining consonants, and AIO are the vowels. The
skeleton key system is part of an approach to spell-checking discussed in Automatic
Spelling Correction in Scientific and Scholarly Text by Pollock and Zamora, a
much-cited paper first published in 1984. The rest of the strategy outlined in that
paper is not implemented here. Why is this incomplete adaptation included at all?
When we consider how to pick a good phonetic transcription algorithm, the skeleton
key method provides a helpful point of comparison. This system tends to produce
codes that are close to unique. Therefore, you get very few false positives (words
sharing a code that aren't meaningfully similar) but, consequently, it does little to
accurately match similar values.
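Because the encoding is so simple, a sketch fits in a few lines of 4D code. This is an illustration of the description above, not the routine shipped in FuzzyTools, and it assumes the source word contains only letters:

  ` Skeleton key sketch: first letter + unique consonants + unique vowels
  ` $1 = source word; $0 = skeleton key
C_TEXT($0;$1;$word_t;$char_t;$consonants_t;$vowels_t)
C_LONGINT($index_l)

$word_t:=Uppercase($1)
$consonants_t:=""
$vowels_t:=""

For ($index_l;2;Length($word_t))
   $char_t:=$word_t[[$index_l]]
   If ($char_t#$word_t[[1]])  ` every letter appears at most once, including the first
      If (Position($char_t;"AEIOU")>0)
         If (Position($char_t;$vowels_t)=0)
            $vowels_t:=$vowels_t+$char_t
         End if
      Else
         If (Position($char_t;$consonants_t)=0)
            $consonants_t:=$consonants_t+$char_t
         End if
      End if
   End if
End for

$0:=Substring($word_t;1;1)+$consonants_t+$vowels_t  ` "Washington" returns "WSHNGTAIO"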
Metaphone
The Metaphone algorithm was originally designed by Lawrence Philips in 1990 to
produce phonetic transcriptions superior to Soundex. While not perfect, Metaphone is
markedly superior to Soundex in the data sets I've tested.
Now, let's look at Metaphone's rules. Metaphone produces a variable-length code
based on an original string. If there is an initial vowel, it is retained. All other vowels
are dropped. All other letters/letter groups are recoded into one of the following
consonant sounds:
B X S K J T F H L M N P R 0 W Y
Note that what could be mistaken for an "oh" is actually a zero, used to stand in for
the English sound "th".
There are a small number of exceptions for word beginnings, summarized in
the table below:
Begins With   Rule                    Example     Transformation
ae            Drop the first letter   Aebersold   ebersold
gn            Drop the first letter   Gnagy       nagy
kn            Drop the first letter   Knuth       nuth
pn            Drop the first letter   Pniewski    niewski
wr            Drop the first letter   Wright      right
x             Change to "s"           Xiaopeng    siaopeng
wh            Change to "w"           Whalen      walen
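In code, these preliminary adjustments amount to a short Case of statement run before the main letter-by-letter scan. This is a sketch of the table above, not an excerpt from the component:

  ` Word-beginning exceptions, applied before the main Metaphone scan
C_TEXT($1;$word_t;$start_t)

$word_t:=Uppercase($1)
$start_t:=Substring($word_t;1;2)

Case of
   : (($start_t="AE") | ($start_t="GN") | ($start_t="KN") | ($start_t="PN") | ($start_t="WR"))
      $word_t:=Substring($word_t;2)  ` drop the first letter: "Knuth" becomes "NUTH"
   : (Substring($word_t;1;1)="X")
      $word_t:="S"+Substring($word_t;2)  ` "Xiaopeng" becomes "SIAOPENG"
   : ($start_t="WH")
      $word_t:="W"+Substring($word_t;3)  ` "Whalen" becomes "WALEN"
End case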
The tail of the letter-by-letter transformation table gives a flavor of the remaining rules:

Letter   Rule
S        X (sh) if followed by "h" or within "-sio-" or "-sia-"; S otherwise.
T        X (sh) if within "-tia-" or "-tio-"; 0 (th) if before "h"; silent if within "-tch-"; T otherwise.
V        F
W        Silent if not followed by a vowel; W if followed by a vowel.
X        KS
Y        Silent if not followed by a vowel; Y if followed by a vowel.
Z        S
As you can see, there are a lot of rules to Metaphone. The FuzzyTools Metaphone
implementation is based on the Metaphone.java source code file found in the Apache
Jakarta Project Codec source, available here:
http://jakarta.apache.org/site/downloads/downloads_commons-codec.cgi
Since its introduction, various refinements of Metaphone have been advanced, most
notably Philips' Double-Metaphone algorithm, released in 2000 and readable here:
http://www.cuj.com/documents/s=8038/cuj0006philips/
The commented C++ source code for Double Metaphone runs to over 850 lines and is
not implemented in FuzzyTools. Ultimately, ad-hoc algorithms such as Metaphone,
NYSIIS, Soundex, and so on reach a dead end. Each time a new transcription rule or
exception is recognized, the code has to be rewritten. We'll discuss more flexible rule-driven strategies a bit further on. Before that, however, let's address the most
immediate and practical question: how to pick an algorithm to use with your data.
Import your test words with a common data set name into the [Sample] table and the system prepares all of the
keys needed for testing. The fields to import into are defined below:

[Sample]Word             Alpha 80
[Sample]Data_Set_Name    Alpha 20
Tip
When working with the data, it is best to run the database compiled for better
performance.
The data from the screen above is repeated below for legibility. (Metaphone4 and
Metaphone6 produce the same results, in this case, so they are only shown once
below.)
Method         Knuth   Miracode   Simple   SQLServer   Skeleton Key   Metaphone
Code           Z000    Z000       Z200     Z200        ZYSK           SSK
Matches        3       3          5        6           1              8

Knuth:         Zksko, Zyki, Zysko
Miracode:      Zksko, Zyki, Zysko
Simple:        Zackay, Ziak, Zug, Zyki, Zysko
SQLServer:     Zackay, Ziak, Zksko, Zug, Zyki, Zysko
Skeleton Key:  Zysko
Metaphone:     Cisco, Sasaki, Seesock, Sisco, Siscoe, Sisk, Suzuki, Zysko
While some of the comments about Soundex have been disparaging, it can still be a help in the real
world. A harder problem to contend with is false negatives (missed positives). We'll
consider this point again later when comparing the results of using phonetic and
distance-matching techniques to locate matches. Briefly, however, you should know
that weighted distance measurements find far more true positives, but at a higher
runtime cost than phonetic matching against stored phonetic codes.
Combining Techniques
Two imperfect fuzzy matching systems, when combined, can produce better results
than either one alone. For example, you could use Metaphone4 to match words and
then rank the results to give extra weight to words also matched by Soundex. A
common and very powerful way of combining techniques is to start with a phonetic
match and then refine or rank the results using one of the string-comparison
algorithms discussed later. You can see this idea in action on the second screen of the
sample word input form, pictured below:
This page presents an alphabetic list of all of the words found by any of the phonetic
matching algorithms. An x in a phonetic matching column indicates that the
corresponding algorithm matched the word in that row. In the example above, "Cisco" was
only matched by Metaphone4 and Metaphone6 while "Zackay" was matched by the
Simple and SQLServer variants of Soundex.
On the right side of the screen six different measurements are listed, derived from the
string distance algorithms discussed later. Note that the Jaro, Lynch, McLaughlin, and
Winkler algorithms return real numbers on the scale 0=unlike strings and 1=identical
strings. Therefore, smaller values signify less similarity. The "Edit" (edit distance)
algorithm returns an absolute count of the number of changes required to transform
one string into another. Therefore, the score increases as strings are less similar. The
LCS (Longest Common Subsequence) score is the length of the longest shared
subsequence within the two strings, ignoring non-matching characters. Therefore,
scores increase as the strings are more similar. If you don't want to worry about the
differences between these scoring systems, tick the Normalize number scales check-box.
We'll return to these weighting schemes later, but, for now, notice that the matched
words can be ranked based on their calculated similarity. For example, the values are
listed below ranked by their raw edit distance scores and Lynch weights compared
with Zysko. Raw scores and percentages are shown for comparison.
Sample       Raw Scores               Percentages
Word         Edit   LCS   Lynch       Edit    LCS     Lynch
Zysko        0      5     1.000       100.0   100.0   100.0
Zksko        1      4     0.886       80.0    80.0    88.6
Zyki         2      3     0.862       60.0    60.0    86.2
Sisk         3      2     0.723       40.0    40.0    72.3
Sisco        3      2     0.720       40.0    40.0    72.0
Ziak         3      2     0.710       40.0    40.0    71.0
Siscoe       4      2     0.687       33.0    33.0    68.7
Suzuki       5      2     0.683       16.0    33.0    68.3
Cisco        3      2     0.680       40.0    40.0    68.0
Zackay       4      2     0.653       33.0    33.0    65.3
Sasaki       4      2     0.651       33.0    33.0    65.1
Seesock      6      2     0.630       14.2    28.5    63.0
Zug          4      1     0.560       20.0    20.0    56.0
As you can see, the higher-quality match suggestions are ranked similarly by the
different similarity calculating algorithms. The rankings are not, however, identical.
You can combine the findings of multiple distance calculation schemes if you wish to
attempt to automatically refine a list of matches more precisely.
Apart from ranking suggestions, it is possible to filter results to avoid too many false
positives. In the case above, filtering out matches with an edit distance over 3
reduces the list of possible matches to a fairly reasonable set of candidates:
Zysko
Zksko
Zyki
Cisco
Sisco
Sisk
Ziak
Note
The edit distance algorithm implemented here is called the Levenshtein distance, the
same algorithm used by a wide range of spell-checkers to rank word suggestions.
Limiting the results to matches with a Lynch weight of .70 (70% similarity) or higher
produces a nearly identical list:
Zysko
Zksko
Zyki
Sisk
Sisco
Ziak
While still imperfect, it may be better to reduce the possibilities to a reasonable
number when presenting choices to a user or automatically performing duplicate
checking scans. We'll look at how to construct duplicate checking scans and the
distance-calculating algorithms in more detail below. First, let's follow up on a topic I
mentioned: rule-driven phonetic transcription algorithms.
Overview
It's easy to believe that there are dozens of different phonetic translation algorithms
available, when you consider the many variants of the basic algorithms outlined
already. In fact, I think it's more reasonable to say that there is really only one
algorithm implemented with different hard-coded rule sets. The algorithm looks like
this in pseudo-code:
Preprocess the source string
Scan through the string from start to end, applying rules to each character
Post-process the resulting code
Primarily, the rules depend on each character's value, position, and neighbors. Why,
then, are there so many different algorithms? A related question is why are there so
many algorithms for phonetic transcription of a single language? The answer to both
questions is the same: these algorithms have been developed in an ad-hoc manner to
capture the rules of a specific accent without significant assistance from linguists. In
fact, linguists consider phonetics to be a rule-driven system. It only makes sense to
solve the problem with a rule set. This approach is a perfect example of what is often
called table-driven or data-driven programming.
For example, here is an excerpt from the hard-coded handling of the letter C in the FuzzyTools Metaphone implementation:
$isFollowedByAVowel_b:=Metaphone_CharacterIsAVowel (Metaphone_GetNextChararacter ($working_string;$character_index+1))
	Case of
		: ($character_index=1)  ` at the start of a word: K
			$workingCode_s40:=$workingCode_s40+"K"
		: ($isAtEndOfWord_b)  ` at the end of a word: X (sh)
			$workingCode_s40:=$workingCode_s40+"X"
		: ($isFollowedByAVowel_b)  ` followed by a vowel: X (sh)
			$workingCode_s40:=$workingCode_s40+"X"
		: ($isFollowedByAVowel_b=False)  ` followed by a consonant: K
			$workingCode_s40:=$workingCode_s40+"K"
	End case
Else   ` all remaining cases of C: K
	$workingCode_s40:=$workingCode_s40+"K"
End case
I picked the case of C deliberately as it is one of the most complex in the Metaphone
system. In Double Metaphone, the C is even more complex. The rules shown above
are easier to follow if extracted:
If the C is within "sce", it is dropped.
If the C is within "sci", it is dropped.
If the C is within "scy", it is dropped.
If the C starts "cia", it is transcribed as "X".
If the C starts "ci", it is transcribed as "S".
If the C starts "ce", it is transcribed as "S".
If the C starts "cy", it is transcribed as "S".
If the C is within "sch", it is transcribed as "K".
If the C starts "ch" at the start of a word, it is transcribed as "K".
If the C starts "ch" at the end of a word, it is transcribed as "X".
If the C starts "ch" and is followed by a vowel, it is transcribed as "X".
If the C starts "ch" and is followed by a consonant, it is transcribed as "K".
Presented tabularly, it's easier to see how rules can be defined in a readily machine-processable format:
Rule Type     Pattern   Start of Word?   End of Word?   Followed by Vowel?   Followed by Consonant?   Output
Char within   SCE                                                                                     (dropped)
Char within   SCI                                                                                     (dropped)
Char within   SCY                                                                                     (dropped)
Char starts   CIA                                                                                     X
Char starts   CI                                                                                      S
Char starts   CE                                                                                      S
Char starts   CY                                                                                      S
Char starts   SCH                                                                                     K
Char starts   CH        TRUE                                                                          K
Char starts   CH                         TRUE                                                         X
Char starts   CH                                        TRUE                                          X
Char starts   CH                                                              TRUE                    K
Note
The rules listed above are included to expand on the discussion of data-driven
programming. I may have extracted the rules with some flaws and have not double-checked them through code.
Using a table of rules like these, a rule-processing engine can scan through a block of
text from start to finish without any need for special cases or custom logic embedded
in the code itself. In a full implementation, additional rule types are required by some
algorithms to, for example, pre-transform specific patterns before scanning through
the source string.
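To make the idea concrete, here is a sketch of such an engine in 4D code. The [PhoneticRule] table, its fields, and the matching details are hypothetical illustrations rather than part of FuzzyTools, and a production version would load the rules into arrays instead of walking the record selection for every character:

  ` Data-driven rule engine sketch
  ` $1 = source string; $0 = phonetic code built from the [PhoneticRule] records
C_TEXT($0;$1;$source_t;$code_t;$fragment_t)
C_LONGINT($index_l)
C_BOOLEAN($matched_b)

$source_t:=Uppercase($1)
$code_t:=""
$index_l:=1

ALL RECORDS([PhoneticRule])
ORDER BY([PhoneticRule];[PhoneticRule]Priority;>)  ` most specific patterns first

While ($index_l<=Length($source_t))
   $matched_b:=False
   FIRST RECORD([PhoneticRule])
   While (Not($matched_b) & Not(End selection([PhoneticRule])))
      $fragment_t:=Substring($source_t;$index_l;Length([PhoneticRule]Pattern))
      If ($fragment_t=[PhoneticRule]Pattern)
         ` Context tests (start of word, end of word, following vowel or consonant) would go here
         $code_t:=$code_t+[PhoneticRule]Output
         $index_l:=$index_l+Length([PhoneticRule]Pattern)
         $matched_b:=True
      Else
         NEXT RECORD([PhoneticRule])
      End if
   End while
   If (Not($matched_b))
      $index_l:=$index_l+1  ` no rule applies; move past the character
   End if
End while

$0:=$code_t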
There are numerous advantages to using a table-driven strategy:
Since the rules are defined as data, they can be read from multiple sources. For
example, you can add a routine to read through the rules and automatically produce
human-readable descriptions or write a routine that generates various test patterns
to exercise the rules.
The rule-sets are no longer buried in the code, so the source code is dramatically
reduced in size.
Adding support for a new accent or language doesn't require new coding, just a new
rule set. If the rules are defined in records, the database won't even need to be
recompiled. The exception to the "no recoding" rule is if a particular rule-set
requires a new rule type, such as "replace the current character if it is equal to a
specific value and is exactly 3 characters from the front of the word." It is certainly
the case that certain writing systems and spoken languages would require rule types
not shown in the small table above. For example, properly transcribing German
presents special challenges because of the way word spellings are modified when
words are compounded.
No knowledge of programming is required for defining rules. A linguist can help
develop rules that are then fed into the rule processing engine.
Translating the system between computer languages is now much simplified. Only
the rule-processing engine and rule-storage system need to be adapted.
Like any data-driven system, if there's a bug in the processing engine, you only
need to fix it once to fix the system. On the other hand, when there are bugs in the
rule-set, they don't hurt the engine or stop the engine from processing properly
formulated rule-sets correctly.
Storing rules as data makes it possible to switch rules on-the-fly based on other
inputs. It's fair to think of rule sets as parameters in this scenario. For example, a
Web-based name database could allow users to identify their location, or guess it
from their IP address, and use that as a basis for selecting the default accent rule-set to apply. Instead of switching algorithms, the code simply switches rule sets. It's
easy to imagine loading a special rule-set for an Italian speaker using an English-language data source.
Overview
Approximate string comparison algorithms are another fundamental approach to fuzzy
matching. Instead of converting the strings into another form, as the phonetic
transcription algorithms do, strings are compared directly to calculate their relative
similarity. The degree of similarity is expressed as a percentage, count, or length,
depending on algorithm. These measurements of similarity/difference are often called
distances. There are some major advantages to string distance measurement tools:
Strings are compared directly and completely. This is quite different from phonetic
algorithms where some of the original data is lost or transformed in the process.
The distance calculating methods do an excellent job of finding true positives and
avoiding false negatives under a wide range of conditions.
The numeric distance measures enable a developer to rank possibilities, sometimes
quite exactly. This is exactly how, for example, most spell-checkers order
suggestions. The most similar words are proposed first according to their relative
similarity.
These algorithms are not tied to data in English or any other language. In fact,
approximate string comparison algorithms are used to compare DNA sequences,
which have only four letters in their alphabet, although with very long words.
The main disadvantage of approximate string comparison techniques is that the
comparisons are between any two strings. If you want to compare a record with every
other record in a table, that's a lot of comparisons:
1 * (records in table - 1)
1 is subtracted above because it is not necessary to compare a record with itself.
Under the right circumstances, this approach may be fast enough to use on-the-fly
during data entry. If, however, an entire table needs to be scanned for duplicates,
there are a lot more comparisons, roughly:
(records in table * records in table) / 2
If you have 10,000 records to compare, that's 50,000,000 comparisons. (The actual
figure is 49,995,000. We'll look at the details on this figure later.) There are ways to
intelligently reduce the number of records that require testing for duplicates, of
course, but direct sequential comparison of an entire table is obviously slower than
searching on an indexed value, such as a stored phonetic key.
Note
All of the weighted columns are calculated on-the-fly when you open the record. Try
moving between records and see if there is any delay. On modest modern equipment,
I can't detect one. Note that this form performs at least five indexed queries in the On
Load phase to load the values onto page 1, apart from the distance calculations. On a
small scale, the distance calculation routines are instantaneous. To see how they work
in the context of a full table search, turn to page 3, pictured below:
This screen lets you interactively test searching for related words based on a weighted
threshold. In the example above, the system searched for words that are at least 80%
similar to the starting word Zysko. (We'll look at what distance values mean
shortly.) This search is not instantaneous, but it can compare over 15,000 records in
less than 2 seconds under 4D Server. These tests were run on modest contemporary
equipment in compiled mode. Finding by raw edit distance can be quite a bit faster
than the find by weight algorithms available on this screen. It's well worth importing
some of your own data into the test database in a fresh data file and seeing how it
performs. Now, let's look at the individual distance measurements in more detail.
Note
The Find By All Methods button runs all six comparisons at once, taking several times
longer than testing any particular method.
Method        Notes/Source
Jaro          The most basic algorithm.
Winkler
McLaughlin
Lynch
Edit Distance
One way to quantify the difference between two strings is by counting how many
differences they have, a measurement commonly called an edit distance. Edit
distances are a primary tool in spell-checkers and are sometimes used in database
comparisons. The Fuzzy_GetEditDistanceCount routine implements one of the best-known of these algorithms, called the "Levenshtein distance". The name comes from
Vladimir Levenshtein, the scientist who first developed this system in 1965. This
algorithm returns a count of how many additions, deletions and substitutions are
required to transform one string into another. The more differences there are between
the strings, the more steps are required to make them identical and, therefore, the
higher the distance count. Identical strings require no transformations and, therefore,
return a count of 0. For an example of two unlike strings, the edit distance between
"kitten" and "sitting" is 3:
0   kitten
1   sitten
2   sittin
3   sitting
As this example illustrates, the two strings don't need to be of the same length to be
compared. Many papers and articles on fuzzy matching include variations on the
original Levenshtein algorithm. For example, different transformations can be given
different costs, or certain substitutions can be given a discounted cost to, for example,
allow for common letter transpositions. In FuzzyTools, I stuck with the original
Levenshtein approach since it is simple to understand and code, fast, effective, and
blind to the language a string contains. Also, I couldn't find any empirical evidence
that the modified edit distance routines are better, generally, than the original in real-world tests. If you are interested in reading more about the Levenshtein distance, visit
the helpful Wikipedia entry listed below:
http://en.wikipedia.org/wiki/Levenshtein_distance
The Fuzzy_GetEditDistanceCount routine is a straight translation of the C code by
Michael Gilleland found at:
http://www.merriampark.com/ld.htm
A 4th Dimension developer won't need to understand the Levenshtein distance
algorithm or the 4th Dimension code used to implement it, but only to learn how to
call the Fuzzy_GetEditDistanceCount routine, as illustrated in the example below:
C_LONGINT($distance_count)
$distance_count:=Fuzzy_GetEditDistanceCount ("massey";"massie")
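For readers who want to see the shape of the calculation, here is a minimal two-row sketch of the Levenshtein computation in 4D code. It is only an illustration, not the FuzzyTools routine translated from the C code mentioned above, and it inherits 4D's default case-insensitive string comparison:

  ` Levenshtein distance sketch
  ` $1, $2 = strings to compare; $0 = number of edits needed to turn $1 into $2
C_TEXT($1;$2;$string1_t;$string2_t)
C_LONGINT($0;$len1_l;$len2_l;$i;$j;$cost_l;$best_l)
ARRAY LONGINT($previousRow_al;0)
ARRAY LONGINT($currentRow_al;0)

$string1_t:=$1
$string2_t:=$2
$len1_l:=Length($string1_t)
$len2_l:=Length($string2_t)
ARRAY LONGINT($previousRow_al;$len2_l+1)
ARRAY LONGINT($currentRow_al;$len2_l+1)

For ($j;0;$len2_l)  ` distance from an empty string is $j insertions
   $previousRow_al{$j+1}:=$j
End for

For ($i;1;$len1_l)
   $currentRow_al{1}:=$i  ` comparing $i characters of $1 with an empty prefix of $2
   For ($j;1;$len2_l)
      If ($string1_t[[$i]]=$string2_t[[$j]])
         $cost_l:=0
      Else
         $cost_l:=1
      End if
      $best_l:=$previousRow_al{$j}+$cost_l  ` substitution (or exact match)
      If (($currentRow_al{$j}+1)<$best_l)
         $best_l:=$currentRow_al{$j}+1  ` insertion
      End if
      If (($previousRow_al{$j+1}+1)<$best_l)
         $best_l:=$previousRow_al{$j+1}+1  ` deletion
      End if
      $currentRow_al{$j+1}:=$best_l
   End for
   COPY ARRAY($currentRow_al;$previousRow_al)
End for

$0:=$previousRow_al{$len2_l+1}  ` the final cell of the edit matrix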
If you were finding the longest matching substring, you would end up with the result
highlighted below:
John Anderson
Jon Anderrsen
The substring above is 5 characters long (Ander) out of 13 in the original strings.
Converted to a percentage, that's a similarity of a bit over 38%. Just from looking at
the strings, the score seems too low. The LCS algorithm, implemented in FuzzyTools's
Fuzzy_GetLCSLength routine, finds the longest common subsequence instead of the
longest substring. The difference is that a subsequence ignores non-matching
intervening characters. So, in comparing "Jon Anderrsen" to "John Anderson", the
pattern highlighted below registers as a match (the space character matches but can't
be highlighted):
John Anderson
Jon Anderrsen
Notice that Jon and John match because the characters J-o-n appear, in order, within
J-o-h-n. The h in John is simply ignored as a junk character. The longest matching
subsequence, then, is 11 characters long (Jon Andersn), giving us a similarity
percentage of roughly 85%. This score agrees much more closely with how a human
would rank the two strings.
Note
The two strings in the example above are the same length only to make the example
easier to follow. In practice, you may compare strings of different lengths with all of
the distance comparison algorithms in FuzzyTools.
The code for Fuzzy_GetLCSLength is based on notes and code found at the location
below:
http://www.ics.uci.edu/~eppstein/161/960229.html
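Calling the routine mirrors the edit distance example shown earlier; the parameter order here is an assumption based on that example rather than documented behavior:

C_LONGINT($lcs_length)
$lcs_length:=Fuzzy_GetLCSLength ("John Anderson";"Jon Anderrsen")  ` 11, the length of "Jon Andersn"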
Note

Word        Score   %       Notes
Adkinson    1.000   100.0
Adkins      0.956   95.6
Atkinson    0.948   94.8    Atkinson deserves a high score because it differs from the original by only one character: Adkinson.
Atchinson   0.882   88.2
Apple       0.556   55.6
Blueberry   0.070   7.0     If anything, this weighted score is too high.
The string comparison algorithms implemented here are not biased towards English
and should work well with any language. However, they may be biased towards left-to-right word order and may not prove as accurate with right-to-left word order.
Note that the various statistical algorithms implemented in
Fuzzy_GetDistancePercentage are not as simple as the edit distance and LCS
algorithms. Internally, the statistical techniques give preferential weighting to various
factors, such as similarities nearer to the front of the word. Therefore, there is a high
but imperfect correlation between edit distance scores and weighted distance scores.
As an example, the Lynch algorithm considers Adkins more similar to Adkinson
than to Atkinson, while the edit distance algorithm ranks them in the opposite order.
It also makes sense that differences in edit distance counts and LCS lengths are likely
to be more meaningful when comparing longer strings.
Strategy                     Method                         Output              Identical
Statistical Similarity       Fuzzy_GetDistancePercentage    Real from 0 to 1    1
Edit Distance                Fuzzy_GetEditDistanceCount     Longint count       0
Longest Common Subsequence   Fuzzy_GetLCSLength             Longint length      String length
The scales differ: Fuzzy_GetEditDistanceCount returns a longint count,
Fuzzy_GetLCSLength returns a longint length, and Fuzzy_GetDistancePercentage
returns a real percentage. Each of these approaches makes sense for its respective
algorithm, but they cause confusion when working with scores. To simplify the
system, the Fuzzy_GetDistancePercentage routine can produce any of the six possible
scores (Jaro, Winkler, McLaughlin, Lynch, Edit, and LCS) as a percentage. Internally,
raw edit distance and LCS scores are converted into a percentage to make them
comparable with the results from the statistical functions. This feature makes it a lot
easier to compare the different tools and use them together, and is particularly handy
when calling Fuzzy_FindByDistancePercentage, which always expects a percentage.
You still have access to raw edit distance and LCS scores and the routines to convert
them to percentages, if you prefer.
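As a sketch of what a call might look like, the following asks for the Lynch weighting of two of the sample surnames. The selector value and the parameter order are assumptions for illustration; check the component's method comments for the actual signature:

C_REAL($similarity_r)
$similarity_r:=Fuzzy_GetDistancePercentage ("Zysko";"Zyki";"Lynch")  ` would return roughly 86.2 for the Lynch weighting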
Exposing only the raw scores would have been more confusing to use because statistical, edit distance, and LCS scores don't
use a common number line or scale. To simplify dealing with the three different
systems, the Fuzzy_GetDistancePercentage routine can produce any of the six scores,
adjusted to a percentage. Therefore, you don't have to deal with the differences in the
scoring systems, if you don't want to. Likewise, Fuzzy_FindByDistancePercentage can
use any of the six algorithms starting from a threshold value expressed as a
percentage. The raw edit distance and LCS score-based systems are included, if you
have an application for the raw scores. Finding by edit distance scores can, for
example, be substantially faster than by the statistical methods.
For example, here are the words that several of the algorithms match for "Zysko" at a common threshold (match counts in parentheses):

Lynch (5):       Zysko, Zksko, Zyki, Risko, Sykor
McLaughlin (3):  Zysko, Zksko, Zyki
Winkler (3):     Zysko, Zksko, Zyki
Edit (2):        Zysko, Zksko
LCS (2):         Zysko, Zksko
As you can see from this sample, you can get very different results from these
algorithms. Using the six algorithms with the threshold specified matches six unique
words. Notice that no one algorithm matched all six words. The thresholds selected
also make a huge difference in the number and quality of hits found. A different
example from the surnames samples helps illustrate this. Starting from the name
Abate, below are the number of matching surnames out of 15,557 unique
possibilities based on various absolute edit distances:
Edit Distance   Matching Names
0               1
1               1
2               19
3               331
4               2,500
Depending on your data, you may find that one approach works far better than
another or, more likely, that using a combination of approaches delivers the best
results. Experimentation is the best way to find out. Still, there are some common
strategies to consider, two of which are sketched below:
For time-sensitive situations, such as a quick lookup on names, use the Metaphone
phonetic matching algorithm to find likely matches and then use the edit distance or
Lynch weighting algorithms to sort the possible matches (see the sketch after this list).
For more time-consuming full-record comparisons, calculate more than one distance
measurement. For example, you can use the edit distance with a low difference
score to generate a strong list of possibilities quickly, and then refine the list by
subselection based on Lynch scores.
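As a rough illustration of the first strategy, the following 4D sketch finds candidate records through an indexed phonetic key and then sorts them by raw edit distance. The [People] field names and the Fuzzy_GetMetaphoneKey routine are hypothetical placeholders; substitute the component's actual Metaphone routine and your own stored-key field:

  ` Phonetic match first, then rank by edit distance
C_TEXT($searchName_t;$key_t)
C_LONGINT($i)
ARRAY TEXT($names_at;0)
ARRAY LONGINT($distances_al;0)

$searchName_t:="Phillips"
$key_t:=Fuzzy_GetMetaphoneKey ($searchName_t)  ` hypothetical routine name

QUERY([People];[People]LastName_Metaphone=$key_t)  ` indexed search on the stored key
SELECTION TO ARRAY([People]LastName;$names_at)

ARRAY LONGINT($distances_al;Size of array($names_at))
For ($i;1;Size of array($names_at))
   $distances_al{$i}:=Fuzzy_GetEditDistanceCount ($searchName_t;$names_at{$i})
End for
SORT ARRAY($distances_al;$names_at;>)  ` closest candidates first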
------------------------------------------------------------------------------------------------------------------------------------------------
One of the most powerful applications of fuzzy matching is linking records from
different databases or identifying duplicate records within a database. The sample database's
Duplicate Report feature, available in the Show People screen, implements an
example for your review. This example and fuzzy duplicate matching are discussed in
more detail in Technical Note 06-20 Data Cleaning and Deduplication.
Fuzzy matching is a large and intensively researched subject. As the volume of data in
databases, the Web, and other repositories increases, so too expands the need for
better and faster fuzzy matching techniques. The FuzzyTools component implements
some of the best-known and straightforward algorithms, but there are many more to
be considered. If you find these tools inadequate, you should investigate other
approaches, a few of which are sketched next.
N-Gram Comparison
An n-gram analysis breaks strings into short, overlapping letter sequences and compares those sets, which makes it less
sensitive to word order than the distance measures provided in FuzzyTools. N-grams
would, for example, do a better job of recognizing the similarity of the strings TF
Flannery and Flannery, TF. And, as noted above, an n-gram analysis would do a
better job of recognizing similarities in a concatenated value than any of the distance
measures implemented here.
WordList Comparison
Word lists offer a very simple form of full-record or full-text comparison. The sample
database includes a couple of routines not found in the FuzzyTools component that
show how this process works. In practice, you combine the values from some target
fields, such as name and address, and then extract the unique list of words from
them. So:
David Adams
711 Medford Center
Gives you the word list
711
Adams
Center
David
Medford
You can then use this word list to compare records. For example, a record with the
same address but with "Center" abbreviated as "Ctr" appears similar:
711        Match
Adams      Match
Center     No match
Ctr        No match
David      Match
Medford    Match
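The comparison step itself is simple to sketch in 4D code. Given two arrays of unique words, one per record, the following counts how many words from the first record also appear in the second; the method design is illustrative and is not one of the sample database's routines:

  ` $1, $2 = pointers to text arrays of unique words; $0 = fraction of $1's words found in $2
C_POINTER($1;$2)
C_REAL($0)
C_LONGINT($i;$matches_l)

$matches_l:=0
For ($i;1;Size of array($1->))
   If (Find in array($2->;$1->{$i})>0)
      $matches_l:=$matches_l+1
   End if
End for

If (Size of array($1->)>0)
   $0:=$matches_l/Size of array($1->)
Else
   $0:=0
End if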
Overview
While the FuzzyTools component is a complete toolkit, it may not meet all of your
needs exactly. For instance, you may add a new edit distance algorithm, modify how
parameters are structured internally, remove algorithms you don't require, or rewrite
the error strings. If you do change the source, this section includes a few notes and
suggestions that may be helpful.
This contract isn't complicated, but it's quite powerful. The private routines don't need
to do any error testing on inputs, ever. Because the gateway/dispatching/interface
routine that calls them promises to provide good inputs, the internal routines can
assume the inputs are good. This approach shifts the burden of error testing to one
place and leaves the internal routines free to do their job: produce a phonetic key.
Otherwise, you end up either not testing inputs properly or adding a lot of
complexity to each internal routine. If you modify the source code, I'd strongly
recommend leaving the current code structure and its implicit contracts in place.
Soundex_Expected   Soundex_Returned   Error?
E460               E460
E460               E460
G200               G200
G200               G200
H416               H416
H416               H416
K530               K530
K530               K530
L300               L300
L222               L222
L300               L300
L222               L222
W200               W200
If you modify the underlying source code of any of the algorithms, it is very handy to
run the various reports and check that nothing has broken. To make your life easier,
the source code database includes two test screens in the Runtime environment that
provide an interface for the test routines. The test screens are pictured below:
Do not assume that these routines, particularly the distance
algorithms, can't be used for larger blocks of data. In fact, the distance algorithms are
used to compare protein strings, documents, and other very long sequences. If you
decide to adapt the existing code to handle larger strings, please also review Technical
Note 05-42 Scanning Text and BLOBs Efficiently.
Summary
------------------------------------------------------------------------------------------------------------------------------------------------
There is a wide range of tools available for fuzzy matching, including the seven
phonetic and six approximate string-comparison algorithms implemented in
FuzzyTools. These tools are particularly powerful when combined. Developers can use
FuzzyTools to help improve lookups when exact spellings aren't known and to identify
possible duplicate values or records. As a general rule, the Metaphone phonetic
algorithm is the best provided for English names and the edit distance algorithm is
fast at approximate string matching. The sample database gives you a platform for
testing your own data.