
Fuzzy Matching in 4th Dimension

By David Adams
Technical Note 06-18

Overview
------------------------------------------------------------------------------------------------------------------------------------------------

This work focuses on the problem of string matching that allows errors, also called approximate string matching. The general goal is to perform string matching of a pattern in a text where one or both of them have suffered some kind of (undesirable) corruption. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors.

A Guided Tour to Approximate String Matching
Gonzalo Navarro
People and computers look at information differently. For example, it's hard for a
person to sort 100,000 values correctly but trivial for a computer. On the other hand,
it's easy for a person to see that the following nine names are similar but hard for a
computer to recognize their similarities:
Andre Wallace
Andre Wallac
Andre Walace

Andrew Wallace
Andrew Wallac
Andrew Walace

Andrw Wallace
Andrw Wallac
Andrw Walace

Sources of Error
Software is great at matching identical values but not usually so good at matching
similar values. Identical real-world items can end up represented in multiple, slightly
different, records very easily. The causes of error are commonplace and numerous,
including:
Mechanical data entry errors, such as typos and transcription mistakes.
Misspellings, a particular problem with proper names as data entry and lookup are
often based on what a name sounds like rather than how it is written.
Data which can be legitimately or plausibly spelled or abbreviated in more than one
way, such as the components of names and addresses.
Values which do not have a universally consistent format, such as dates.
Flawed data imported from another database or OCR (Optical Character
Recognition) system.

Flawed data entry from Web queries or data submissions. This is an increasingly
significant source of data quality problems since Web-based systems are potentially
available to a global audience with a wide range of language and typing skills.
Names or other values that have changed spellings over time, such as surnames
that have been modified over several generations or by transportation to a new
country.
Names or other values that vary in spelling over distance, such as the regional
variations in many words, names, and abbreviations.
Whatever the original source of the error, inconsistency, or ambiguity, matching
values or full records based on similarity is a challenge for many classes of software,
including databases, data warehouses, and spell-checkers. Fortunately for us as
4th Dimension developers, fuzzy matching is a problem that well-funded groups have
had to deal with. Consequently, several decades of excellent computer science have
gone into developing, testing, and refining effective fuzzy matching algorithms. This
technical note explains two well-established approaches to fuzzy matching: phonetic
transcription and string distance measurements. The phonetic algorithms are biased
towards English while the distance algorithms should work well for comparison of
strings of any kind. This technical note provides detailed information on the
algorithms, their strengths, limitations, and appropriate uses.

Who Cares About Fuzzy Matching?


------------------------------------------------------------------------------------------------------------------------------------------------

Before looking into solutions in detail, let's review some scenarios that illustrate cases
when fuzzy matching is desirable or necessary.
A help desk in Bangalore takes support calls for registered users of a popular
software package. While customers are supposed to have a registration number
when they call, the support staff is trained not to refuse service if a valid customer
record can be found. Unfortunately, the support staff often fails to find valid records
because the exact spellings of last names and formatting of addresses in the
database is inconsistent. When this happens, no service is provided, leaving
legitimate customers frustrated and upset with the company. In this case, phonetic
and distance-based matching increase the chances of matching a caller to an
existing registration.
A local regional government is undertaking a project to increase voter participation
in their area. The government has records of voter registrations, voter participation,
land titles, and waste disposal fees for the past twenty years. They need to link and
consolidate these sources of data to get a realistic baseline of the existing
population's voting behavior. In this case, multiple forms of fuzzy matching enable
the council to substantially automate the linkage process.
At a busy regional hospital emergency room, the intake staff try to match incoming
patients to existing records. It is important to find matches because, otherwise, any
existing medical history on file will not be accessible to the doctor assessing the
patient. This can make a life-or-death difference if, for example, the patient has a
drug allergy. Unfortunately, the atmosphere in the emergency room is always

rushed, so intake staff have about twenty seconds to match a patient before
giving up and creating a new "temporary" record. The staff and hospital
administrators are all aware that many existing patient records are not matched and
turn to fuzzy matching solutions.
After observation and analysis of the database and intake system, it turns out that
fuzzy matching can help at the system level and on the hospital floor. The central
database consolidates records from all of the clinics and hospitals in the area.
Because of data entry errors, inconsistencies in abbreviations, and changes in
patient addresses, many duplicate records exist. Fuzzy matching can help automate
reducing the level of duplication by comparing records for overall similarity based on
name, address, and other key fields. On the hospital floor, intake staff often search
for names based on what they hear rather than an exact spelling. Adding a phonetic
search to the system and showing the matches sorted by similarity to the search
term significantly increases the chances of finding an existing record without slowing
down the intake process.
While the scenarios above are imaginary and probably don't describe any of your
projects precisely, they are typical of how and when fuzzy matching is applied in the
real world. With these stories in mind, or any scenarios from your own work, let's look
at the tools provided with this technical note.

About the Included Materials


------------------------------------------------------------------------------------------------------------------------------------------------

Included with this technical note is a source code database, the FuzzyTools
component, a sample database that uses the component, and a selection of sample
data sets in text files for experimentation. The sample database provided includes a
screen for experimenting with phonetic keys and distance measurements for a
collection of data. Two sample data sets are included with the demonstration, one with
about 15,000 surnames and another with about 5,000 place names. To get a better
sense of how the tools in the component work, create a fresh data file and import
some values from one of your systems. For the import, prepare a text file with two
columns:
[Sample]Word            Alpha 80
[Sample]Data_Set_Name   Alpha 20

The demonstration database and FuzzyTools component are documented in detail in
Technical Note 06-19, The FuzzyTools Component. Now, let's discuss the algorithms
provided in the database. First, we'll consider phonetic matching.

Phonetic Transcription Algorithms


------------------------------------------------------------------------------------------------------------------------------------------------

Sounds-like searches attempt to match words based on how they would be
pronounced rather than how they appear when written down. Sounds-like searches
are useful in databases any time proper names are involved. Names are particularly
problematic because they are a primary identifying attribute and, at the same time,
are often spelled in several ways due to mistakes and historical or regional variation.
Several algorithms have been developed over the years to address this problem.
Soundex is the most universally known phonetic name translation algorithm for
English. Some version of Soundex is found, for example, in every major SQL
database. Soundex was originally developed by the US Census in the 1880s, well
before the advent of computers. Several other, more modern, algorithms offer better
performance and pattern matching success than Soundex. Before reviewing some
options, let's review why sounds-like matching is needed at all.

Spoken Versus Written Words


The way words are written is not always a perfect phonetic representation of how a
word is pronounced. This is more true in some languages than in others. For example,
the spelling of a word in Italian is almost always enough to determine its correct
pronunciation. In contrast, English spelling often has a casual relationship
with actual pronunciation. There are a variety of reasons for this, including a few listed
below:
Many words have changed their pronunciations over time and archaic pronunciations
are still found in modern spellings. For example, the silent k in knight or knife.
Many words include spellings based on transcriptions of sounds from foreign
languages, such as ancient Greek. For example, the f sound in elephant is spelled
ph.
There are strong temporal differences between dialects and accents of English, and
they are significant for any database that incorporates historical data, such as
genealogy databases or many municipal databases used for statistical analysis.
There are strong regional differences between dialects and accents of English. This
is important to any database that incorporates names and other values from
different regions or countries.
Consistent, regularized spelling is a relatively modern innovation in written English
and the spelling system used is not always consistent, even to this day.

Background: Phonemes Versus Letters


Every English speaker knows that how words sound and how they are spelled are not
always the same. While universal phonetic alphabets exist, such as the International
Phonetic Alphabet, these notations are rarely used outside the worlds of linguistics
and anthropology. (And, to be fair, not everyone in those communities agrees
about the way to transcribe various sounds.) The distinction between phonetics and
spelling is rarely taught in primary school, so most English speakers are likely to
believe that the English language has five or six vowels: a, e, i, o, u, and
sometimes y.

English has five or six letters for representing vowels but far more vowel sounds.
Phonetically, vowels are sounds made by opening your mouth and letting the air out
while your vocal cords vibrate. (The tongue is not used in the production of vowels.)

Described differently, vowels are produced without closing the mouth. Consider the
letter o in the words below:
hot

don

poke

While the same vowel letter is used in each word, three very different sounds are
indicated h-AHHH-t, d-AWW-n, p-OWW-ke, at least in my particular American
accent.
To make matters even more confusing, the exact number of vowel sounds (vowels
and diphthongs) depends on accent or dialect. For example, contemporary Australian
English accents generally have about 20 vowel sounds and contemporary American
varieties of English typically have 12-16 vowel sounds. Despite this, their spelling
systems are fundamentally the same. (Their differences, in fact, are almost never in
letters which are pronounced.)
While we've quickly discussed vowels, the same points apply to consonant sounds. For
example, the sound f may be spelled in at least four ways in English:
frank

taffy

philip

rough

What a Phonetic Transcription Algorithm Needs to Do


Spelling systems only exceptionally include a unique alphabetic character for each
sound, even in languages like Italian or Spanish which offer highly consistent
phoneme-to-spelling rules. As mentioned earlier, Italian has 21 letters but far more
sounds. The rules for phonetic transcription depend on the letter, its neighboring
letters, and its position in the word. Letters are often parts of clusters that indicate a
sound, such as ch or sh in English, or glie or gli in Italian. Letter position also
makes a difference, in some cases. For example, the letter k is not pronounced in
English if it is the first letter of the word and followed by n. Any successful phonetic
transcription system needs to identify these sorts of rules and encode them so that
they can be applied to stored strings. Therefore, in English, a phonetic transcription
algorithm should recognize that the following words sound the same:
knight

night

nite

Common and Supported Sounds-Like Algorithms and Strategies


As the preceding discussion suggests, a phonetic transcription algorithm must be
tuned to the rules for a particular language. The rules for German are, for example, a
great deal more involved than the rules for Italian. In reality, different rules can be
developed for different accents or dialects of the same language. If you assembled
native speakers of English raised in Edinburgh Scotland, Sydney Australia, Nairobi
Kenya, Mumbai India, Brooklyn New York, the south end of Boston in Massachusetts,
Oakland California, Kingstown Jamaica, and Honolulu Hawaii, you would quickly notice
that there is no standard pronunciation. Accents are localized both in space and time
(pronunciations change over time in one location).

There are many well-known phonetic transcription algorithms for English, including
Soundex, Metaphone, Double-Metaphone, NYSIIS, Phonix, and Caverphone, amongst
others. All of these algorithms perform the same task but with different letter-to-phoneme
rules. The FuzzyTools system implements four variants of Soundex,
Metaphone, and a tool called Skeleton Key. In practice, Metaphone appears superior
to the other algorithms implemented at dealing with surnames. The other algorithms
are included for comparison purposes and, in the case of Soundex, because the
algorithm is so widely used. You can test your own data in the sample database to
determine how each approach performs. While reading this, you may find it useful to
open the demonstration database and look at some last names or words and see how
they are encoded by each algorithm. Just select Show Words from the Demo menu
and double-click on any word. If you want to try testing out some values of your own,
you can import them. Alternatively, use the Compare Strings demonstration to enter
one or two strings and see how they are encoded. Let's review the algorithms next.

Soundex
Soundex was invented for the US Census in the 1800s to help reduce errors and to
simplify accessing records in the future. Even at that time, the Census was aware that
names were changing and the Census data would be a significant historical
demographic and genealogical data source. Because Soundex was designed as a
manual system, it is quite simple. You can read a short history of Soundex here:
http://www.archives.gov/genealogy/census/soundex.html
The following brief summary of the Soundex encoding rules is adapted from the same
page. Every Soundex code consists of a letter and three numbers, such as W252. The
letter is always the first letter of the surname. The numbers are assigned to the
remaining letters of the surname according to the Soundex guide shown below.
Zeroes are added at the end if necessary to produce a four-character code. Additional
letters are disregarded. Examples:
Washington is coded W252 (W, 2 for the S, 5 for the N, 2 for the G, remaining
letters disregarded).
Lee is coded L000 (L, 000 added).
Number   Represents the Letters
1        B, F, P, V
2        C, G, J, K, Q, S, X, Z
3        D, T
4        L
5        M, N
6        R

Disregard the vowels and semi-vowels A, E, I, O, U, H, W, and Y.


There are several other straightforward transformation rules to address double letters,
adjacent letters with the same Soundex code, prefixes, and consonants separated by
vowels. Since its introduction, many variations and refinements of Soundex have been
developed. Ultimately, they're all more-or-less similar in general behavior and
performance. The FuzzyTools system implements four variations based on code by
Richard Birkby published in A Soundex Implementation in .NET, readable at:
http://www.codeproject.com/csharp/Soundex.asp
These variants are described below, according to Birkby's comments:

Method              Notes
Soundex_Knuth       Produces a four-character Soundex key using code based on
                    Knuth's The Art of Computer Programming.
Soundex_Miracode    Produces a four-character Soundex key based on the manual
                    system used in the 1910 US Census.
Soundex_Simplified  Produces a four-character Soundex key based on the original
                    system developed by the US Census in the 1880s.
Soundex_SQLServer   Produces a four-character Soundex key emulating the behavior
                    of the SOUNDEX function in MS SQL Server.
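To make the rules concrete, here is a minimal sketch of the basic four-character encoding in Python (shown in Python rather than 4D for brevity). It follows one common reading of the rules, close to the Knuth variant, and glosses over the prefix and double-letter refinements discussed above, so it is an illustration rather than a drop-in replacement for any FuzzyTools method:

```python
def soundex(name: str) -> str:
    """Four-character American Soundex code (Knuth-style sketch)."""
    codes = {**dict.fromkeys("BFPV", "1"),
             **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    result = name[0]               # the first letter is always kept
    prev = codes.get(name[0], "")
    for c in name[1:]:
        code = codes.get(c, "")
        if code and code != prev:  # adjacent letters with the same code collapse
            result += code
            if len(result) == 4:
                break
        if c not in "HW":          # H and W do not separate same-coded letters,
            prev = code            # but vowels do (they reset prev to "")
    return result.ljust(4, "0")    # pad with zeroes to four characters
```

With this sketch, the document's own examples come out as expected: Washington encodes to W252 and Lee to L000.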

Skeleton Key
The skeleton key of a word consists of its first letter, followed by the consonants in the
source word in order of appearance, followed by the vowels in the source word in
order of appearance. This key contains every letter from the original string at most
once. As an example, the word Washington is encoded as WSHNGTAIO. W is the
first letter, SHNGT are the remaining consonants, and AIO are the vowels. The
skeleton key system is part of an approach to spell-checking discussed in Automatic
Spelling Correction in Scientific and Scholarly Text by Pollock and Zamora, a
much-cited paper first published in 1984. The rest of the strategy outlined in that
paper is not implemented here. Why is this incomplete adaptation included at all?
When we consider how to pick a good phonetic transcription algorithm, the skeleton
key method provides a helpful point of comparison. This system tends to produce
codes that are close to unique. Therefore, you get very few false positives (words
sharing a code that aren't meaningfully similar) but, consequently, it does little to
accurately match similar values.
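The skeleton key construction is simple enough to sketch directly. This is a Python sketch of the description above; the Pollock and Zamora paper and the FuzzyTools 4D code may differ in details such as the treatment of Y and non-letter characters:

```python
def skeleton_key(word: str) -> str:
    """First letter, then remaining unique consonants, then unique vowels, in order."""
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    seen = {word[0]}           # every letter appears at most once in the key
    consonants, vowels = [], []
    for c in word[1:]:
        if c in seen:
            continue
        seen.add(c)
        (vowels if c in "AEIOU" else consonants).append(c)
    return word[0] + "".join(consonants) + "".join(vowels)
```

As in the example above, Washington encodes to WSHNGTAIO.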

Metaphone
The Metaphone algorithm was originally designed by Lawrence Philips in 1990 to
produce phonetic transcriptions superior to Soundex. While not perfect, Metaphone is
markedly superior to Soundex in the data sets I've tested.
Now, let's look at Metaphone's rules. Metaphone produces a variable-length code
based on an original string. If there is an initial vowel, it is retained. All other vowels
are dropped. All other letters/letter groups are recoded into one of the following
consonant sounds:
B X S K J T F H L M N P R 0 W Y
Note that what could be mistaken for an "oh" is actually a zero, used to stand in for
the English sound "th".

There are a small number of exceptions for word beginnings, summarized in the
table below:

Begins With   Rule                    Example     Transformation
ae            Drop the first letter   Aebersold   ebersold
gn            Drop the first letter   Gnagy       nagy
kn            Drop the first letter   Knuth       nuth
pn            Drop the first letter   Pniewski    niewski
wr            Drop the first letter   Wright      right
x             Change to "s"           Xiaopeng    siaopeng
wh            Change to "w"           Whalen      walen
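The word-beginning exceptions are easy to express in code. The following Python fragment is a hypothetical helper, not the FuzzyTools implementation; it applies just these prefix rules and reproduces the transformations listed above:

```python
def metaphone_prefix(word: str) -> str:
    """Apply only Metaphone's word-beginning exception rules."""
    w = word.lower()
    if w[:2] in ("ae", "gn", "kn", "pn", "wr"):
        return w[1:]            # drop the first letter
    if w.startswith("wh"):
        return "w" + w[2:]      # "wh-" becomes "w-"
    if w.startswith("x"):
        return "s" + w[1:]      # initial "x" becomes "s"
    return w
```

For example, metaphone_prefix("Knuth") returns "nuth" and metaphone_prefix("Whalen") returns "walen", matching the Transformation column.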

Next are the standard transformations, detailed in the table below:


In   Out   Rules/Notes
B    B     Unless at the end of word after "m", as in "dumb", "McComb".
C    X     If "-cia-" or "-ch-".
     S     If "-ci-", "-ce-", or "-cy-".
           Silent if "-sci-", "-sce-", or "-scy-".
     K     Otherwise, including in "-sch-".
D    J     If in "-dge-", "-dgy-", or "-dgi-".
     T     Otherwise.
F    F
G          Silent if in "-gh-" and not at end or before a vowel, in "-gn" or
           "-gned", or in "-dge-" etc., as in the above rule.
     J     If before "i", or "e", or "y" if not double "gg".
     K     Otherwise.
H          Silent if after vowel and no vowel follows, or after "-ch-", "-sh-",
           "-ph-", "-th-", "-gh-".
     H     Otherwise.
J    J
K          Silent if after "c".
     K     Otherwise.
L    L
M    M
N    N
P    F     If before "h".
     P     Otherwise.
Q    K
R    R
S    X     (sh) if before "h" or in "-sio-" or "-sia-".
     S     Otherwise.
T    X     (sh) if "-tia-" or "-tio-".
     0     (th) if before "h".
           Silent if in "-tch-".
     T     Otherwise.
V    F
W          Silent if not followed by a vowel.
     W     If followed by a vowel.
X    KS
Y          Silent if not followed by a vowel.
     Y     If followed by a vowel.
Z    S

As you can see, there are a lot of rules to Metaphone. The FuzzyTools Metaphone
implementation is based on the Metaphone.java source code file found in the Apache
Jakarta Project Codec source, available here:
http://jakarta.apache.org/site/downloads/downloads_commons-codec.cgi
Since its introduction, various refinements of Metaphone have been advanced, most
notably Philips' Double-Metaphone algorithm, released in 2000 and readable here:
http://www.cuj.com/documents/s=8038/cuj0006philips/
The commented C++ source code for Double Metaphone runs to over 850 lines and is
not implemented in FuzzyTools. Ultimately, ad-hoc algorithms such as Metaphone,
NYSIIS, Soundex, and so on reach a dead end. Each time a new transcription rule or
exception is recognized, the code has to be rewritten. We'll discuss more flexible
rule-driven strategies a bit further on. Before that, however, let's address the most
immediate and practical question: how to pick an algorithm to use with your data.

Picking a Phonetic Algorithm


For any particular word or data set, you may find one algorithm better than another at
matching likely duplicates. You may find cases where one variation of Soundex works
better than another, where one version of Soundex works better than Metaphone, or
where Metaphone works better than Soundex. Overall, Metaphone seems to be the
best phonetic transcription algorithm for English names implemented here, and one of
the best general-purpose algorithms available. The FuzzyTools system supports
building Metaphone keys of up to 4 or up to 6 characters. You may find better results
with one or the other, depending on your data. But how do you determine for yourself
which algorithm is best in your case? Test it. The sample database was built largely to
provide a platform for testing your own data. Just import a text file with strings and a
common data set name into the [Sample] table and the system prepares all of the
keys needed for testing. The fields to import into are defined below:
[Sample]Word            Alpha 80
[Sample]Data_Set_Name   Alpha 20

Comparing the Available Algorithms


Once you have your own data, or some of the sample data provided, in the example
database, you can start to compare the various algorithms using the Show Words
demonstration. When you start this demonstration, a new process automatically starts
showing the sample strings stored in the database. You can query this data to reduce
the selection or quickly select only the members of one data set. The sample data
input form includes three pages that provide tools for testing or analyzing the
algorithms. The first page, pictured below, compares how the available phonetic
algorithms perform. In this case, the base word used for all comparisons is the last
name Zysko.

Tip

When working with the data, it is best to run the database compiled for better
performance.


The data from the screen above is repeated below for legibility. (Metaphone4 and
Metaphone6 produce the same results in this case, so they are only shown once
below.)

Method        Code   Matches   Matched Words
Knuth         Z000   3         Zksko, Zyki, Zysko
Miracode      Z000   3         Zksko, Zyki, Zysko
Simple        Z200   5         Zackay, Ziak, Zug, Zyki, Zysko
SQLServer     Z200   6         Zackay, Ziak, Zksko, Zug, Zyki, Zysko
Skeleton Key  ZYSK   1         Zysko
Metaphone     SSK    8         Cisco, Sasaki, Seesock, Sisco, Siscoe, Sisk, Suzuki, Zysko

Several points should be clear from this example:


Every approach succeeds in matching the base word Zysko to itself, as you would
expect.
The different approaches produce very different results. Even the four Soundex
variants show markedly different results.
Notice that the Simple and SQLServer variants both produce the code Z200 but do
not match the same words. The SQLServer version of Soundex produces the same
code for Zysko and Zksko while the simple version does not. This is a key detail
to observe if you design statistical measures of the algorithms. You should not only
measure the distinct keys produced but also the number of values clustered on each
key.
Skeleton key isn't terribly useful.
Metaphone does a much better job of finding likely matches. For example, none of
the Soundex variants find Cisco, Sisco, or Siscoe, names potentially sounding
very much like Zysko.
No one algorithm found all of the match words.
It's well worth considering a wide variety of examples as it's easy to over-generalize
conclusions based on an individual example. For example, in the surnames data set in
the sample database, the name Kammerdiener suggests that Metaphone6 is better
than Metaphone4 while Kalberg suggests the opposite. Let's consider for a moment
false positives and false negatives.

False Positives and False Negatives


A perfect phonetic matching system finds all strings that might reasonably be expected
to be pronounced identically and doesn't find any strings that would not be pronounced
similarly. None of the algorithms implemented here are perfect. All of them match
false positives (words that don't sound alike) and miss true positives. Even without
being perfect, a fuzzy phonetic match can be enormously helpful. As much as the tone


of the comments about Soundex has been disparaging, it can still be a help in the real
world. A harder problem to contend with is false negatives (missed positives). We'll
consider this point again later when comparing the results of using phonetic and
distance-matching techniques to locate matches. Briefly, however, you should know
that weighted distance measurements find far more true positives but with a higher
runtime speed cost than phonetic matching against stored phonetic codes.

Combining Techniques
Two imperfect fuzzy matching systems, when combined, can produce better results
than either one alone. For example, you could use Metaphone4 to match words and
then rank the results to give extra weight to words also matched by Soundex. A
common and very powerful way of combining techniques is to start with a phonetic
match and then refine or rank the results using one of the string-comparison
algorithms discussed later. You can see this idea in action on the second screen of the
sample word input form, pictured below:

This page presents an alphabetic list of all of the words found by any of the phonetic
matching algorithms. Phonetic matching columns marked with an x indicate the
specific algorithm matched the word in that row. In the example above, "Cisco" was
only matched by Metaphone4 and Metaphone6 while "Zackay" was matched by the
Simple and SQLServer variants of Soundex.


On the right side of the screen six different measurements are listed, derived from the
string distance algorithms discussed later. Note that the Jaro, Lynch, McLaughlin, and
Winkler algorithms return real numbers on the scale 0=unlike strings and 1=identical
strings. Therefore, smaller values signify less similarity. The "Edit" (edit distance)
algorithm returns an absolute count of the number of changes required to transform
one string into another. Therefore, the score increases as strings are less similar. The
LCS (Longest Common Subsequence) score is the length of the longest shared
subsequence within the two strings, ignoring non-matching characters. Therefore,
scores increase as the strings are more similar. If you don't want to worry about the
differences between these scoring systems, tick the Normalize number scales check-box.
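Both of the count-based scores are classic dynamic-programming measures. The Python sketches below only illustrate the definitions; the FuzzyTools 4D methods are the reference implementations:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, edit_distance("Zysko", "Zksko") is 1 and lcs_length("Zysko", "Zksko") is 4, matching the Zksko row of the score table.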

We'll return to these weighting schemes later, but, for now, notice that the matched
words can be ranked based on their calculated similarity. For example, the values
below are ranked by their raw edit distance scores and Lynch weights compared
with Zysko. Raw scores and percentages are shown for comparison.
Sample Word   Raw Scores             Percentages
              Edit   LCS   Lynch     Edit    LCS     Lynch
Zysko         0      5     1.000     100.0   100.0   100.0
Zksko         1      4     0.886      80.0    80.0    88.6
Zyki          2      3     0.862      60.0    60.0    82.6
Sisk          3      2     0.723      40.0    40.0    72.3
Sisco         3      2     0.720      40.0    40.0    72.0
Ziak          3      2     0.710      40.0    40.0    71.0
Siscoe        4      2     0.687      33.0    33.0    68.7
Suzuki        5      2     0.683      16.0    33.0    68.3
Cisco         3      2     0.680      40.0    40.0    68.0
Zackay        4      2     0.653      33.0    33.0    65.3
Sasaki        4      2     0.651      33.0    33.0    65.1
Seesock       6      2     0.630      14.2    28.5    63.0
Zug           4      1     0.560      20.0    20.0    56.0

As you can see, the higher-quality match suggestions are ranked similarly by the
different similarity calculating algorithms. The rankings are not, however, identical.
You can combine the findings of multiple distance calculation schemes if you wish to
attempt to automatically refine a list of matches more precisely.
Apart from ranking suggestions, it is possible to filter results to avoid too many false
positives. In the case above, filtering out matches with an edit distance over 3
reduces the list of possible matches to a fairly reasonable set of candidates:
Zysko
Zksko
Zyki
Cisco
Sisco


Sisk
Ziak
Note

The edit distance algorithm implemented here is called the Levenshtein distance, the
same algorithm used by a wide range of spell-checkers to rank word suggestions.
Limiting the results to matches with a Lynch weight of .70 (70% similarity) or higher
produces a nearly identical list:
Zysko
Zksko
Zyki
Sisk
Sisco
Ziak
While still imperfect, it is often better to reduce the possibilities to a reasonable
number when presenting choices to a user or automatically performing duplicate-checking
scans. We'll look at how to construct duplicate-checking scans and the
distance-calculating algorithms in more detail below. First, let's follow up on a topic I
mentioned: rule-driven phonetic transcription algorithms.

How Many Phonetic Translation Algorithms Are There?


------------------------------------------------------------------------------------------------------------------------------------------------

Overview
It's easy to believe that there are dozens of different phonetic translation algorithms
available, when you consider the many variants of the basic algorithms outlined
already. In fact, I think it's more reasonable to say that there is really only one
algorithm implemented with different hard-coded rule sets. The algorithm looks like
this in pseudo-code:
Scan through a block of text from start to end
    Preprocess the text to remove unwanted characters, normalize case,
        or manage special exceptions
    For (Each character in the text)
        Test each character using some combination of attributes from the set:
            character value
            character is a vowel
            character is the first character in the source word
            character is within a certain distance of the end of the source word
            character is preceded by one of a set of defined characters
            character is followed by one of a set of defined characters or strings
        Depending on the attributes considered and their values,
            the original character is retained, replaced, or discarded
    End for
    Post-process the text to pad or trim the result string

Primarily, the rules depend on each character's value, position, and neighbors. Why,
then, are there so many different algorithms? A related question is why are there so
many algorithms for phonetic transcription of a single language? The answer to both
questions is the same: these algorithms have been developed in an ad-hoc manner to
capture the rules of a specific accent without significant assistance from linguists. In
fact, linguists consider phonetics to be a rule-driven system. It only makes sense to
solve the problem with a rule set. This approach is a perfect example of what is often
called table-driven or data-driven programming.

Keep Rules Out of Code!


The common approach of embedding the text transformation rules in code has
numerous drawbacks:
The rules are difficult to extract, document, or read.
Each time you find a new rule or exception, the method needs to be updated.
As more rules are added, the code increases in complexity.
The basic transformation task becomes bound up in a specific set of rules. Such
rules are, necessarily, limited to an assumed way of pronouncing words (or two
ways, in the case of Double Metaphone). This makes the code effectively worthless
if you're dealing with an accent with a substantially different phonetic profile.
The code can't be readily adapted to another language.
The code has to be thrown away if the rules change. The Caverphone algorithm, for
example, was developed because none of the existing systems was rule-driven.
Caverphone, in turn, embeds the rules for phonetically transcribing proper names as
spoken by people in southern Dunedin, New Zealand during the years 1893-1938.
Even if you aren't concerned about building a system that can address the
transformation rules required for different accents or languages, the code itself should
bother you. Consider the code for dealing with the letter C in Metaphone, first in
Java:
case 'C' : // lots of C special cases
    /* discard if SCI, SCE or SCY */
    if ( isPreviousChar(local, n, 'S') &&
         !isLastChar(wdsz, n) &&
         (this.frontv.indexOf(local.charAt(n + 1)) >= 0) ) {
        break;
    }
    if (regionMatch(local, n, "CIA")) { // "CIA" -> X
        code.append('X');
        break;
    }
    if (!isLastChar(wdsz, n) &&
        (this.frontv.indexOf(local.charAt(n + 1)) >= 0)) {
        code.append('S');
        break; // CI,CE,CY -> S
    }
    if (isPreviousChar(local, n, 'S') &&
        isNextChar(local, n, 'H')) { // SCH->sk
        code.append('K');
        break;
    }
    if (isNextChar(local, n, 'H')) { // detect CH
        if ((n == 0) &&
            (wdsz >= 3) &&
            isVowel(local, 2)) { // CH consonant -> K consonant
            code.append('K');
        } else {
            code.append('X'); // CHvowel -> X
        }
    } else {
        code.append('K');
    }
    break;

The same behavior is implemented in the sample database in the private
FuzzyP_GetMetaphoneCode routine:
: ($current_character="C")  ` lots of C special cases
    ` Build test substrings:
    If ($character_index>1)
        $substringStartingOnPrevious_s3:=Substring($working_string;$character_index-1;3)
    Else
        $substringStartingOnPrevious_s3:=""
    End if
    $substringStaringOnChar_s3:=Substring($working_string;$character_index;3)
    C_BOOLEAN($wereNotOnTheFirstCharacter_b)
    $wereNotOnTheFirstCharacter_b:=$character_index>1
    C_LONGINT($distanceFromLastChar_index)
    $distanceFromLastChar_index:=$working_string_length-$character_index
    Case of  ` The order of some of these cases is important!
        : ($substringStartingOnPrevious_s3="SCE")  ` Skip
        : ($substringStartingOnPrevious_s3="SCI")  ` Skip
        : ($substringStartingOnPrevious_s3="SCY")  ` Skip
        : ($substringStaringOnChar_s3="CIA")  ` Cia --> X
            $workingCode_s40:=$workingCode_s40+"X"
        : ($substringStaringOnChar_s3="CI@")  ` Ci --> S
            $workingCode_s40:=$workingCode_s40+"S"
        : ($substringStaringOnChar_s3="CE@")  ` Ce --> S
            $workingCode_s40:=$workingCode_s40+"S"
        : ($substringStaringOnChar_s3="CY@")  ` Cy --> S
            $workingCode_s40:=$workingCode_s40+"S"
        : ($substringStartingOnPrevious_s3="SCH")  ` sCh --> K
            $workingCode_s40:=$workingCode_s40+"K"
        : ($substringStaringOnChar_s3="CH@")  ` cH+consonant --> K, cH+vowel --> X
            C_BOOLEAN($isAtEndOfWord_b)
            C_BOOLEAN($isFollowedByAVowel_b)
            $isAtEndOfWord_b:=$character_index+2>=$working_string_length
            $isFollowedByAVowel_b:=Metaphone_CharacterIsAVowel (Metaphone_GetNextChararacter ($working_string;$character_index+1))
            Case of
                : ($character_index=1)
                    $workingCode_s40:=$workingCode_s40+"K"
                : ($isAtEndOfWord_b)
                    $workingCode_s40:=$workingCode_s40+"X"
                : ($isFollowedByAVowel_b)
                    $workingCode_s40:=$workingCode_s40+"X"
                : ($isFollowedByAVowel_b=False)
                    $workingCode_s40:=$workingCode_s40+"K"
            End case
        Else
            $workingCode_s40:=$workingCode_s40+"K"
    End case

I picked the case of C deliberately as it is one of the most complex in the Metaphone
system. In Double Metaphone, the C is even more complex. The rules shown above
are easier to follow if extracted:
If the C is within 'SCE', skip
If the C is within 'SCI', skip
If the C is within 'SCY', skip
If the C begins 'CIA', transform to 'X'
If the C begins 'CI', transform to 'S'
If the C begins 'CE', transform to 'S'
If the C begins 'CY', transform to 'S'
If the C is within 'SCH', transform to 'K'
If the C is within CH, and the CH is at the start of the word, transform to 'K'
If the C is within CH, and the CH is at the end of the word, transform to 'X'
If the C is within CH, and the CH is followed by a vowel, transform to 'X'
If the C is within CH, and the CH is followed by a consonant, transform to 'K'

Presented tabularly, it's easier to see how rules can be defined in a readily machine-processable format:

Rule Type    Pattern  Start of Word?  End of Word?  Followed by Vowel?  Followed by Consonant?  Output
Char within  SCE                                                                                (skip)
Char within  SCI                                                                                (skip)
Char within  SCY                                                                                (skip)
Char starts  CIA                                                                                X
Char starts  CI                                                                                 S
Char starts  CE                                                                                 S
Char starts  CY                                                                                 S
Char within  SCH                                                                                K
Char starts  CH       TRUE                                                                      K
Char starts  CH                       TRUE                                                      X
Char starts  CH                                     TRUE                                        X
Char starts  CH                                                         TRUE                    K


Note

The rules listed above are included to expand on the discussion of data-driven
programming. I may have extracted the rules with some flaws and have not double-checked them through code.
Using a table of rules like these, a rule-processing engine can scan through a block of
text from start to finish without any need for special cases or custom logic embedded
in the code itself. In a full implementation, additional rule types are required by some
algorithms to, for example, pre-transform specific patterns before scanning through
the source string.
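To make the rule-processing engine concrete, here is a minimal sketch in Python. It drives the C-handling table above through a single generic matcher; the rule-tuple layout, helper names, and the default "C sounds like K" fallback are my own assumptions for illustration, not the FuzzyTools structures:

```python
# Each rule: (rule_type, pattern, start_of_word, end_of_word,
#             followed_by_vowel, followed_by_consonant, output).
# A condition of None means "don't care"; an output of None means "skip".
VOWELS = set("AEIOU")

RULES = [
    ("within", "SCE", None, None, None, None, None),   # skip
    ("within", "SCI", None, None, None, None, None),   # skip
    ("within", "SCY", None, None, None, None, None),   # skip
    ("starts", "CIA", None, None, None, None, "X"),
    ("starts", "CI",  None, None, None, None, "S"),
    ("starts", "CE",  None, None, None, None, "S"),
    ("starts", "CY",  None, None, None, None, "S"),
    ("within", "SCH", None, None, None, None, "K"),
    ("starts", "CH",  True, None, None, None, "K"),
    ("starts", "CH",  None, True, None, None, "X"),
    ("starts", "CH",  None, None, True, None, "X"),
    ("starts", "CH",  None, None, None, True, "K"),
]

def transcribe_c(word, i):
    """Apply the first matching rule for the 'C' at position i of word."""
    word = word.upper()
    for rule_type, pat, start, end, vowel, consonant, output in RULES:
        if rule_type == "starts":
            if not word.startswith(pat, i):
                continue
            after = i + len(pat)        # first position after the pattern
        else:
            # "within": in every rule above, C is the pattern's second character
            if i == 0 or not word.startswith(pat, i - 1):
                continue
            after = i - 1 + len(pat)
        if start is not None and (i == 0) != start:
            continue
        if end is not None and (after >= len(word)) != end:
            continue
        if vowel is not None and (after < len(word) and word[after] in VOWELS) != vowel:
            continue
        if consonant is not None and (after < len(word) and word[after] not in VOWELS) != consonant:
            continue
        return output
    return "K"  # assumed default: an unmatched C sounds like K
```

Swapping in a different accent or language then means swapping the RULES list, not rewriting the matcher. For example, `transcribe_c("CELLO", 0)` follows the CE rule and yields "S", while `transcribe_c("SCENE", 1)` hits the SCE rule and is skipped.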
There are numerous advantages to using a table-driven strategy:
Since the rules are defined as data, they can be read from multiple sources. For
example, you can add a routine to read through the rules and automatically produce
human-readable descriptions or write a routine that generates various test patterns
to exercise the rules.
The rule-sets are no longer buried in the code, so the source code is dramatically
reduced in size.
Adding support for a new accent or language doesn't require new coding, just a new
rule set. If the rules are defined in records, the database won't even need to be
recompiled. The exception to the "no recoding" rule is if a particular rule-set
requires a new rule type, such as "replace the current character if it is equal to a
specific value and is exactly 3 characters from the front of the word." It is certainly
the case that certain writing systems and spoken languages would require rule types
not shown in the small table above. For example, properly transcribing German
presents special challenges because of the way word spellings are modified when
words are compounded.
No knowledge of programming is required for defining rules. A linguist can help
develop rules that are then fed into the rule processing engine.
Translating the system between computer languages is now much simplified. Only
the rule-processing engine and rule-storage system need to be adapted.

Like any data-driven system, if there's a bug in the processing engine, you only
need to fix it once to fix the system. On the other hand, when there are bugs in the
rule-set, they don't hurt the engine or stop the engine from processing properly
formulated rule-sets correctly.

Storing rules as data makes it possible to switch rules on-the-fly based on other
inputs. It's fair to think of rule sets as parameters in this scenario. For example, a
Web-based name database could allow users to identify their location, or guess it
from their IP address, and use that as a basis for selecting the default accent rule
set to apply. Instead of switching algorithms, the code simply switches rule sets. It's
easy to imagine loading a special rule set for an Italian speaker using an
English-language data source.


So Why Doesn't Everyone Use a Rule-Driven Approach?


In fact, proprietary transcription and translation software does use rule-driven strategies,
or even more sophisticated approaches. The common, publicly available systems tend
to be ad-hoc because they're easier to code initially. It takes more time to define rule
sets with a linguist, and develop a rule set data structure and processing engine.
Ultimately a rule set is a small token language and the engine acts as a single-purpose
interpreter. While manageable in scope, there needs to be justification for
such a project. If you are working on a multi-lingual/multi-accent system, a rule-driven
approach is well worth considering.
Now let's turn to a different class of fuzzy matching tools, approximate string
matching.

String Similarity Measurement Techniques


------------------------------------------------------------------------------------------------------------------------------------------------

Overview
Approximate string comparison algorithms are another fundamental approach to fuzzy
matching. Instead of converting the strings into another form, as the phonetic
transcription algorithms do, strings are compared directly to calculate their relative
similarity. The degree of similarity is expressed as a percentage, count, or length,
depending on algorithm. These measurements of similarity/difference are often called
distances. There are some major advantages to string distance measurement tools:
Strings are compared directly and completely. This is quite different from phonetic
algorithms where some of the original data is lost or transformed in the process.
The distance calculating methods do an excellent job of finding true positives and
avoiding false negatives under a wide range of conditions.
The numeric distance measures enable a developer to rank possibilities, sometimes
quite exactly. This is exactly how, for example, most spell-checkers order
suggestions. The most similar words are proposed first according to their relative
similarity.
These algorithms are not tied to data in English or any other language. In fact,
approximate string comparison algorithms are used to compare DNA sequences,
which have only four letters in their alphabet, although with very long words.
The main disadvantage of approximate string comparison techniques is that the
comparisons are between any two strings. If you want to compare a record with every
other record in a table, that's a lot of comparisons:

1 * (records in table - 1)

1 is subtracted above because it is not necessary to compare a record with itself.
Under the right circumstances, this approach may be fast enough to use on-the-fly
during data entry. If, however, an entire table needed to be scanned for duplicates,
there are a lot more comparisons, roughly:
(records in table * records in table) / 2


If you have 10,000 records to compare, that's 50,000,000 comparisons. (The actual
figure is 49,995,000. We'll look at the details on this figure later.) There are ways to
intelligently reduce the number of records that require testing for duplicates, of
course, but direct sequential comparison of an entire table is obviously slower than
searching on an indexed value, such as a stored phonetic key.
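The comparison counts above follow directly from the standard pairs formula, which is easy to verify in Python (an illustrative calculation, not FuzzyTools code):

```python
import math

def pairwise_comparisons(record_count):
    """Number of unique record pairs to compare: n * (n - 1) / 2."""
    return record_count * (record_count - 1) // 2

# Comparing every record in a 10,000-record table with every other record:
assert pairwise_comparisons(10000) == 49995000
# The same figure, via the standard n-choose-2 function:
assert pairwise_comparisons(10000) == math.comb(10000, 2)
```

This is why reducing the starting selection pays off so quickly: halving the record count cuts the number of comparisons roughly by four.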
Note

When comparing records using any of the distance-calculation routines, remember


that the number of records found or threshold used makes no difference to the search
time. Regardless of how many records are found, all of the records in the current
selection must be compared. To reduce the search time, reduce the starting selection by
some means. For example, if you have an employee database with a reliable gender
field, eliminate from consideration one gender. (In medical applications, gender is
often not a two-value field.)

Don't Be Scared Off


I want to address the limitations of approximate string comparisons immediately but
don't want to scare you off. The approximate string comparison routines are the most
powerful code in FuzzyTools. To see them in action, try running a compiled version of
the sample database. Open a sample word and look at page two, pictured below:

All of the weighted columns are calculated on-the-fly when you open the record. Try
moving between records and see if there is any delay. On modest modern equipment,
I can't detect one. Note that this form performs at least five indexed queries in the On
Load phase to load the values onto page 1, apart from the distance calculations. On a
small scale, the distance calculation routines are instantaneous. To see how they work
in the context of a full table search, turn to page 3, pictured below:

This screen lets you interactively test searching for related words based on a weighted
threshold. In the example above, the system searched for words that are at least 80%
similar to the starting word "Zysko". (We'll look at what distance values mean
shortly.) This search is not instantaneous, but it can compare over 15,000 records in
less than 2 seconds under 4D Server. These tests were run on modest contemporary
equipment in compiled mode. Finding by raw edit distance can be quite a bit faster
than the find-by-weight algorithms available on this screen. It's well worth importing
some of your own data into the test database in a fresh data file and seeing how it
performs. Now, let's look at the individual distance measurements in more detail.
Note

The Find By All Methods button runs all six comparisons at once, taking several times
longer than testing any particular method.

Weighted Scoring Algorithms


There are several statistical algorithms for calculating the difference between two
strings. Amongst the most often cited are Jaro and Jaro-Winkler. The FuzzyTools
Fuzzy_GetDistancePercentage method implements four related algorithms based on
Jaro, as listed below:
Method       Notes/Source
Jaro         The most basic algorithm.
Winkler      Refines the weight produced by Jaro.
McLaughlin   Refines the weight produced by Winkler.
Lynch        Refines the weight produced by McLaughlin.
The original paper outlining the distance algorithms is listed below:


Advanced Methods for Record Linkage 940920
William E. Winkler
Bureau of the Census
Washington DC 20233-9100
[email protected]
http://www.census.gov/srd/papers/pdf/rr94-5.pdf
A version of the algorithms in C can be found at the Census site:
http://www.census.gov/geo/msb/stand/strcmp.c
I found the C code incomprehensibly dense and based the FuzzyTools implementation
on Pascal code by André Cipriani Bandarra found here:
http://www.bandarra.org/pascal/stringcomparison/
Of course, you don't need to read the Pascal, C, or 4th Dimension code to use the
weighted distance calculating algorithms. All you need to do is call
Fuzzy_GetDistancePercentage with the name of one of the algorithms and the two
strings to compare. The example below uses the Lynch algorithm to compare the
strings "massey" and "massie".
C_REAL($distance_weight)
$distance_weight:=Fuzzy_GetDistancePercentage ("Lynch";"massey";"massie")
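For the curious, the core Jaro score (before the Winkler, McLaughlin, and Lynch refinements) can be sketched in Python. This is an illustrative reimplementation from the published description of the algorithm, not a translation of the FuzzyTools or Census Bureau code:

```python
def jaro(s1, s2):
    """Basic Jaro similarity: 1.0 for identical strings, 0.0 for no match."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and within half the longer length of each other
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_matched = [False] * len(s1)
    s2_matched = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not s2_matched[j] and s2[j] == ch:
                s1_matched[i] = s2_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order (transpositions)
    transpositions = 0
    j = 0
    for i in range(len(s1)):
        if s1_matched[i]:
            while not s2_matched[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3
```

For example, the classic test pair "MARTHA"/"MARHTA" (six matches, one transposition) scores 17/18, about 0.944, and "massey"/"massie" scores about 0.889. The Winkler refinement then boosts scores for strings that share a common prefix, which is why similarities near the front of a word count for more.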

Edit Distance
One way to quantify the difference between two strings is by counting how many
differences they have, a measurement commonly called an edit distance. Edit
distances are a primary tool in spell-checkers and are sometimes used in database
comparisons. The Fuzzy_GetEditDistanceCount routine implements one of the best-known of these algorithms, called the "Levenshtein distance". The name comes from
Vladimir Levenshtein, the scientist who first developed this system in 1965. This
algorithm returns a count of how many additions, deletions and substitutions are
required to transform one string into another. The more differences there are between
the strings, the more steps are required to make them identical and, therefore, the
higher the distance count. Identical strings require no transformations and, therefore,
return a count of 0. For an example of two unlike strings, the edit distance between
"kitten" and "sitting" is 3:
0  kitten
1  sitten   Substitute 's' for 'k'.
2  sittin   Substitute 'i' for 'e'.
3  sitting  Insert 'g' at the end of the word.

As this example illustrates, the two strings don't need to be of the same length to be
compared. Many papers and articles on fuzzy matching include variations on the
original Levenshtein algorithm. For example, different transformations can be given
different costs, or certain substitutions can be given a discounted cost to, for example,
allow for common letter transpositions. In FuzzyTools, I stuck with the original
Levenshtein approach since it is simple to understand and code, fast, effective, and
blind to the language a string contains. Also, I couldn't find any empirical evidence
that the modified edit distance routines are better, generally, than the original in
real-world tests. If you are interested in reading more about the Levenshtein distance, visit
the helpful Wikipedia entry listed below:
http://en.wikipedia.org/wiki/Levenshtein_distance
The Fuzzy_GetEditDistanceCount routine is a straight translation of the C code by
Michael Gilleland found at:
http://www.merriampark.com/ld.htm
A 4th Dimension developer won't need to understand the Levenshtein distance
algorithm or the 4th Dimension code used to implement it, but only to learn how to
call the Fuzzy_GetEditDistanceCount routine, as illustrated in the example below:
C_LONGINT($distance_count)
$distance_count:=Fuzzy_GetEditDistanceCount ("massey";"massie")
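For readers outside 4th Dimension, the same Levenshtein calculation can be sketched in Python using the standard dynamic-programming formulation (an illustration of the algorithm, not a translation of the FuzzyTools routine):

```python
def edit_distance(s, t):
    """Levenshtein distance: the minimum number of insertions, deletions,
    and substitutions needed to turn s into t."""
    # prev[j] holds the distance from the current prefix of s to t[:j]
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into "" takes i deletions
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

This reproduces the worked example above: `edit_distance("kitten", "sitting")` returns 3, and `edit_distance("massey", "massie")` returns 2.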

Note that Fuzzy_GetEditDistanceCount returns an absolute count (longint) and


Fuzzy_GetDistancePercentage returns a real between zero and one. As a convenience,
Fuzzy_GetDistancePercentage can also generate edit distance scores converted to a
percentage. Edit distances can be converted to percentages by hand with a call to
Fuzzy_EditDistanceToPercentage.

Longest Common Subsequence


A third strategy for comparing two strings, or streams of bytes, is called Longest
Common Subsequence, or LCS. This technique is commonly used to compare DNA
sequences, for example. This approach is somewhat similar to finding the longest
matching substring. For example, imagine these two strings:
John Anderson
Jon Anderrsen


If you were finding the longest matching substring, you would end up with the result
highlighted below:
John Anderson
Jon Anderrsen
The substring above is 5 characters long (Ander) out of 13 in the original strings.
Converted to a percentage, that's a similarity of a bit over 38%. Just from looking at
the strings, the score seems too low. The LCS algorithm, implemented in FuzzyTools's
Fuzzy_GetLCSLength routine, finds the longest common subsequence instead of the
longest substring. The difference is that a subsequence ignores non-matching
intervening characters. So, in comparing "Jon Anderrsen" to "John Anderson", the
pattern highlighted below registers as a match (the space character matches but can't
be highlighted):
John Anderson
Jon Anderrsen
Notice that Jon and John match because the characters J-o-n appear, in order, within
J-o-h-n. The h in John is simply ignored as a junk character. The longest matching
subsequence, then, is 11 characters long (Jon Andersn), giving us a similarity
percentage of roughly 85%. This score agrees much more closely with how a human
would rank the two strings.
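A minimal LCS-length implementation in Python (the textbook dynamic program, not the FuzzyTools routine) bears the example out:

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t.
    A subsequence preserves order but may skip intervening characters."""
    prev = [0] * (len(t) + 1)
    for sc in s:
        curr = [0]
        for j, tc in enumerate(t, start=1):
            if sc == tc:
                curr.append(prev[j - 1] + 1)          # extend a match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character
        prev = curr
    return prev[-1]
```

Running `lcs_length("John Anderson", "Jon Anderrsen")` returns 11, the "Jon Andersn" subsequence described above, and 11/13 gives the roughly 85% similarity quoted in the text.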
Note

The two strings in the example above are the same length only to make the example
easier to follow. In practice, you may compare strings of different lengths with all of
the distance comparison algorithms in FuzzyTools.
The code for Fuzzy_GetLCSLength is based on notes and code found at the location
below:
http://www.ics.uci.edu/~eppstein/161/960229.html

Distance Score Examples


The distance scores are easier to understand after looking at some examples. The
Fuzzy_GetDistancePercentage routine always returns a real value between 0 and 1.
Another way to look at these weights is as percentages, where 1.00 = 100%
agreement and 0 = 0% agreement. So, a score of .679 indicates a similarity of 67.9%.
Below are a few sample results based on comparing words to "Adkinson" using the
Lynch algorithm. I've translated the scores into percentages (score * 100) for
convenience. I have also added the edit distance counts and LCS lengths for
comparison.


Word       Score  %      Edit  LCS  Notes
Adkinson   1.000  100.0  0     8    This is the base word so you should expect 100% agreement.
Adkins     0.956  95.6   2     6    Adkins gets a high score because it matches the front of Adkinson perfectly.
Atkinson   0.948  94.8   1     7    Atkinson deserves a high score because it differs from the original by only one character.
Atchinson  0.882  88.2   3     6
Apple      0.556  55.6   7     1
Blueberry  0.070  7.0    9     0    If anything, this weighted score is too high.

Note

The string comparison algorithms implemented here are not biased towards English
and should work well with any language. However, they may be biased towards
left-to-right word order and may not prove as accurate with right-to-left word order.
Note that the various statistical algorithms implemented in
Fuzzy_GetDistancePercentage are not as simple as the edit distance and LCS
algorithms. Internally, the statistical techniques give preferential weighting to various
factors, such as similarities nearer to the front of the word. Therefore, there is a high
but imperfect correlation between edit distance scores and weighted distance scores.
As an example, the Lynch algorithm considers "Adkins" more similar to "Adkinson"
than "Atkinson", while the edit distance algorithm ranks them in the opposite order.


It also makes sense that differences in edit distance counts and LCS lengths are likely
to be more meaningful when comparing longer strings.

Normalizing Similarity Scores


As you may have noticed in the example above, the three different algorithm types
produce different scores on different scales, sometimes running in opposite directions
on the number line, as summarized below:

Strategy                     Method                       Output          Identical Strings
Statistical Similarity       Fuzzy_GetDistancePercentage  Real from 0-1   1
Edit Distance                Fuzzy_GetEditDistanceCount   Longint count   0
Longest Common Subsequence   Fuzzy_GetLCSLength           Longint length  String length
Fuzzy_GetEditDistanceCount returns a longint count, Fuzzy_GetLCSLength returns a
longint length, and Fuzzy_GetDistancePercentage returns a real percentage. Each of
these scales makes sense for its respective algorithm, but they cause confusion when
working with scores. To simplify the system, the Fuzzy_GetDistancePercentage routine
can produce any of the six possible scores (Jaro, Winkler, McLaughlin, Lynch, Edit,
and LCS) as a percentage. Internally, raw edit distance and LCS scores are converted
into a percentage to make them comparable with the results from the statistical
functions. This feature makes it a lot easier to compare the different tools and use
them together, and is particularly handy when calling Fuzzy_FindByDistancePercentage,
which always expects a percentage. You still have access to raw edit distance and
LCS scores and the routines to convert them to percentages, if you prefer.
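One common way to put the raw counts on the same 0-1 scale as the statistical scores is to divide by the longer string's length. This is an assumption for illustration; the exact formulas used by Fuzzy_EditDistanceToPercentage and the internal LCS conversion are not shown in this note:

```python
def edit_distance_to_percentage(distance, s, t):
    """Invert an edit distance and scale it by the longer string's length,
    so identical strings score 1.0 and totally different strings score 0.0."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - distance / longest

def lcs_length_to_percentage(lcs_len, s, t):
    """Scale an LCS length by the longer string's length, so identical
    strings score 1.0."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else lcs_len / longest
```

Under this convention, the "kitten"/"sitting" edit distance of 3 becomes 1 - 3/7, roughly 57%, and the 11-character LCS of "John Anderson"/"Jon Anderrsen" becomes 11/13, the roughly 85% figure quoted earlier.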

A Note on The Six Distance Routines


FuzzyTools includes two sets of distance-based routines:
Fuzzy_GetDistancePercentage, Fuzzy_GetEditDistanceCount, and Fuzzy_GetLCSLength
to calculate distances and Fuzzy_FindByDistancePercentage,
Fuzzy_FindByEditDistanceCount, and Fuzzy_FindByLCSLength to locate records based
on distances. Why are there six routines instead of two? Internally, it would have been
simpler to write a single calculation routine and a single find routine, but it would have
been more confusing to use because statistical, edit distance, and LCS scores don't
use a common number line or scale. To simplify dealing with the three different
systems, the Fuzzy_GetDistancePercentage routine can produce any of the six scores,
adjusted to a percentage. Therefore, you don't have to deal with the differences in the
scoring systems, if you don't want to. Likewise, Fuzzy_FindByDistancePercentage can
use any of the six algorithms starting from a threshold value expressed as a
percentage. The raw edit distance and LCS score-based systems are included, if you
have an application for the raw scores. Finding by edit distance scores can, for
example, be substantially faster than by the statistical methods.

Selecting a Distance Measurement Algorithm


As just summarized, FuzzyTools implements four weighted distance measurement
algorithms, one edit distance algorithm, and one LCS algorithm. An obvious question
is: which one to use? As mentioned, all four weighted algorithms are closely related.
Jaro is the simplest, Winkler refines Jaro, McLaughlin refines Winkler, and Lynch
refines McLaughlin. Therefore, the most refined weighting algorithm is Lynch, a good
bet for your default approach to weighted results. On the other hand, the edit distance
algorithm is simple, effective, and fast. As implemented in FuzzyTools and applied to
the sample data provided, the edit distance approach is typically 10-13 times faster at
retrieving values than the weighted distance code. This comparison is somewhat
misleading since edit distances deal with an absolute measurement and the statistical
methods deal with percentages. If speed is your primary consideration, the edit
distance algorithm is your best bet. Be forewarned, however, that edit distances, since
they are absolute counts, can deliver less than ideal results. For example, if you set
the threshold to 'find words within an edit distance of 2 from my base word', you'll get
very different results if your comparison strings are longer or shorter. (2 characters in
a 5-character string means more than 2 characters in a 10-character string.) The
statistical methods are less prone to this form of distortion.
On the other hand, speed is not always the first concern when performing fuzzy
matches. The different algorithms produce different results and, therefore, match
different related values. The example below is based on the surnames data set in the
sample database. Starting with the last name Zysko, below are the surnames that
match based on various algorithms set to a threshold of 80%. (Absolute edit and LCS
scores are not used here.)
Method   Jaro   Lynch  McLaughlin  Winkler  Edit   LCS
Matches  2      5      3           3        2      2
         Zysko  Zysko  Zysko       Zysko    Zysko  Zysko
         Zksko  Zksko  Zksko       Zksko    Zksko  Zksko
                Zyki   Zyki        Zyki
                Risko
                Sykor

As you can see from this sample, you can get very different results from these
algorithms. Using the six algorithms with the threshold specified matches several
unique words. Notice that no one algorithm matched all of them. The thresholds selected
also make a huge difference in the number and quality of hits found. A different
example from the surnames samples helps illustrate this. Starting from the name
"Abate", below are the number of matching surnames out of 15,557 unique
possibilities based on various absolute edit distances:
Edit Distance  Matching Names
0              1
1              1
2              19
3              331
4              2,500

Depending on your data, you may find that one approach works far better than
another or, more likely, that using a combination of approaches delivers the best
results. Experimentation is the best way to find out. Still, there are some common
strategies to consider, two of which are sketched below:
For time-sensitive situations, such as a quick lookup on names, use the Metaphone
phonetic matching algorithm to find likely matches and then use the edit distance or
Lynch weighting algorithms to sort the possible matches.
For more time-consuming full-record comparisons, calculate more than one distance
measurement. For example, you can use the edit distance with a low difference
score to generate a strong list of possibilities quickly, and then refine the list by
subselection based on Lynch scores.
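The two-stage strategy above can be sketched in Python. Since the Lynch weighting is a 4D routine, the standard-library difflib ratio stands in for the weighted re-ranking here; the function names and the edit-distance cutoff of 3 are illustrative choices, not FuzzyTools code:

```python
from difflib import SequenceMatcher

def edit_distance(s, t):
    """Levenshtein distance (standard dynamic program)."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def suggest(base, candidates, max_edit=3):
    """Stage 1: cheap edit-distance filter to cut the candidate list.
    Stage 2: rank the survivors by a weighted similarity score
    (difflib's ratio stands in for the Lynch weighting)."""
    survivors = [c for c in candidates
                 if edit_distance(base.lower(), c.lower()) <= max_edit]
    return sorted(survivors,
                  key=lambda c: SequenceMatcher(None, base.lower(), c.lower()).ratio(),
                  reverse=True)
```

For example, `suggest("Zysko", ["Zysko", "Zksko", "Blueberry", "Cisco"])` drops "Blueberry" in the cheap first stage and puts the exact match first. The design point is that the expensive scoring function only ever sees the small filtered list.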

Fuzzy Duplicate Matching

------------------------------------------------------------------------------------------------------------------------------------------------

One of the most powerful applications of fuzzy matching is linking records from
different databases or matching duplicate records within a database. The sample database's
Duplicate Report feature, available in the Show People screen, implements an
example for your review. This example and fuzzy duplicate matching are discussed in
more detail in Technical Note 06-20 Data Cleaning and Deduplication.

Other Approaches to Fuzzy Matching


------------------------------------------------------------------------------------------------------------------------------------------------

Fuzzy matching is a large and intensively researched subject. As the volume of data in
databases, the Web, and other repositories increases, so too expands the need for
better and faster fuzzy matching techniques. The FuzzyTools component implements
some of the best-known and straightforward algorithms, but there are many more to
be considered. If you find these tools inadequate, you should investigate other
approaches, a few of which are sketched next.

N-Gram Approximate String Comparisons


In n-gram analysis, each source string is broken into chunks and the collection of
chunks are then compared to produce a weighted value. This approach is less
sensitive to word order than the distance measures provided in FuzzyTools. N-grams
would, for example, do a better job of recognizing the similarity of the strings "TF
Flannery" and "Flannery, TF". And, as noted above, an n-gram analysis would do a
better job of recognizing similarities in a concatenated value than any of the distance
measures implemented here.
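A simple form of n-gram comparison uses character bigrams with the Dice coefficient. FuzzyTools does not include this; the sketch below is an illustration of the general technique:

```python
def bigrams(text):
    """Set of adjacent character pairs, ignoring case."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over character bigrams: 1.0 for identical bigram
    sets, 0.0 for strings that share no adjacent character pairs."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Because the bigram sets are order-free, `dice_similarity("TF Flannery", "Flannery, TF")` scores above 0.7 even though the two strings would be far apart by edit distance, while the classic pair "night"/"nacht" shares only the "ht" bigram and scores 0.25.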

WordList Comparison
Word lists offer a very simple form of full-record or full-text comparison. The sample
database includes a couple of routines not found in the FuzzyTools component that
show how this process works. In practice, you combine the values from some target
fields, such as name and address, and then extract the unique list of words from
them. So:
David Adams
711 Medford Center
Gives you the word list
711
Adams
Center
David
Medford
You can then use this word list to compare records. For example, a record with the
same address but with "Center" abbreviated as "Ctr" appears similar:

711      Match
Adams    Match
Center   No match
Ctr      No match
David    Match
Medford  Match
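The word-list comparison above is easy to sketch in Python. The function names are illustrative, not the routines from the sample database:

```python
def word_list(*fields):
    """Unique, sorted, lowercased words from the combined field values."""
    words = set()
    for field in fields:
        words.update(field.lower().split())
    return sorted(words)

def overlap_score(words_a, words_b):
    """Fraction of the combined word list that the two records share."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0
```

For the records above, both word lists contain five words, four of which match ("711", "adams", "david", "medford"), giving an overlap of 4 out of the 6 words in the union, about 0.67.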

Probabilistic Comparison Methods


Very large-scale custom and commercial data-consolidation systems apply
probabilistic approaches to record/value matching in a variety of ways. For example,
in most real-world databases, the distribution of values in any particular field is likely
to be far from uniform. Through statistical analysis, you can improve how the system
ranks or guesses what a new value should be. Imagine, for example, a database of
last names. In this imaginary database, a new person is added with the last name
Adames. This name is within an edit distance of 2 of both Adams and Wadams.
Which is more likely? We can ask the data. If there are 1,000 instances of Adams, 3
instances of Wadams, and no instances of Adames, it's fairly likely the new value
should also be Adams. Probabilistic techniques can be quite complex and
sophisticated, but they offer the advantage of letting the data guide and improve the data.
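A very simple version of this idea combines an edit-distance filter with value frequencies. The following Python sketch (names invented; the edit distance here is plain Levenshtein, not the FuzzyTools implementation) picks the most common known value within a small distance of the new one:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def best_correction(new_value, known_counts, max_distance=2):
    """Among known values within max_distance edits, pick the most frequent."""
    candidates = [(count, value) for value, count in known_counts.items()
                  if edit_distance(new_value, value) <= max_distance]
    return max(candidates)[1] if candidates else None

last_names = {"adams": 1000, "wadams": 3}
print(best_correction("adames", last_names))  # → adams
```

Both stored names survive the distance filter, but the frequency data breaks the tie decisively in favor of Adams.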


Geocoding and Proximity Matching


With improved geographic databases now widely available through Web Services and
by other means, geocoding address data is increasingly practical. Matching records
based on physical nearness is a very powerful technique with a variety of applications.
For example, when testing similar names and addresses, a proximity ranking can
enhance the ability of a program to accurately estimate the likelihood the records are
a match.
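Once two addresses have been geocoded to latitude/longitude pairs, physical nearness reduces to a great-circle distance. The sketch below (Python, invented names; the scoring boost is an arbitrary illustration, not a recommended weight) uses the standard haversine formula:

```python
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two geocoded points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius of about 6,371 km

def proximity_boost(match_score, km, threshold_km=1.0):
    """Nudge a name-similarity score upward when the addresses geocode nearby."""
    return min(1.0, match_score + 0.2) if km <= threshold_km else match_score

# One degree of longitude at the equator spans roughly 111 km.
print(round(distance_km(0.0, 0.0, 0.0, 1.0)))  # → 111
```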

Some Notes on Modifying the Source Code


------------------------------------------------------------------------------------------------------------------------------------------------

Overview
While the FuzzyTools component is a complete toolkit, it may not meet all of your
needs exactly. For instance, you may want to add a new edit distance algorithm,
modify how parameters are structured internally, remove algorithms you don't
require, or rewrite the error strings. If you do change the source, this section includes
a few notes and suggestions that may be helpful.

Code Structure: Gateway Routines and Contracts


The FuzzyTools system is packaged as a component largely to simplify rational error
checking. The protected component methods, such as Fuzzy_GetPhoneticKey and
Fuzzy_GetDistancePercentage, do extensive error testing on all inputs. As far as
possible, the protected routines will not crash or return wrong results if passed
incorrect or incomplete parameter lists. These routines, in fact, rarely do anything
more than validate parameters and then dispatch the call to a private routine that
does the work. As an example, consider the call chain for Fuzzy_GetPhoneticKey,
leaving out error-handling routines:
Fuzzy_GetPhoneticKey
    FuzzyP_GetMetaphoneCode
    FuzzyP_GetSkeletonKeyCode
    FuzzyP_GetSoundexKnuthCode
    FuzzyP_GetSoundexMiracodeCode
    FuzzyP_GetSoundexSimplifiedCode
    FuzzyP_GetSoundexSQLServerCode
The public routine, Fuzzy_GetPhoneticKey, and the private routines it calls have a
contract. The contract is an agreement between the two pieces of code that is, in
fact, implemented as code. The rules of the contract are simple:
- The Fuzzy_GetPhoneticKey protected routine will always pass complete and
  correct values.
- The FuzzyP_ private routines will always return a string.
- All of these routines will set an error if one is encountered.


This contract isn't complicated, but it's quite powerful. The private routines don't need
to do any error testing on inputs, ever. Because the gateway/dispatching/interface
routine that calls them promises to provide good inputs, the internal routines can
assume the inputs are good. This approach shifts the burden of error testing to one
place and leaves the internal routines free to do their job: produce a phonetic key.
Without such a contract, you end up either not testing inputs properly or adding a lot
of complexity to each internal routine. If you modify the source code, I'd strongly
recommend leaving the current code structure and its implicit contracts in place.
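The gateway/contract pattern itself is language-neutral. The sketch below restates it in Python; every name and the placeholder worker bodies are invented for illustration (the real component methods are 4D code):

```python
class FuzzyError(Exception):
    """Raised by the gateway when the contract's preconditions are not met."""

def _soundex_knuth(value):
    # Private worker: per the contract it trusts its inputs and always returns
    # a string. (Placeholder body, not a real Soundex implementation.)
    return value[0].upper() + "000"

def _metaphone(value):
    # Another placeholder private worker, also trusting its inputs.
    return value[:4].upper()

_WORKERS = {"Soundex_Knuth": _soundex_knuth, "Metaphone": _metaphone}

def get_phonetic_key(algorithm, value):
    """Gateway: validate every input here so the workers never have to."""
    if algorithm not in _WORKERS:
        raise FuzzyError(f"Unknown algorithm: {algorithm!r}")
    if not isinstance(value, str) or not value.strip():
        raise FuzzyError("Value must be a non-empty string")
    # Contract honored: dispatch known-good inputs to the private worker.
    return _WORKERS[algorithm](value.strip())
```

Calling `get_phonetic_key("Nope", "x")` raises an error at the gateway, while the workers themselves contain no validation at all, which is exactly the division of labor the contract establishes.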

Sanity Checking Modifications


Both the phonetic and distance algorithms implemented in the system include
corresponding test routines. For example, FuzzyP_TestSoundexKnuth tests the
FuzzyP_GetSoundexKnuthCode function. Each of the test routines works in the same
manner:
- A series of test values is prepared. These consist of strings with known correct
  outputs for the particular algorithm.
- Each test value is passed to the algorithm being tested. The results are recorded.
- The inputs and outputs are formatted as a tab-delimited text block and returned. If
  there is a disagreement between the expected and actual outputs, it is flagged in
  this report.
Below is the output from the FuzzyP_TestSoundexKnuth routine, which looks much
like the other reports:
-----------------------------------------------------------
Soundex_Knuth Test Results
-----------------------------------------------------------
Name            Soundex_Expected    Soundex_Returned    Error?
Ellery          E460                E460
Euler           E460                E460
Gauss           G200                G200
Ghosh           G200                G200
Heilbronn       H416                H416
Hilbert         H416                H416
Kant            K530                K530
Knuth           K530                K530
Ladd            L300                L300
Lissajous       L222                L222
Lloyd           L300                L300
Lukasiewicz     L222                L222
Wachs           W200                W200
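The shape of these test routines is easy to reproduce in any language. A hypothetical Python sketch of the same pattern (names invented) runs known input/expected pairs through an algorithm and flags disagreements:

```python
def run_algorithm_tests(fn, cases):
    """Run (input, expected) pairs through fn and build a tab-delimited report.

    Any disagreement between the expected and returned values is flagged ERROR.
    """
    lines = ["Name\tExpected\tReturned\tError?"]
    failures = 0
    for name, expected in cases:
        returned = fn(name)
        flag = "" if returned == expected else "ERROR"
        failures += flag == "ERROR"
        lines.append(f"{name}\t{expected}\t{returned}\t{flag}")
    return "\n".join(lines), failures

# Demonstration with a trivial stand-in "algorithm" (uppercasing):
report, failures = run_algorithm_tests(str.upper, [("gauss", "GAUSS"), ("knuth", "Knuth")])
print(failures)  # → 1 (the second case expects the wrong value on purpose)
```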

If you modify the underlying source code of any of the algorithms, it is very handy to
run the various reports and check that nothing has broken. To make your life easier,
the source code database includes two test screens in the Runtime environment that
provide an interface for the test routines. The test screens are pictured below:


Scanning Text and BLOBs


Currently, the code in the FuzzyTools system works on short strings, up to 80
characters long. Since this technical note is primarily about fuzzy matching of string
fields, the 80-character limit makes perfect sense. On the other hand, there is no
absolute reason why many of these algorithms, particularly the distance-calculation
algorithms, can't be used for larger blocks of data. In fact, the distance algorithms are
used to compare protein strings, documents, and other very long sequences. If you
decide to adapt the existing code to handle larger strings, also please review Technical
Note 05-42 Scanning Text and BLOBs Efficiently.

Read the Comments


Look in the method comments for hints about changing the source code. For example,
if you decide to change Fuzzy_GetEditDistanceCount to handle larger strings, you'll
see some important comments regarding memory consumption that apply to larger
values. For an overview of the methods, see the FuzzyD Read Me Private and FuzzyP
Read Me Private routines, as well as the Fuzzy Read Me Public routine.

Summary
------------------------------------------------------------------------------------------------------------------------------------------------

There is a wide range of tools available for fuzzy matching, including the seven
phonetic and six approximate string-comparison algorithms implemented in
FuzzyTools. These tools are particularly powerful when combined. Developers can use
FuzzyTools to help improve lookups when exact spellings aren't known and to identify
possible duplicate values or records. As a general rule, the Metaphone phonetic
algorithm is the best of those provided for English names, and the edit distance
algorithm offers fast approximate string matching. The sample database gives you a
platform for testing your own data.
