Spell Checker and string matching Using BK tree

Shridhar Shah
Department of Computer Science and Technology
Institute of technology, Nirma University
Ahmedabad, India
14bce111@nirmauni.ac.in


 A BK tree, or Burkhard-Keller tree, is a data structure
used to perform spell checking based on edit distance
(Levenshtein distance).
 BK trees are also used for approximate string
matching.
 The auto-correct features of many software products can
be implemented with this data structure.
INTRODUCTION


 For instance, if we are checking the word "ruk", the candidate
corrections are {"truck", "buck", "duck", ……}. A spelling
mistake can be corrected by deleting a character from the
word, adding a new character to the word, or replacing a
character in the word with an appropriate one. We therefore
use the edit distance as the measure of how closely the
misspelled word matches the words in our dictionary.
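
The edit distance described above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not taken from the slides; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


for word in ("truck", "buck", "duck"):
    print(word, levenshtein("ruk", word))  # each candidate is 2 edits away
```

Under this definition all three candidates for "ruk" are two edits away, for example "ruk" -> "truk" -> "truck".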
EXAMPLE


 CREATE
 SEARCH(SPELL-CHECK)
OPERATIONS


 Suppose the dictionary is {"BALL", "WALL", "TAIL"}.
 Each node of the BK tree holds one word of the dictionary, so the tree
has exactly as many nodes as the dictionary has words.
 For this dictionary n = 3 (three nodes). Each edge between two nodes is
labelled with their edit distance (Levenshtein distance d). The first word
becomes the root; each later word is then placed according to its
Levenshtein distance d from the root, like this:
 LevenshteinDistance(BALL, WALL) -> 1
 LevenshteinDistance(BALL, TAIL) -> 2
 So d is 1 between BALL and WALL, and d is 2 between BALL and
TAIL.
 In the finished tree each node has at most one child for any given
edit distance. If a new word "MALL" is added, it cannot become a child of
the root, because the root already has a child at d = 1 ("WALL").
Insertion therefore continues at "WALL": d(WALL, MALL) = 1 and "WALL"
has no child at that distance yet, so "MALL" is added as a child of "WALL".
CREATE
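
The insertion rule for the BALL / WALL / TAIL / MALL example can be sketched as follows. This is an illustrative sketch only: the names `BKNode`, `insert` and `levenshtein` are assumptions introduced here, and children are stored in a dict keyed by edge distance so that each distance can hold at most one child.

```python
def levenshtein(a, b):
    """Edit distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}  # edge distance -> child BKNode


def insert(root, word):
    """Walk down from the root until a free slot for the word is found."""
    node = root
    while True:
        d = levenshtein(node.word, word)
        if d == 0:
            return  # word is already in the tree
        child = node.children.get(d)
        if child is None:
            node.children[d] = BKNode(word)  # slot at distance d is free
            return
        node = child  # slot taken: repeat the test at that child


root = BKNode("BALL")
for w in ("WALL", "TAIL", "MALL"):
    insert(root, w)

print({d: c.word for d, c in root.children.items()})
print({d: c.word for d, c in root.children[1].children.items()})
```

The first word inserted becomes the root, so the shape of the tree depends on insertion order.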


 To find the nearest correct word, or to perform approximate
string matching, we search a BK tree built from the given
dictionary, as the following example shows.
SEARCH


 The simple way to find a word is to start at the root and move
down through its children, looking for the word with the minimum
edit distance.
 To find the correct words for a misspelled word we first define a
tolerance value. The tolerance value (T) is the maximum edit
distance allowed between the misspelled word and a correct word
in our dictionary.
 Because the BK tree is built on edit distances, the candidates for
a misspelled word are found by computing its distance d to the
current node and then searching only the children whose edge
distance lies in the range [d-T, d+T].
 Suppose we have the misspelled word "oop" and T is 2. We will
now see how to collect the expected correct words for this
misspelled word.
SEARCH


 Step 1: We begin by computing d at the root node: d("oop", "help") = 3.
We then iterate over the root's children whose edge distance lies in the
range [d-T, d+T], i.e. [1, 5].
 Step 2: We start with the child at the highest edge distance, i.e. the node
"loop" on the edge labelled d = 4. We now compute its distance from the
misspelled word: d("oop", "loop") = 1.
 Here d = 1, i.e. d <= T, so we add "loop" to the list of expected correct
words and process its children whose edge distance lies in the range
[d-T, d+T], i.e. [1, 3] (the lower bound d-T = -1 is raised to 1, since
two distinct words are at least 1 edit apart).
 Step 3: Now we are at the node "troop". Again we compute its distance
from the misspelled word: d("oop", "troop") = 2. Here again d <= T,
so we add "troop" to the list of expected correct words.
We continue in the same way for every node in the range [d-T, d+T],
starting from the root and working down to the bottom-most leaf node.
 In this way we are left, at the end, with just 2 expected words for the
misspelled word "oop", i.e. {"loop", "troop"}.
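
The search steps above can be sketched in a few lines. The tree used in the walk-through is not reproduced in this text, so the dictionary below ("help", "hell", "hello", "loop", "troop") is an assumption made for illustration, and the tree it produces may differ in shape and edge labels from the one in the walk-through. The names `BKNode`, `build`, `search` and `levenshtein` are likewise illustrative.

```python
def levenshtein(a, b):
    """Edit distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}  # edge distance -> child BKNode


def build(words):
    """Insert the words one by one; the first word becomes the root."""
    root = BKNode(words[0])
    for w in words[1:]:
        node = root
        while True:
            d = levenshtein(node.word, w)
            if d == 0:
                break  # duplicate word
            if d not in node.children:
                node.children[d] = BKNode(w)
                break
            node = node.children[d]
    return root


def search(root, query, tol):
    """Return every stored word within edit distance tol of query."""
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        d = levenshtein(node.word, query)
        if d <= tol:
            matches.append(node.word)
        # Only children on edges in [d - tol, d + tol] can hold matches.
        for dist, child in node.children.items():
            if d - tol <= dist <= d + tol:
                stack.append(child)
    return matches


# Hypothetical dictionary, assumed for this sketch.
root = build(["help", "hell", "hello", "loop", "troop"])
print(sorted(search(root, "oop", 2)))  # ['loop', 'troop']
```

The pruning test relies on the triangle inequality, which Levenshtein distance satisfies: a word whose distance to the current node lies outside [d-T, d+T] cannot be within T of the query.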
SEARCH


 The basic method to find the nearest match is to take every word in the
dictionary and compare its edit distance (d) from the misspelled word with
the tolerance value (T). This takes a huge amount of time, i.e. O(N1*M*N2),
 where N1 is the number of words, N2 is the length of the incorrect word
and M is the mean length of a correct word.
 Using a BK tree we can reduce this time complexity in the following
manner. Assume the tolerance limit (T) is 2. The depth of the BK tree is
approximately log N, where N is the number of elements. At every level
we visit about 2 elements of the BK tree and evaluate the edit distance
for each. Therefore the time complexity is roughly O(N1*N2*log N), where
N1 now denotes the mean length of a string in our dictionary (not the
number of words, as above) and N2 is still the length of the incorrect word.
ANALYSIS


 The tree is N-ary and irregular (but generally well-
balanced).
 Tests show that searching with a distance of 1
queries no more than 5-8% of the tree, and searching
with two errors queries no more than 17-25% of the
tree - a substantial improvement over checking every
node!
 Note that exact searching can also be performed
fairly efficiently by simply setting the tolerance T to 0.
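
One rough way to check figures like these against your own dictionary is to count how many nodes a search actually visits. The sketch below does that for T = 0, 1 and 2. The word list `WORDS` is a small hypothetical sample, so the fractions it prints say nothing about large dictionaries; the names `build`, `search` and `levenshtein` are also assumptions of this sketch.

```python
def levenshtein(a, b):
    """Edit distance via the usual row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build(words):
    """BK tree as nested (word, children) pairs; children maps
    edge distance -> subtree."""
    root = (words[0], {})
    for w in words[1:]:
        node = root
        while True:
            d = levenshtein(node[0], w)
            if d == 0:
                break  # duplicate word
            if d not in node[1]:
                node[1][d] = (w, {})
                break
            node = node[1][d]
    return root


def search(root, query, tol):
    """Return (matches, number of nodes visited)."""
    matches, visited, stack = [], 0, [root]
    while stack:
        word, children = stack.pop()
        visited += 1
        d = levenshtein(word, query)
        if d <= tol:
            matches.append(word)
        stack.extend(child for dist, child in children.items()
                     if d - tol <= dist <= d + tol)
    return matches, visited


# Hypothetical sample dictionary, for illustration only.
WORDS = ["book", "books", "boo", "boon", "cook", "cool", "cake", "cape",
         "cart", "care", "card", "core", "bake", "lake", "like", "bike",
         "hike", "hire", "fire", "fine"]
root = build(WORDS)

for tol in (0, 1, 2):  # tol = 0 is an exact search
    matches, visited = search(root, "book", tol)
    print(f"T={tol}: visited {visited} of {len(WORDS)} nodes, "
          f"matches={sorted(matches)}")
```

The number of visited nodes can only grow as T grows, because a larger tolerance widens the range [d-T, d+T] at every node.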
ANALYSIS


 BK trees are commonly used in spell-checking applications such as
dictionaries and text editors, where they help correct words that were
typed with the wrong spelling.
 A spell checker is relatively simple and has three main parts: first it
checks whether the word exists in the dictionary, second it finds the
possible fixes for a misspelled word, and last it orders the suggestions
based on some sort of heuristic.
 A naive checker takes linear time, scanning all words in the dictionary
and calculating the edit distance for each. A BK tree is a very effective
data structure for building a dictionary of similar words so that fewer
words have to be compared.
 It can also be used to guess the intended word, such as "cat" when we
typed "cta". It works on the words of the dictionary: the first word acts
as the root node, and subsequent words are attached according to their
Levenshtein distance.
 BK trees are also used in string-matching applications and in software
where auto-correcting words is a required feature. They have wide
application in search engines for many websites, correcting spelling for
inexperienced users.
 They can serve as a basis for future search engines and correction software.
APPLICATION


 http://blog.not.net/2007/4/Damn-Cool-Algorithms-BK-Trees
 https://dzone.com/algorithm-week-bk-trees-part-1
 http://www.geeksfgeeks.org/BK-tree-introduction/
 http://blog.mishkokyi.net/posts/2015/Jul/31/implementing-bk-tree
 https://nulwords.wordpress.com/2013/03/13/the-bk-tree-spell-checking/
REFERENCES
