Levenshtein Algorithm 1 PDF
Problem Statement
When we search for or spell a word, we may not know the exact spelling. In that case, we
fix the mistake by adding a letter, deleting a letter, or replacing a letter with a more
suitable one, relying on memory and cognitive skills. To perform this operation automatically
and efficiently in a program, we need well-defined logical steps that find a reasonable
solution for any pair of words. For example, when we enter a misspelled word in Google search,
Google suggests the closest matching word. We intend to build this behavior using
Levenshtein distance, which is used in applications such as spell checkers, correction systems,
speech recognition, spam filtering, and plagiarism detection. The efficiency of this operation
is important because users trigger about 70,000 search requests per second.
Therefore, we need an algorithm that performs well under these conditions.
DNA comparison needs the same kind of algorithm to detect differences
between two DNA sequences. The main objective, therefore, is to detect the differences
between two words and to determine which operations must be applied to the target word
to make the two words identical.
Algorithm Description
The Levenshtein algorithm, also called Levenshtein edit distance, measures the number of
differences between two words; this number is called the distance. The distance is also used
as a parameter to decide how much difference can be tolerated. The Levenshtein distance
between two words is the smallest number of single-character modifications (insertions,
deletions, or substitutions) required to transform one word into the other. It is named after
Vladimir Levenshtein, a Soviet mathematician who studied this distance in 1965.
Dynamic programming solves a problem by first solving smaller, similar subproblems and then
reusing their results for the particular problem at hand. Therefore, in this algorithm, we
divide the problem into two steps: first we map out the differences, and then we derive a
solution from that map.
The Levenshtein process consists of two parts. The first part forms the matrix by
cross-checking the letters of the two words and assigning a value to each cell according to
the logic of the algorithm. The second part backtracks through the matrix to report which
operations must be applied to fix the word optimally.
Figure 1: Levenshtein Matrix
In the first step, as we can see from Figure 1, we create a matrix in which one word is placed
along the rows and the other along the columns, whether or not the two words have the same
number of letters. First, we assign initial values to the letters of each word in order from
beginning to end, incrementing the value by one after every letter. The cells next to the
letters of both words are therefore filled in ascending order. As we can see from Figure 1,
the word “RELEVANT” has values from 1 to 8 next to its letters because it has 8 letters. The
same holds for the other word, “ELEPHANT”, shown in the figure. In addition, there is one
extra cell with the value 0 in the top-left corner, because no letter corresponds to that row
or column. The comparison starts at the beginning of the two words and continues to the end
of the matrix, the last element on the diagonal. In this manner, the comparison proceeds cell
by cell, using the row letter and the column letter that correspond to each cell. At the
beginning, the trivial top-left cell with the value 0 is selected; there is no letter for this
cell. As we move to the following cells one by one, we need to apply a rule that determines
the correct value according to the algorithm.
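The initialization described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the figures; the function name `init_matrix` and the choice of which word goes on the rows are assumptions made for the example.

```python
def init_matrix(row_word: str, col_word: str) -> list[list[int]]:
    """Build the grid with one extra row and column and fill in the borders.

    Hypothetical helper: cell [0][0] is the extra 0 cell that has no letter,
    and the first row and first column count up by one per letter.
    """
    rows, cols = len(row_word) + 1, len(col_word) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        matrix[i][0] = i          # values next to the letters of row_word
    for j in range(cols):
        matrix[0][j] = j          # values next to the letters of col_word
    return matrix


grid = init_matrix("ELEPHANT", "RELEVANT")
print(grid[0])                    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print([row[0] for row in grid])   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

All inner cells are still 0 at this point; they are placeholders that the fill rule overwrites.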
Figure 2: Levenshtein Matrix
For every cell, if the compared letters are equal, we copy the value of the upper-left
diagonal neighbor directly into the current cell. Otherwise, if the compared letters differ,
we add one to each of the three neighboring values: left, top, and upper-left diagonal. The
smallest of those results is selected as the new value of the current cell. This rule is
applied to all empty cells in turn, from the first cell to the last, as shown in Figure 2.
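The fill rule above translates almost directly into code. The following is a minimal sketch under the assumption that the first word indexes the rows and the second word indexes the columns; the name `levenshtein_matrix` is hypothetical.

```python
def levenshtein_matrix(a: str, b: str) -> list[list[int]]:
    """Return the full (len(a)+1) x (len(b)+1) Levenshtein matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        matrix[i][0] = i
    for j in range(cols):
        matrix[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                # Equal letters: copy the upper-left diagonal value.
                matrix[i][j] = matrix[i - 1][j - 1]
            else:
                # Different letters: one more than the smallest neighbor.
                matrix[i][j] = 1 + min(
                    matrix[i][j - 1],      # left
                    matrix[i - 1][j],      # top
                    matrix[i - 1][j - 1],  # upper-left diagonal
                )
    return matrix


m = levenshtein_matrix("ELEPHANT", "RELEVANT")
print(m[-1][-1])  # 3, the bottom-right cell holds the distance
```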
The Levenshtein algorithm (also called edit distance) calculates the lowest number of editing
operations necessary to transform one string into another. The most common way to
calculate this is the dynamic programming approach. A matrix is initialized in which cell
(i, j) holds the Levenshtein distance between the first i characters of one word and the
first j characters of the other word. The matrix can be filled from the top-left to the
bottom-right corner. Each horizontal or vertical jump corresponds to an insertion or a
deletion; which of the two it is gets decided in the second step. The cost of each operation
is usually set to 1. The diagonal jump costs either one or zero, depending on whether the two
characters in the row and column match. Each cell always tries to minimize the cost locally,
as shown in the algorithm in Figure 3.
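Written as a recurrence, the local minimization reads as follows. The symbols are assumptions introduced for this sketch: D(i, j) is the value of the cell in row i and column j, and a_i and b_j are the i-th and j-th letters of the two words a and b.

```latex
D_{i,0} = i, \qquad D_{0,j} = j, \qquad
D_{i,j} = \min
\begin{cases}
  D_{i-1,j} + 1 \\
  D_{i,j-1} + 1 \\
  D_{i-1,j-1} + c_{i,j}
\end{cases}
\qquad \text{where} \qquad
c_{i,j} =
\begin{cases}
  0 & \text{if } a_i = b_j \\
  1 & \text{otherwise}
\end{cases}
```

The Levenshtein distance of the two words is the bottom-right value, D(|a|, |b|).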
The second step decides which operations must be executed to make the two strings
identical. For this step, the Levenshtein matrix must already be completed according to the
procedure explained above. We start from the last element of the matrix, which is the
bottom-right cell of the whole grid. For example, if the lengths of the words are n and m,
this first selected cell is the cell at position (m, n) in the matrix.
Figure 4: Operation decision logic
After this first cell, we check the three cells around the current cell, namely the upper,
left, and upper-left (diagonal) cells. The cell with the minimum of the three values is
selected as the target cell, and our cursor moves there. Before we move, we have to decide
which operation needs to be done. In addition, if replacement is selected as the optimal
operation, we need to decide which letters are exchanged. If the minimum value is in the
left cell, the operation is the deletion of the current letter being compared. If the minimum
value is in the upper cell, the operation needed is an insertion. Finally, if the diagonal
cell has the smallest value, a replacement operation is added to the list of required
operations. If the two letters for the current cell are exactly the same, the diagonal cell
holds the same value as the current cell, so we move diagonally and skip this point without
recording any operation.
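The decision logic above can be sketched as follows. The layout is an assumption chosen to match the left/upper convention in the text: the desired word runs along the rows and the typed word along the columns, so a move left drops a typed letter (deletion) and a move up adds a desired letter (insertion). The function name `edit_script` and the tie-breaking order (diagonal first, then left, then up) are also assumptions; other orders give different but equally short operation lists.

```python
def edit_script(typed: str, desired: str) -> list[str]:
    """List the operations that turn `typed` into `desired` (hypothetical helper)."""
    rows, cols = len(desired) + 1, len(typed) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            if desired[i - 1] == typed[j - 1]:
                m[i][j] = m[i - 1][j - 1]
            else:
                m[i][j] = 1 + min(m[i][j - 1], m[i - 1][j], m[i - 1][j - 1])

    ops = []
    i, j = rows - 1, cols - 1            # start at the bottom-right cell
    while i > 0 or j > 0:
        if i > 0 and j > 0 and desired[i - 1] == typed[j - 1]:
            i, j = i - 1, j - 1          # same letters: move diagonally, no operation
        elif i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + 1:
            ops.append(f"replace {typed[j - 1]} with {desired[i - 1]}")
            i, j = i - 1, j - 1          # diagonal: replacement
        elif j > 0 and m[i][j] == m[i][j - 1] + 1:
            ops.append(f"delete {typed[j - 1]}")
            j -= 1                       # left: deletion
        else:
            ops.append(f"insert {desired[i - 1]}")
            i -= 1                       # up: insertion
    ops.reverse()                        # collected back to front
    return ops


print(edit_script("cat", "cut"))  # ['replace a with u']
```

Every recorded step lowers the cell value by exactly one, so the length of the returned list equals the value in the bottom-right cell.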
Figure 5: Backtracking
As we can see from Figure 5, backtracking continues until the first element is reached.
While backtracking, an extra variable records how many operations have been performed; it is
incremented by 1 after every operation. Generally, the target word is placed along the rows
of the completed matrix. For deletion and insertion, the letter involved is chosen by the
row index of the desired word on that side.
Figure 8: Table
The result of this comparison is 5, with every operation counted as 1 across the full
matrix. The bottom-right element of this matrix equals the five operations we observed
previously.
Another version is iterative with two rows. If we only need the final distance value, we can
easily modify the implementation described above to avoid allocating the entire matrix. To
move forward, we need only two rows: the one we are currently updating and the previous one.
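A minimal sketch of the two-row variant, assuming only the final distance is needed. The function name `levenshtein_two_rows` is hypothetical.

```python
def levenshtein_two_rows(a: str, b: str) -> int:
    """Compute the distance while keeping only the previous and current rows."""
    previous = list(range(len(b) + 1))        # row 0 of the full matrix
    for i, ch_a in enumerate(a, start=1):
        current = [i] + [0] * len(b)          # first column holds the row number
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current[j] = previous[j - 1]
            else:
                current[j] = 1 + min(current[j - 1], previous[j], previous[j - 1])
        previous = current                    # the finished row becomes the previous one
    return previous[-1]


print(levenshtein_two_rows("elephant", "relevant"))  # 3
```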
This optimization makes it impossible to determine which edits were made. Hirschberg’s
algorithm solves this problem by combining dynamic programming with divide and conquer.
Furthermore, we may observe that to calculate the value at a specific row position we need
only three values: the one to the left, the one directly above, and the one diagonally above
and to the left.
Figure 10: Lev Distance approach
Thus, our function may be modified to allocate one row and two variables instead of two rows.
This modification reduces the memory requirements of the application even further.
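A minimal sketch of the single-row variant. The two extra variables, here called `diagonal` and `above`, are hypothetical names; they hold the values from the previous row that would otherwise be lost when the row is overwritten in place.

```python
def levenshtein_one_row(a: str, b: str) -> int:
    """Compute the distance with a single row plus two scalar variables."""
    row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        diagonal = row[0]          # cell (i-1, 0), the diagonal for j = 1
        row[0] = i
        for j, ch_b in enumerate(b, start=1):
            above = row[j]         # cell (i-1, j), saved before it is overwritten
            if ch_a == ch_b:
                row[j] = diagonal
            else:
                row[j] = 1 + min(row[j - 1], above, diagonal)
            diagonal = above       # becomes the diagonal for the next column
    return row[-1]


print(levenshtein_one_row("elephant", "relevant"))  # 3
```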
In terms of complexity, the lengths of the two words are the main parameters. The time
complexity of all the iterative algorithms presented above is O(|a| × |b|). The space
complexity of the full-matrix implementation is also O(|a| × |b|), which usually makes it
impractical for long strings. Both the two-row and the single-row implementations provide
linear space complexity, O(max(|a|, |b|)). Swapping source and target so that the row runs
along the shorter word reduces it further to O(min(|a|, |b|)). It has been shown that the
Levenshtein distance cannot be calculated in strongly subquadratic time unless the strong
exponential time hypothesis is false. Fortunately, this bound describes only the worst case.
When an upper bound on the distance is known, several refinements can be combined. Suppose we
have long strings and only need to compare fairly similar ones, such as misspelled names. A
complete Levenshtein computation would have to traverse the full matrix in this scenario,
including the high values in the top-right and bottom-left corners that we do not need. This
suggests introducing a threshold, with all distances above a certain boundary simply being
reported as out of range. As a result, for a bounded distance we only need to compute the
values in the diagonal stripe of width 2K + 1, where K is the distance threshold. In other
words, if the Levenshtein distance exceeds the boundary, the implementation reports a failure
instead of a distance.
This method gives a time complexity of O(K × min(|a|, |b|)), which allows us to compare
long but similar strings in a reasonable amount of time.
We can also skip the calculation entirely if the length difference between the strings
exceeds the threshold we set, because the distance is at least that length difference.
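The threshold idea can be sketched as follows. The function name `bounded_levenshtein` and the convention of returning `None` for "out of range" are assumptions made for this example. For readability the sketch allocates a full-length row on every iteration, so only the inner loop is restricted to the stripe; an implementation that wants the full speed-up would reuse the rows and reset only the cells at the stripe edges.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, k: int) -> Optional[int]:
    """Return the distance if it is at most k, otherwise None (hypothetical helper)."""
    if abs(len(a) - len(b)) > k:
        return None                     # distance is at least the length difference
    if len(a) < len(b):
        a, b = b, a                     # keep the row along the shorter word
    big = k + 1                         # stands for "more than k"
    previous = [j if j <= k else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [big] * (len(b) + 1)  # cells outside the stripe stay at `big`
        if i <= k:
            current[0] = i
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):     # only the stripe of width 2k + 1
            if a[i - 1] == b[j - 1]:
                cost = previous[j - 1]
            else:
                cost = 1 + min(current[j - 1], previous[j], previous[j - 1])
            current[j] = min(cost, big)
        if min(current) > k:
            return None                 # every path already exceeds the threshold
        previous = current
    return previous[-1] if previous[-1] <= k else None


print(bounded_levenshtein("elephant", "relevant", 3))  # 3
print(bounded_levenshtein("elephant", "relevant", 2))  # None
```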
PROGRAM RESULT
We entered the words elephant and relevant. The program ran and gave the correct output, an
edit distance of 3, whenever the K value is 3 or more, where K is the maximum number of
allowed changes.
[2] https://www.geeksforgeeks.org/backtracking-introduction/
[3] https://www.baeldung.com/cs/levenshtein-distance-computation
[4] https://dev.to/trekhleb/dynamic-programming-vs-divide-and-conquer-218i
[5] https://www.researchgate.net/figure/An-example-of-Algorithm-2-for-input-string-T-CATGACTG-and-pattern-P-TACTG_fig5_320319792
[6] https://en.wikipedia.org/wiki/Levenshtein_distance
[7] https://dl.acm.org/cms/attachment/1b5b1be7-69a4-4d4d-ba1b-664cd797c9ce/www19-313-fig3.jpg
[8] https://www.techiedelight.com/levenshtein-distance-edit-distance-problem/
[10] https://bdebo.medium.com/edit-distance-643a4bcfaa09
[11] https://afteracademy.com/blog/edit-distance-problem