Module 1: Jaccard Distance and Edit Distance
The Jaccard Similarity score ranges from 0 to 1. If the two documents are identical,
the Jaccard Similarity is 1. The Jaccard Similarity is 0 if the two documents have no
words in common.
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"
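As a worked check on the two documents above, here is a minimal sketch that treats each document as a set of lowercase words and divides the size of the intersection by the size of the union. The helper name jaccard_similarity and the lowercase, whitespace-based tokenization are assumptions made for illustration; they are not part of NLTK.

```python
def jaccard_similarity(text_a, text_b):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 1.0  # two empty documents are treated as identical
    return len(words_a & words_b) / len(union)


doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

# 4 shared words (data, is, new, oil) out of 9 distinct words overall
print(jaccard_similarity(doc_1, doc_2))
```

The two documents share four of their nine distinct words, so the score is 4/9, roughly 0.44.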
import nltk
w1 = set('mapping')
w2 = set('mappings')
print(nltk.jaccard_distance(w1, w2))  # 1/7, about 0.143: the sets differ only by 's'
Unlike Edit Distance, you cannot just run Jaccard Distance on the strings directly;
you must first convert them to the set type. Calling set() on a string gives the set
of its distinct characters.
Example #2
Basic Spelling Checker: It is the same example we had with the Edit Distance
algorithm; now we are testing it with the Jaccard Distance algorithm. Let’s assume you
have a mistaken word and a list of possible words and you want to know the nearest
suggestion.
import nltk
mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']
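The listing above stops after defining the inputs. One possible way to finish the checker is sketched below, assuming candidates are ranked by the Jaccard Distance between character sets. The helpers jaccard_distance and rank_suggestions are hypothetical names written in plain Python; jaccard_distance follows the same (union - intersection) / union formula as nltk.jaccard_distance.

```python
def jaccard_distance(set_a, set_b):
    """Jaccard distance: (|A ∪ B| - |A ∩ B|) / |A ∪ B|."""
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return (len(union) - len(set_a & set_b)) / len(union)


def rank_suggestions(mistake, words):
    """Return (distance, word) pairs sorted from nearest to farthest."""
    scored = [(jaccard_distance(set(mistake), set(word)), word) for word in words]
    return sorted(scored)  # ties are broken alphabetically by word


mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

for distance, word in rank_suggestions(mistake, words):
    print(f"{word:10s} {distance:.3f}")
```

On this list, 'lighting' and 'listing' tie for the smallest distance: each contains every distinct letter of 'ligting' plus exactly one extra letter, so a character-set comparison cannot separate them.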
Example #3
If you are wondering if there is a difference between the output of Edit Distance
and Jaccard Distance, see this example.
import nltk
sent1 = set("It might help to re-install Python if possible.")
sent2 = set("It can help to install Python again if possible.")
sent3 = set("It can be so helpful to reinstall C++ if possible.")
sent4 = set("help It possible Python to re-install if might.")
sent5 = set("I love Python programming.")
The most obvious difference is that the Edit Distance between sent1 and sent4 is 32 while
the Jaccard Distance is zero, which means the Jaccard Distance algorithm sees them as
identical sentences. Edit Distance counts edit operations from the start to the end of
the string, so the order of the words matters, while Jaccard Distance only counts the
distinct characters that the two strings share and those they do not, and then applies
the calculation mentioned above.
Actually, there is no “right” or “wrong” answer; it all depends on what you really need
to do.
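To make the contrast between sent1 and sent4 concrete, the sketch below computes both measures on those two sentences. It uses small plain-Python helpers (the hypothetical names levenshtein and char_jaccard_distance) so that it runs without extra packages; nltk.edit_distance and nltk.jaccard_distance are the library counterparts.

```python
def levenshtein(source, target):
    """Edit distance with insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def char_jaccard_distance(text_a, text_b):
    """Jaccard distance between the sets of distinct characters."""
    set_a, set_b = set(text_a), set(text_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return (len(union) - len(set_a & set_b)) / len(union)


sent1 = "It might help to re-install Python if possible."
sent4 = "help It possible Python to re-install if might."

# Same characters in a different order: the set-based measure sees no difference.
print("Jaccard distance:", char_jaccard_distance(sent1, sent4))  # 0.0
# The order-sensitive measure still needs many edits.
print("Edit distance:", levenshtein(sent1, sent4))
```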
Edit distance method
Edit Distance measures dissimilarity between two strings by finding the minimum
number of operations needed to transform one string into the other.
The transformations that can be performed are:
• Inserting a new character.
• bat -> bats (insertion of 's')
• Deleting an existing character.
• care -> car (deletion of 'e')
• Substituting an existing character.
• bin -> bit (substitution of 'n' with 't')
• Transposing two existing consecutive characters (an optional operation; plain Levenshtein distance does not count it as a single edit).
• sing -> sign (transposition of 'ng' to 'gn')
Edit Distance (a.k.a. Levenshtein Distance) is a measure of dissimilarity between two
strings, referred to as the source string and the target string.
The distance between the source string and the target string is the minimum number
of edit operations (deletions, insertions, or substitutions) required to transform the
source into the target. The lower the distance, the more similar the two strings.
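The definition above can be sketched as a small dynamic-programming table. This is an illustrative implementation, not NLTK's own code: the function name and the restricted handling of adjacent transpositions are assumptions, although nltk.edit_distance does accept a transpositions flag with a similar purpose.

```python
def edit_distance(source, target, transpositions=False):
    """Minimum number of single-character edits turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Optional: count a swap of two adjacent characters as one edit.
            if (transpositions and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("bat", "bats"))    # 1 (insertion)
print(edit_distance("care", "car"))    # 1 (deletion)
print(edit_distance("bin", "bit"))     # 1 (substitution)
print(edit_distance("sing", "sign"))   # 2 (two substitutions)
print(edit_distance("sing", "sign", transpositions=True))  # 1 (one swap)
```

Without the transpositions flag, sing -> sign costs two edits, which is why the operation list and the Levenshtein definition can give different answers for swapped letters.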
Among the common applications of the Edit Distance algorithm are: spell checking,
plagiarism detection, and translation memory systems.
Edit Distance Python NLTK
Example #1
import nltk
w1 = 'mapping'
w2 = 'mappings'
print(nltk.edit_distance(w1, w2))  # 1: a single insertion of 's'
Example #2
Basic Spelling Checker: Let’s assume you have a mistaken word and a list of possible
words and you want to know the nearest suggestion.
import nltk
mistake = "ligting"