0% found this document useful (0 votes)
49 views

Module 1 Jacard Distance and Editdistance

Jaccard Distance and Edit Distance are measures of similarity between strings or sets. Jaccard Distance is calculated as the difference between the size of the intersection and union of two sets, divided by the size of the union. Edit Distance finds the minimum number of operations (insertions, deletions, substitutions) required to change one string to another. Both can be used for spelling correction or plagiarism detection by finding the strings with the shortest distance. However, Jaccard Distance treats reordered strings as identical while Edit Distance is sensitive to word order.

Uploaded by

Dannapurna D
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views

Module 1 Jacard Distance and Editdistance

Jaccard Distance and Edit Distance are measures of similarity between strings or sets. Jaccard Distance is calculated as the difference between the size of the intersection and union of two sets, divided by the size of the union. Edit Distance finds the minimum number of operations (insertions, deletions, substitutions) required to change one string to another. Both can be used for spelling correction or plagiarism detection by finding the strings with the shortest distance. However, Jaccard Distance treats reordered strings as identical while Edit Distance is sensitive to word order.

Uploaded by

Dannapurna D
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 16

Jaccard Distance

• We get Jaccard distance by subtracting the Jaccard coefficient from 1.


• We can also get it by dividing the difference between the sizes of the union and
the intersection of two sets by the size of the union.
• Jaccard Distance is given by the following formula
Jaccard Similarity defined as an intersection of two documents divided by the union of
that two documents that refer to the number of common words over a total number of
words. Here, we will use the set of words to find the intersection and union of the
document.

The mathematical representation of the Jaccard Similarity is:

The Jaccard Similarity score is in a range of 0 to 1. If the two documents are identical,
Jaccard Similarity is 1. The Jaccard similarity score is 0 if there are no common words
between two documents.
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

Let’s get the set of unique words for each document.

words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy’}

words_doc2 = {'data', 'is', 'a', 'new', 'oil'}


Example 1:

import nltk
w1 = set('mapping')
w2 = set('mappings')

nltk.jaccard_distance(w1, w2)

Unlike Edit Distance, you cannot just run Jaccard Distance on the strings directly;
you must first convert them to the set type.
Example #2

Basic Spelling Checker: It is the same example we had with the Edit Distance
algorithm; now we are testing it with the Jaccard Distance algorithm. Let’s assume you
have a mistaken word and a list of possible words and you want to know the nearest
suggestion.

Import nltk
mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

for word in words:


jd = nltk.jaccard_distance(set(mistake), set(word))
print(word, jd)
Again, comparing the mistaken word “ligting” to each word in our list, the least
Jaccard Distance is 0.166 for words: “listing” and “lighting” which means they are
the best spelling suggestions for “ligting” because they have the lowest distance.

Example #3

If you are wondering if there is a difference between the output of Edit Distance
and Jaccard Distance, see this example.
Import nltk
sent1 = set("It might help to re-install Python if possible.")
sent2 = set("It can help to install Python again if possible.")
sent3 = set("It can be so helpful to reinstall C++ if possible.")
sent4 = set("help It possible Python to re-install if might.")
sent5 = set("I love Python programming.")

jd_sent_1_2 = nltk.jaccard_distance(sent1, sent2)


jd_sent_1_3 = nltk.jaccard_distance(sent1, sent3)
jd_sent_1_4 = nltk.jaccard_distance(sent1, sent4)
jd_sent_1_5 = nltk.jaccard_distance(sent1, sent5)

print(jd_sent_1_2, 'Jaccard Distance between sent1 and sent2')


print(jd_sent_1_3, 'Jaccard Distance between sent1 and sent3')
print(jd_sent_1_4, 'Jaccard Distance between sent1 and sent4')
print(jd_sent_1_5, 'Jaccard Distance between sent1 and sent5')
Just like when we applied Edit Distance, sent1 and sent2 are the most similar sentences.

However, look to the other results; they are completely different.

The most obvious difference is that the Edit Distance between sent1 and sent4 is 32 and
the Jaccard Distance is zero, which means the Jaccard Distance algorithms sees them as
identical sentence because Edit Distance depends on counting edit operations from the
start to end of the string while Jaccard Distance just counts the number characters and
then apply some calculations on this number as mentioned above.

Actually, there is no “right” or “wrong” answer; it all depends on what you really need
to do.
Edit distance method
Edit Distance measures dissimilarity between two strings by finding the minimum
number of operations needed to transform one string into the other.
The transformations that can be performed are:
• inserting a new character:
• bat -> bats (insertion of 's')
• Deleting an existing character.
• care -> car (deletion of 'e')
• Substituting an existing character.
• bin -> bit (substitution of n with t)
• Transposition of two existing consecutive characters.
• sing -> sign (transposition of ng to gn)
Edit Distance (a.k.a. Levenshtein Distance) is a measure of similarity between two
strings referred to as the source string and the target string.

The distance between the source string and the target string is the minimum number
of edit operations (deletions, insertions, or substitutions) required to transform the
source into the target. The lower the distance, the more similar the two strings.

Among the common applications of the Edit Distance algorithm are: spell checking,
plagiarism detection, and translation memory systems.
Edit Distance Python NLTK

Example 1:

import nltk
w1 = 'mapping'
w2 = 'mappings'

nltk.edit_distance(w1, w2)
Example #2

Basic Spelling Checker: Let’s assume you have a mistaken word and a list of possible
words and you want to know the nearest suggestion.
Import nltk
mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange',


'walking', 'zoo']

for word in words:


ed = nltk.edit_distance(mistake, word)
print(word, ed)
As you can see, comparing the mistaken word “ligting” to each word in our list, the
least Edit Distance is 1 for words: “listing” and “lighting” which means they are the
best spelling suggestions for “ligting”. Yes, a smaller Edit Distance between two
strings means they are more similar than others.
Example #3

Sentence or paragraph comparison is useful in applications like plagiarism detection


(to know if one article is a stolen version of another article), and translation memory
systems (that save previously translated sentences and when there is a new untranslated
sentence, the system retrieves a similar one that can be slightly edited by a human
translator instead of translating the new sentence from scratch).
import nltk
sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # This has the same words as sent1
with a different order.
sent5 = "I love Python programming."

ed_sent_1_2 = nltk.edit_distance(sent1, sent2)


ed_sent_1_3 = nltk.edit_distance(sent1, sent3)
ed_sent_1_4 = nltk.edit_distance(sent1, sent4)
ed_sent_1_5 = nltk.edit_distance(sent1, sent5)

print(ed_sent_1_2, 'Edit Distance between sent1 and sent2')


print(ed_sent_1_3, 'Edit Distance between sent1 and sent3')
print(ed_sent_1_4, 'Edit Distance between sent1 and sent4')
print(ed_sent_1_5, 'Edit Distance between sent1 and sent5')
So it is clear that sent1 and sent2 are more similar to each other than other sentence
pairs.

You might also like