Module 1 Jacard Distance and Editdistance

Jaccard Distance and Edit Distance measure how dissimilar two strings or sets are. Jaccard Distance is the difference between the sizes of the union and the intersection of two sets, divided by the size of the union. Edit Distance is the minimum number of operations (insertions, deletions, substitutions) required to change one string into another. Both can be used for spelling correction or plagiarism detection by finding the strings with the shortest distance. However, Jaccard Distance treats reordered strings as identical, while Edit Distance is sensitive to order.

Uploaded by

Dannapurna D

Jaccard Distance

• We get the Jaccard distance by subtracting the Jaccard coefficient (the Jaccard
similarity) from 1.

• We can also get it by dividing the difference between the sizes of the union and
the intersection of two sets by the size of the union.

• Jaccard Distance is given by the following formula:

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$

Jaccard Similarity is defined as the size of the intersection of two documents divided
by the size of their union, that is, the number of common words over the total number
of unique words. Here, we use the set of words in each document to find the
intersection and the union.

The mathematical representation of the Jaccard Similarity is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard Similarity score is in a range of 0 to 1. If the two documents are identical,
Jaccard Similarity is 1. The Jaccard similarity score is 0 if there are no common words
between two documents.
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

Let’s get the set of unique words for each document.

words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}

words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
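
As a minimal sketch, the score for these two documents can be worked out with plain Python sets. The helper name jaccard_similarity and the lowercase, whitespace-based tokenization are assumptions made for illustration; they are not part of the original slides.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (an assumption)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

# Lowercase and split on whitespace to get the unique words of each document.
words_doc1 = set(doc_1.lower().split())
words_doc2 = set(doc_2.lower().split())

similarity = jaccard_similarity(words_doc1, words_doc2)
print(sorted(words_doc1 & words_doc2))  # ['data', 'is', 'new', 'oil']
print(round(similarity, 3))             # 0.444 (4 shared words out of 9)
print(round(1 - similarity, 3))         # 0.556, the Jaccard distance
```

The two documents share 4 of the 9 unique words in their union, so the similarity is 4/9 and the distance is 5/9.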

Example 1:

import nltk
w1 = set('mapping')
w2 = set('mappings')

nltk.jaccard_distance(w1, w2)

Unlike Edit Distance, you cannot just run Jaccard Distance on the strings directly;
you must first convert them to the set type.
Example #2

Basic Spelling Checker: It is the same example we had with the Edit Distance
algorithm; now we are testing it with the Jaccard Distance algorithm. Let’s assume you
have a mistaken word and a list of possible words and you want to know the nearest
suggestion.

import nltk
mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

for word in words:
    # Compare the sets of characters of the two words.
    jd = nltk.jaccard_distance(set(mistake), set(word))
    print(word, jd)
Again, comparing the mistaken word “ligting” to each word in our list, the lowest
Jaccard Distance is about 0.167 (1/6), shared by “listing” and “lighting”, which
makes them the best spelling suggestions for “ligting”.
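
The selection step can be sketched without NLTK as follows. The function jaccard_distance below is a hypothetical pure-Python stand-in written for this illustration, not the NLTK implementation; it applies the distance formula to the character sets of the two words.

```python
def jaccard_distance(a: set, b: set) -> float:
    """(|A ∪ B| - |A ∩ B|) / |A ∪ B|; hypothetical pure-Python helper."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)


mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

# Score every candidate by the distance between the two character sets.
scores = {word: jaccard_distance(set(mistake), set(word)) for word in words}

# Keep every candidate that ties for the smallest distance.
best = min(scores.values())
suggestions = [word for word in words if scores[word] == best]
print(suggestions, round(best, 3))  # ['listing', 'lighting'] 0.167
```

Both suggestions differ from the character set of “ligting” by a single letter ('s' or 'h'), which gives 1/6 in each case.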

Example #3

If you are wondering if there is a difference between the output of Edit Distance
and Jaccard Distance, see this example.
import nltk
sent1 = set("It might help to re-install Python if possible.")
sent2 = set("It can help to install Python again if possible.")
sent3 = set("It can be so helpful to reinstall C++ if possible.")
sent4 = set("help It possible Python to re-install if might.")
sent5 = set("I love Python programming.")

jd_sent_1_2 = nltk.jaccard_distance(sent1, sent2)

jd_sent_1_3 = nltk.jaccard_distance(sent1, sent3)
jd_sent_1_4 = nltk.jaccard_distance(sent1, sent4)
jd_sent_1_5 = nltk.jaccard_distance(sent1, sent5)

print(jd_sent_1_2, 'Jaccard Distance between sent1 and sent2')

print(jd_sent_1_3, 'Jaccard Distance between sent1 and sent3')
print(jd_sent_1_4, 'Jaccard Distance between sent1 and sent4')
print(jd_sent_1_5, 'Jaccard Distance between sent1 and sent5')
Apart from the reordered sent4, sent2 is again the sentence closest to sent1, just as
it was with Edit Distance.

However, look at the other results; they are completely different.

The most obvious difference is that the Edit Distance between sent1 and sent4 is 32
while the Jaccard Distance is zero, which means the Jaccard Distance algorithm sees
them as identical sentences. Edit Distance counts edit operations from the start to the
end of the string, whereas Jaccard Distance, as used here, only compares the sets of
unique characters in the two strings, ignoring order and repetition, and then applies
the formula given above.

Actually, there is no “right” or “wrong” answer; it all depends on what you really need
to do.
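
A small sketch of why sent1 and sent4 come out as identical under Jaccard Distance. The helpers jaccard_distance and tokens are hypothetical names introduced for this illustration; stripping the trailing period from each word is likewise an assumption, made so that "possible." and "possible" count as the same word.

```python
def jaccard_distance(a: set, b: set) -> float:
    """(|A ∪ B| - |A ∩ B|) / |A ∪ B|; hypothetical pure-Python helper."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0


def tokens(sentence: str) -> set:
    """Unique words of a sentence, with trailing periods removed."""
    return {word.strip(".") for word in sentence.split()}


sent1 = "It might help to re-install Python if possible."
sent4 = "help It possible Python to re-install if might."

# Character sets: order and repetition are discarded, so both sets are equal.
char_distance = jaccard_distance(set(sent1), set(sent4))

# Word sets: the same words in a different order also give identical sets.
word_distance = jaccard_distance(tokens(sent1), tokens(sent4))

print(char_distance, word_distance)  # 0.0 0.0
```

Any set-based comparison behaves this way, so Jaccard Distance is a poor fit when word order matters.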
Edit distance method
Edit Distance measures dissimilarity between two strings by finding the minimum
number of operations needed to transform one string into the other.
The transformations that can be performed are:
• Inserting a new character.
  • bat -> bats (insertion of 's')
• Deleting an existing character.
  • care -> car (deletion of 'e')
• Substituting an existing character.
  • bin -> bit (substitution of 'n' with 't')
• Transposing two existing consecutive characters. This counts as a single operation
  only in the Damerau-Levenshtein variant; nltk.edit_distance applies it when called
  with transpositions=True.
  • sing -> sign (transposition of 'ng' to 'gn')
Edit Distance (a.k.a. Levenshtein Distance) measures how different two strings are;
the two strings are referred to as the source string and the target string.

The distance between the source string and the target string is the minimum number
of edit operations (deletions, insertions, or substitutions) required to transform the
source into the target. The lower the distance, the more similar the two strings.
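
The definition above can be sketched as a dynamic-programming table. This is an illustrative implementation written for these notes, not the NLTK source; the function name and the transpositions flag are chosen to mirror nltk.edit_distance, and the transposition branch implements the restricted (optimal string alignment) form of Damerau-Levenshtein.

```python
def edit_distance(source: str, target: str, transpositions: bool = False) -> int:
    """Minimum number of single-character edits turning source into target.

    With transpositions=True, swapping two adjacent characters also counts
    as one edit.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # swap of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("bat", "bats"))                        # 1
print(edit_distance("care", "car"))                        # 1
print(edit_distance("bin", "bit"))                         # 1
print(edit_distance("sing", "sign"))                       # 2
print(edit_distance("sing", "sign", transpositions=True))  # 1
```

Without the transposition operation, "sing" -> "sign" costs two substitutions; with it, a single swap is enough.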

Among the common applications of the Edit Distance algorithm are: spell checking,
plagiarism detection, and translation memory systems.
Edit Distance Python NLTK

Example 1:

import nltk
w1 = 'mapping'
w2 = 'mappings'

nltk.edit_distance(w1, w2)
Example #2

Basic Spelling Checker: Let’s assume you have a mistaken word and a list of possible
words and you want to know the nearest suggestion.
import nltk
mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange',
         'walking', 'zoo']

for word in words:
    # Count the edits needed to turn the mistaken word into each candidate.
    ed = nltk.edit_distance(mistake, word)
    print(word, ed)
As you can see, comparing the mistaken word “ligting” to each word in our list, the
lowest Edit Distance is 1, for the words “listing” and “lighting”, which makes them
the best spelling suggestions for “ligting”: the smaller the Edit Distance between
two strings, the more similar they are.
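
Picking the suggestions automatically can be sketched as below. The levenshtein function is a hypothetical two-row pure-Python helper used here in place of nltk.edit_distance so that the example runs without NLTK installed.

```python
def levenshtein(source: str, target: str) -> int:
    """Edit distance with insertions, deletions and substitutions (cost 1 each)."""
    previous = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (s_char != t_char),   # substitution or match
            ))
        previous = current
    return previous[-1]


mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

# Score every candidate, then keep all words tied for the lowest distance.
scores = {word: levenshtein(mistake, word) for word in words}
best = min(scores.values())
suggestions = [word for word in words if scores[word] == best]
print(suggestions, best)  # ['listing', 'lighting'] 1
```

“listing” needs one substitution ('g' to 's') and “lighting” needs one insertion ('h'), so both tie at distance 1.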
Example #3

Sentence or paragraph comparison is useful in applications like plagiarism detection
(to know if one article is a stolen version of another article), and translation memory
systems (that save previously translated sentences and when there is a new untranslated
sentence, the system retrieves a similar one that can be slightly edited by a human
translator instead of translating the new sentence from scratch).
import nltk
sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
# sent4 has the same words as sent1 in a different order.
sent4 = "help It possible Python to re-install if might."
sent5 = "I love Python programming."

ed_sent_1_2 = nltk.edit_distance(sent1, sent2)

ed_sent_1_3 = nltk.edit_distance(sent1, sent3)
ed_sent_1_4 = nltk.edit_distance(sent1, sent4)
ed_sent_1_5 = nltk.edit_distance(sent1, sent5)

print(ed_sent_1_2, 'Edit Distance between sent1 and sent2')

print(ed_sent_1_3, 'Edit Distance between sent1 and sent3')
print(ed_sent_1_4, 'Edit Distance between sent1 and sent4')
print(ed_sent_1_5, 'Edit Distance between sent1 and sent5')
So it is clear that sent1 and sent2 are more similar to each other than other sentence
pairs.
