Similarity Distances for Natural Language Processing
Flavien Vidal · 8 min read · Apr 29, 2021


source: _dChris via flickr (CC BY 2.0)

Importance of having good text representation:


Feature engineering and selection is surely one of the most crucial steps of any
machine learning project. No matter which algorithm you choose, if the features
you give it are bad, the results you get will be bad too. This is well
summarized by the expression: “garbage in, garbage out”. Feature engineering is
only optimal if we have a good knowledge of the problem, and it strongly depends on
the data in question. However, some recurrent techniques are widely used in
practice and can be helpful in many different problems. In this article, we focus on
the case of textual data and address the question: how to handle feature engineering
with textual data?

Vector Space Model


In order for a computer to work with text data, the text must be converted into some
mathematical form. Therefore, we usually represent text units (characters, words,
sentences, paragraphs, documents, …) with vectors.
The idea is to represent these text units as vectors of numbers that enable many
operations such as information retrieval, document clustering, language translation
or text summarization. Thus, the vector space model we choose for a given problem
should enable us to easily compare documents by measuring the distance between
their corresponding vectors, that is to say their similarity. There are multiple ways to
compute vectors that capture the structure or semantics of text units.
Before diving into the different methods of text representation, it is necessary
to start by discussing some techniques for measuring similarity between text units.

Quantifying similarity between text units:


There are a wide variety of similarity metrics that can be used for quantifying the
similarity between text units, but they do not all share the same definition of
what “similar” means. Depending on the problem we face and the type of
resemblance we want to capture between units, we may choose one metric over
another.

Some of the most common ways to capture similarity between text units are:
- Longest Common Substring (LCS),
- Levenshtein Edit Distance,
- Hamming Distance,
- Cosine Similarity,
- Jaccard Distance,
- Euclidean Distance.

Longest Common Substring (LCS):

The LCS is a common example of a character-based similarity measure. Given two
strings, s1 of length n1 and s2 of length n2, it simply considers the length of the
longest string (or strings) that is a substring of both s1 and s2.

Applications: data deduplication and plagiarism detection.

def LCS(s1, s2):
    """Length of the longest common substring of s1 and s2 (dynamic programming)."""
    n1, n2, res = len(s1), len(s2), 0
    # LCSubstring[i][j] = length of the longest common suffix of s1[:i] and s2[:j]
    LCSubstring = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                LCSubstring[i][j] = LCSubstring[i - 1][j - 1] + 1
                res = max(res, LCSubstring[i][j])
    return res



A proposed implementation of the Longest Common Substring

Example: the LCS of the strings ‘Lydia’ and ‘Lydoko’ is 3 (the common substring ‘Lyd’).

Levenshtein Edit Distance (V. Levenshtein, 1965):

The Levenshtein distance is another common example of a character-based similarity
measure. It quantifies how dissimilar two text units are by computing the
minimum number of single-character edits (replacement, deletion
and insertion operations) required to convert one text unit into the other.
Mathematically, it can be written as:

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min\bigl(\mathrm{lev}_{a,b}(i-1, j) + 1,\ \mathrm{lev}_{a,b}(i, j-1) + 1,\ \mathrm{lev}_{a,b}(i-1, j-1) + \mathbf{1}_{a_i \neq b_j}\bigr) & \text{otherwise} \end{cases}$$

Properties:
The Levenshtein distance is a true metric: it is non-negative, zero exactly when the
two strings are equal, symmetric, and it satisfies the triangle inequality. It is
bounded below by the difference of the lengths of the two strings and above by the
length of the longer one.

Applications:
- in approximate string matching where the objective is to find matches for short
strings in longer texts (spelling correction, OCR correction systems, …),
- in bioinformatics to quantify the similarity of DNA sequences (which can be
viewed as strings of letters A, C, G and T),
- and it can also be used in music to measure the similarity of melodies.

def LevenshteinEditDistance(s1, s2, n1, n2):
    # Base cases: if one string is empty, the distance is the length of the other.
    if n1 == 0: return n2
    if n2 == 0: return n1
    # Last characters match: no edit needed for them.
    if s1[n1-1] == s2[n2-1]:
        return LevenshteinEditDistance(s1, s2, n1-1, n2-1)
    # Otherwise: one edit plus the cheapest of insert / delete / replace.
    return 1 + min(LevenshteinEditDistance(s1, s2, n1, n2-1),    # insert
                   LevenshteinEditDistance(s1, s2, n1-1, n2),    # delete
                   LevenshteinEditDistance(s1, s2, n1-1, n2-1))  # replace


A proposed implementation of the Levenshtein distance

Example: changing ‘hubert’ into ‘uber’ needs only 2 operations (delete ‘h’ and
delete ‘t’), thus lev(‘hubert’, ‘uber’) = 2.
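Note that the recursion above recomputes the same subproblems over and over, so it runs in exponential time. As a minimal sketch (the function name levenshtein_dp is my own, not from the original gist), the classic dynamic-programming formulation computes the same distance in O(n1 × n2) time:

def levenshtein_dp(s1, s2):
    n1, n2 = len(s1), len(s2)
    # dist[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dist = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        dist[i][0] = i  # delete all i characters of s1
    for j in range(n2 + 1):
        dist[0][j] = j  # insert all j characters of s2
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j] + 1,         # delete
                             dist[i - 1][j - 1] + cost)  # replace (or match)
    return dist[n1][n2]

print(levenshtein_dp("hubert", "uber"))  # 2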

NB: different definitions of edit distance use different sets of operations. The
longest common subsequence distance allows only insertion and deletion, not
substitution. The Hamming distance allows only substitution and therefore only
applies to strings of the same length. The Damerau–Levenshtein distance allows
insertion, deletion, replacement and transposition of two adjacent characters. The
Jaro distance only allows transposition.

Hamming Distance (R. Hamming, 1950):

Another character-based similarity measure is the Hamming distance. The Hamming
distance between two strings of equal length measures the minimum number of
replacements required to change one string into the other, i.e. the number of
positions at which the corresponding symbols differ. Mathematically, this can
be written as follows:

$$d_H(s_1, s_2) = \sum_{i=1}^{n} \mathbf{1}_{s_1[i] \neq s_2[i]}$$

Properties:
On strings of a fixed length n, the Hamming distance is a metric: it is
non-negative, bounded above by n, symmetric, and it satisfies the triangle
inequality.

Application: mainly used in coding theory for error detection and correction (note
that the following representation also exhibits the fact that the Hamming distance
between binary strings is equivalent to the Manhattan distance between vertices of
the hypercube).

3-bit binary cube (left) and 4-bit tesseract (right) for finding Hamming distances

def HammingDistance(s1, s2):
    # The Hamming distance is only defined for strings of equal length.
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    res = 0
    # Count the positions at which the corresponding characters differ.
    for c1, c2 in zip(s1, s2):
        if c1 != c2: res += 1
    return res

A proposed implementation of the Hamming distance

Example: the Hamming distance between strings ‘Lydia’ and ‘Media’ is 2.
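As a quick sanity check of the equivalence mentioned above, a short usage sketch (the variable names are illustrative only):

print(HammingDistance("Lydia", "Media"))  # 2 ('L' vs 'M' and 'y' vs 'e')

# For binary strings, the Hamming distance equals the Manhattan (l1)
# distance between the corresponding 0/1 vectors:
a, b = "1011101", "1001001"
manhattan = sum(abs(int(x) - int(y)) for x, y in zip(a, b))
print(HammingDistance(a, b), manhattan)  # 2 2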

Cosine Similarity:

The cosine similarity measures the proximity between two non-zero vectors of a
pre-Hilbert (inner product) space. The cosine similarity of two text units is simply
the cosine of the angle formed by the two vectors representing them, i.e. the
Euclidean inner product of the normalized vectors:

$$\cos(\theta) = \frac{\langle u, v \rangle}{\|u\| \, \|v\|}$$

When it is close to 1, the two units are close in the chosen vector space; when it is
close to -1, the two units are far apart.

Therefore, this metric captures orientation only, not magnitude: two vectors with
the same orientation have a cosine similarity of 1, two vectors oriented at 90° to
each other have a similarity of 0, and two diametrically opposed vectors have a
similarity of -1.
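As an illustrative sketch (assuming NumPy is available; the function name cosine_similarity is mine, not a library import):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two non-zero vectors.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0  (same orientation)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))       # -1.0 (opposed)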

Example of cosine similarities in a 2-dimensional space
Jaccard Distance:


The Jaccard distance measures how dissimilar two multisets are: the lower the
distance, the more similar the two multisets. It is computed from the Jaccard index
(or Jaccard similarity coefficient), which is the ratio of the cardinality of the
intersection of the multisets to the cardinality of their union. The distance is then
obtained by subtracting the index from 1. Mathematically, it can be written as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B)$$

These expressions are undefined when both A and B are empty, in which case we
define the index and the distance to be 1 and 0 respectively.

The Jaccard distance works quite well in practice, especially for sparse data. For
example, a streaming service like Netflix could represent customers as multisets of
movies watched, and then use the Jaccard distance to measure the similarity between
two customers, i.e. how close their tastes are. Then, based on the preferences of two
users and their similarity, we could make recommendations to one or the other.
Similarly, if we represent documents as the multisets of words they contain, then
the Jaccard distance between two documents is often a reasonable measure of their
similarity. In this case we would represent the multisets A and B as vectors va and
vb, with the i-th entry of va equal to the number of times the i-th element occurs
in A:

$$J(A, B) = \frac{\sum_i \min(v_{a,i}, v_{b,i})}{\sum_i \max(v_{a,i}, v_{b,i})}$$

The Jaccard distance used to compare text units represented as bags of words (BoW)
typically presents some flaws: as the size of the documents increases, the number of
common words tends to increase even if the documents discuss different topics.
Moreover, this metric cannot capture the similarity between text units that have the
same meaning but are written differently (this is more an issue of text
representation, but since the Jaccard distance is particularly well suited to the
BoW strategy, it remains a concern). For example, these two text units have the same
meaning, but their Jaccard distance will be close to 1 since they share almost no
words:
Text unit 1: President greets the press in Chicago
Text unit 2: Obama speaks in Illinois


Another concern is the sense of the sentence:
Text unit 1: The dog bites the man
Text unit 2: The man bites the dog
Although these two units have totally different meanings, their Jaccard distance is
0: the metric considers them maximally similar.
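To make the two examples above concrete, here is a minimal bag-of-words sketch (tokenization is naive whitespace splitting, and the function name is my own):

from collections import Counter

def jaccard_distance(text1, text2):
    # Represent each text as a multiset (bag) of lowercased words.
    a, b = Counter(text1.lower().split()), Counter(text2.lower().split())
    # Multiset intersection/union take the min/max of per-word counts.
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return 0.0 if union == 0 else 1 - inter / union

print(jaccard_distance("President greets the press in Chicago",
                       "Obama speaks in Illinois"))  # ~0.89, close to 1
print(jaccard_distance("The dog bites the man",
                       "The man bites the dog"))     # 0.0, maximally similar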

Euclidean Distance

The Euclidean distance is a token-based similarity distance. Given two points in
R^n, the Euclidean distance is simply the length of the line segment between these
two points. It is also often referred to as the l2-norm, the l2-distance or the
Pythagorean metric, and can be expressed as:

$$d_2(u, v) = \|u - v\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$

Properties:
The Euclidean distance is symmetric, positive and obeys the triangle inequality.

One of the important properties of this norm, relative to other norms, is that it
remains unchanged under arbitrary rotations of space around the origin.

Applications:
- in Euclidean geometry, to find the shortest distance between two points in a
Euclidean space,
- in clustering algorithms such as K-means,
- in statistics, to measure the similarity between data points, or in methods such
as least squares (where the squared Euclidean distance is used because it yields a
convex objective).

More generally, the Minkowski or lp distance is a generalization of the Euclidean
distance:

$$d_p(u, v) = \|u - v\|_p = \left( \sum_{i=1}^{n} |u_i - v_i|^p \right)^{1/p}$$


It is equal to the Manhattan distance if p=1, to the Euclidean distance if p=2,
and to the Chebyshev distance (maximum or L∞ metric) as p approaches infinity.
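A short sketch comparing the three special cases (assuming NumPy; minkowski_distance is a hypothetical helper, not from the article):

import numpy as np

def minkowski_distance(u, v, p):
    # lp distance: p=1 Manhattan, p=2 Euclidean, p=inf Chebyshev.
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1 / p)

u, v = [0, 0], [3, 4]
print(minkowski_distance(u, v, 1))       # 7.0 (Manhattan)
print(minkowski_distance(u, v, 2))       # 5.0 (Euclidean)
print(minkowski_distance(u, v, np.inf))  # 4.0 (Chebyshev)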

Conclusion
Now that you have an overview of some of the most common distance metrics used
in NLP, you can continue exploring other metrics or move on to text vectorization
to build on this knowledge.
I hope you enjoyed this article! Please feel free to contact me if you have any
questions or if you feel that additional explanations should be added.

Thanks for reading!


*All images (including formulas) are by the author except where stated otherwise
