Similarity Distances for Natural Language Processing
Flavien Vidal · 8 min read · Apr 29, 2021


source: _dChris via flickr (CC BY 2.0)

Importance of having good text representation:


Feature engineering and selection is surely one of the most crucial steps of any
machine learning project. No matter which algorithm you choose, if the features
you give it are bad, the results you get will be bad too. This is well
summarized by the expression: “garbage in, garbage out”. Feature engineering is
only optimal if we have a good knowledge of the problem, and it strongly depends on
the data in question. However, some recurrent techniques are widely used in
practice and can be helpful in many different problems. In this article, we focus on
the case of textual data and address the question: how to handle feature engineering
with textual data?

Vector Space Model


In order for a computer to work with text data, the text must be converted into some
mathematical form. Therefore, we usually represent text units (characters, words,
sentences, paragraphs, documents, …) with vectors.
The idea is to represent these text units as vectors of numbers that enable many
operations such as information retrieval, document clustering, language translation
or text summarization. Thus, the vector space model we choose for a given problem
should enable us to easily compare documents by measuring the distance between
their corresponding vectors, that is to say their similarity. There are multiple ways to
compute vectors that capture the structure or semantics of text units.
Before diving into the different methods of text representation, it is necessary
to start by discussing some techniques for measuring similarity between text units.

Quantifying similarity between text units:


There are a wide variety of similarity metrics that can be used for quantifying the
similarity between text units, but they do not all share the same definition of
what “similar” means. Depending on the problem we face and the type of
resemblance we want to capture between units, we may choose one metric over
another.

Some of the most common ways to capture similarity between text units are:
- Longest Common Substring (LCS),
- Levenshtein Edit Distance,
- Hamming Distance,
- Cosine Similarity,
- Jaccard Distance,
- Euclidean Distance.

Longest Common Substring (LCS):

The LCS is a common example of a character-based similarity measure. Given two
strings, s1 of length n1 and s2 of length n2, it simply considers the length of the
longest string (or strings) that is a substring of both s1 and s2.

Applications: data deduplication and plagiarism detection.

def LCS(s1, s2):
    """Length of the longest common substring of s1 and s2 (dynamic programming)."""
    n1, n2, res = len(s1), len(s2), 0
    # LCSubstring[i][j] = length of the longest common suffix of s1[:i] and s2[:j]
    LCSubstring = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                LCSubstring[i][j] = LCSubstring[i - 1][j - 1] + 1
                res = max(res, LCSubstring[i][j])
    return res



A proposed implementation of the Longest Common Substring

Example: the LCS of the strings ‘Lydia’ and ‘Lydoko’ is 3 (the common substring ‘Lyd’).

Levenshtein Edit Distance (V. Levenshtein, 1965):

The Levenshtein distance is another common example of a character-based similarity
measure. It quantifies how dissimilar two text units are by computing the
minimum number of single-character edits (replacement, deletion
and insertion operations) required to convert one text unit into the other.
Mathematically, it can be written as:

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0 \\ \min\bigl(\mathrm{lev}_{a,b}(i-1, j) + 1,\ \mathrm{lev}_{a,b}(i, j-1) + 1,\ \mathrm{lev}_{a,b}(i-1, j-1) + \mathbf{1}_{a_i \neq b_j}\bigr) & \text{otherwise} \end{cases}$$

Properties:
The Levenshtein distance is a true metric: it is non-negative, zero exactly when the
two strings are equal, symmetric, and it satisfies the triangle inequality. It is
bounded below by the difference of the lengths of the two strings and above by the
length of the longer one.

Applications:
- in approximate string matching where the objective is to find matches for short
strings in longer texts (spelling correction, OCR correction systems, …),
- in bioinformatics to quantify the similarity of DNA sequences (which can be
viewed as strings of letters A, C, G and T),
- and it can also be used in music to measure the similarity of melodies.

def LevenshteinEditDistance(s1, s2, n1, n2):
    # Base cases: if one string is empty, the distance is the length of the other.
    if n1 == 0: return n2
    if n2 == 0: return n1
    # Last characters match: no edit needed for them.
    if s1[n1-1] == s2[n2-1]:
        return LevenshteinEditDistance(s1, s2, n1-1, n2-1)
    # Otherwise: one edit plus the cheapest of insert / delete / replace.
    return 1 + min(LevenshteinEditDistance(s1, s2, n1, n2-1),    # insert
                   LevenshteinEditDistance(s1, s2, n1-1, n2),    # delete
                   LevenshteinEditDistance(s1, s2, n1-1, n2-1))  # replace


A proposed implementation of the Levenshtein distance

Example: changing ‘hubert’ into ‘uber’ needs only 2 operations (delete ‘h’ and
delete ‘t’), thus lev(‘hubert’, ‘uber’) = 2.
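Note that the recursion above recomputes the same subproblems over and over, so it runs in exponential time. As a minimal sketch (the function name levenshtein_dp is my own, not from the original gist), the classic dynamic-programming formulation computes the same distance in O(n1 × n2) time:

def levenshtein_dp(s1, s2):
    n1, n2 = len(s1), len(s2)
    # dist[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dist = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        dist[i][0] = i  # delete all i characters of s1
    for j in range(n2 + 1):
        dist[0][j] = j  # insert all j characters of s2
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j] + 1,         # delete
                             dist[i - 1][j - 1] + cost)  # replace (or match)
    return dist[n1][n2]

print(levenshtein_dp("hubert", "uber"))  # 2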

NB: different definitions of edit distance use different sets of operations. The
longest common subsequence distance allows only insertion and deletion, not
substitution. The Hamming distance allows only substitution and therefore only
applies to strings of the same length. The Damerau–Levenshtein distance allows
insertion, deletion, replacement and transposition of two adjacent characters. The
Jaro distance only allows transposition.

Hamming Distance (R. Hamming, 1950):

Another character-based similarity measure is the Hamming distance. The Hamming
distance between two strings of equal length measures the minimum number of
replacements required to change one string into the other, i.e. the number of
positions at which the corresponding symbols differ. Mathematically, this can
be written as follows:

$$d_H(s_1, s_2) = \sum_{i=1}^{n} \mathbf{1}_{s_1[i] \neq s_2[i]}$$

Properties:
On strings of a fixed length n, the Hamming distance is a metric: it is
non-negative, bounded above by n, symmetric, and it satisfies the triangle
inequality.

Application: mainly used in coding theory for error detection and correction (note
that the following representation also exhibits the fact that the Hamming distance
between binary strings is equivalent to the Manhattan distance between vertices of
the hypercube).

3-bit binary cube (left) and 4-bit tesseract (right) for finding Hamming distances

def HammingDistance(s1, s2):
    # The Hamming distance is only defined for strings of equal length.
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    res = 0
    # Count the positions at which the corresponding characters differ.
    for c1, c2 in zip(s1, s2):
        if c1 != c2: res += 1
    return res

A proposed implementation of the Hamming distance

Example: the Hamming distance between strings ‘Lydia’ and ‘Media’ is 2.
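As a quick sanity check of the equivalence mentioned above, a short usage sketch (the variable names are illustrative only):

print(HammingDistance("Lydia", "Media"))  # 2 ('L' vs 'M' and 'y' vs 'e')

# For binary strings, the Hamming distance equals the Manhattan (l1)
# distance between the corresponding 0/1 vectors:
a, b = "1011101", "1001001"
manhattan = sum(abs(int(x) - int(y)) for x, y in zip(a, b))
print(HammingDistance(a, b), manhattan)  # 2 2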

Cosine Similarity:

The cosine similarity measures the proximity between two non-zero vectors of a
pre-Hilbert (inner product) space. The cosine similarity of two text units is simply
the cosine of the angle formed by the two vectors representing them, i.e. the
Euclidean inner product of the normalized vectors:

$$\cos(\theta) = \frac{\langle u, v \rangle}{\|u\| \, \|v\|}$$

When it is close to 1, the two units are close in the chosen vector space; when it is
close to -1, the two units are far apart.

Therefore, this metric captures orientation only, not magnitude: two vectors with
the same orientation have a cosine similarity of 1, two vectors oriented at 90° to
each other have a similarity of 0, and two diametrically opposed vectors have a
similarity of -1.
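As an illustrative sketch (assuming NumPy is available; the function name cosine_similarity is mine, not a library import):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two non-zero vectors.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # 1.0  (same orientation)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))       # -1.0 (opposed)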

Example of cosine similarities in a 2-dimensional space
Jaccard Distance:


The Jaccard distance measures how dissimilar two multisets are: the lower the
distance, the more similar the two multisets. It is computed from the Jaccard index
(or Jaccard similarity coefficient), which is the ratio of the cardinality of the
intersection of the multisets to the cardinality of their union. The distance is then
obtained by subtracting the index from 1. Mathematically, it can be written as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B)$$

These expressions are undefined when both A and B are empty, in which case we
define the index and the distance to be 1 and 0 respectively.

The Jaccard distance works quite well in practice, especially for sparse data. For
example, a streaming service like Netflix could represent customers as multisets of
movies watched, and then use the Jaccard distance to measure the similarity between
two customers, i.e. how close their tastes are. Then, based on the preferences of two
users and their similarity, we could make recommendations to one or the other.
Similarly, if we represent documents as the multisets of words they contain, then
the Jaccard distance between two documents is often a reasonable measure of their
similarity. In this case we would represent the multisets A and B as vectors va and
vb, with the i-th entry of va equal to the number of times the i-th element occurs
in A:

$$J(A, B) = \frac{\sum_i \min(v_{a,i}, v_{b,i})}{\sum_i \max(v_{a,i}, v_{b,i})}$$

The Jaccard distance used to compare text units represented as bags of words (BoW)
typically presents some flaws: as the size of the documents increases, the number of
common words tends to increase even if the documents discuss different topics.
Moreover, this metric cannot capture the similarity between text units that have the
same meaning but are written differently (this is more an issue of text
representation, but since the Jaccard distance is particularly well suited to the
BoW strategy, it remains a concern). For example, these two text units have the same
meaning, but their Jaccard distance will be close to 1 since they share almost no
words:
Text unit 1: President greets the press in Chicago
Text unit 2: Obama speaks in Illinois


Another concern is the sense of the sentence:
Text unit 1: The dog bites the man
Text unit 2: The man bites the dog
Although these two units have totally different meanings, their Jaccard distance is
0: the metric considers them maximally similar.
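To make the two examples above concrete, here is a minimal bag-of-words sketch (tokenization is naive whitespace splitting, and the function name is my own):

from collections import Counter

def jaccard_distance(text1, text2):
    # Represent each text as a multiset (bag) of lowercased words.
    a, b = Counter(text1.lower().split()), Counter(text2.lower().split())
    # Multiset intersection/union take the min/max of per-word counts.
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return 0.0 if union == 0 else 1 - inter / union

print(jaccard_distance("President greets the press in Chicago",
                       "Obama speaks in Illinois"))  # ~0.89, close to 1
print(jaccard_distance("The dog bites the man",
                       "The man bites the dog"))     # 0.0, maximally similar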

Euclidean Distance

The Euclidean distance is a token-based similarity distance. Given two points in
R^n, the Euclidean distance is simply the length of the line segment between these
two points. It is also often referred to as the l2-norm, the l2-distance or the
Pythagorean metric, and can be expressed as:

$$d_2(u, v) = \|u - v\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$

Properties:
The Euclidean distance is symmetric, positive and obeys the triangle inequality.

One of the important properties of this norm, relative to other norms, is that it
remains unchanged under arbitrary rotations of space around the origin.

Applications:
- in Euclidean geometry, to find the shortest distance between two points in a
Euclidean space,
- in clustering algorithms such as K-means,
- in statistics, to measure the similarity between data points, or in methods such
as least squares (where the squared Euclidean distance is used because it yields a
convex objective).

More generally, the Minkowski or lp distance is a generalization of the Euclidean
distance:

$$d_p(u, v) = \|u - v\|_p = \left( \sum_{i=1}^{n} |u_i - v_i|^p \right)^{1/p}$$


It is equal to the Manhattan distance if p=1, to the Euclidean distance if p=2,
and to the Chebyshev distance (maximum or L∞ metric) as p approaches infinity.
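A short sketch comparing the three special cases (assuming NumPy; minkowski_distance is a hypothetical helper, not from the article):

import numpy as np

def minkowski_distance(u, v, p):
    # lp distance: p=1 Manhattan, p=2 Euclidean, p=inf Chebyshev.
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    return diff.max() if np.isinf(p) else (diff ** p).sum() ** (1 / p)

u, v = [0, 0], [3, 4]
print(minkowski_distance(u, v, 1))       # 7.0 (Manhattan)
print(minkowski_distance(u, v, 2))       # 5.0 (Euclidean)
print(minkowski_distance(u, v, np.inf))  # 4.0 (Chebyshev)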

Conclusion
Now that you have an overview of some of the most common distance metrics used
in NLP, you can continue exploring other metrics or move on to text vectorization
to build on this knowledge.
I hope you enjoyed this article! Please feel free to contact me if you have any
questions or if you feel that additional explanations should be added.

Thanks for reading!


*All images (including formulas) are by the author except where stated otherwise
