Similarity Distances For Natural Language Processing
In this article, we consider the case of textual data and address the question: how should we handle feature engineering with textual data?
Some of the most common ways to capture similarity between text units are:
- Longest Common Substring (LCS),
- Levenshtein Edit Distance,
- Hamming Distance,
- Cosine Similarity,
- Jaccard Distance,
- Euclidean Distance.
Longest Common Substring (LCS):
The longest common substring of two strings is the longest string that is a substring (i.e. a contiguous block of characters) of both. Its length can be computed with dynamic programming:

def LCS(s1, s2):
    """Return the length of the longest common substring of s1 and s2."""
    n1, n2, res = len(s1), len(s2), 0
    # LCSubstring[i][j] = length of the longest common suffix of s1[:i] and s2[:j]
    LCSubstring = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if s1[i - 1] == s2[j - 1]:
                # extend the common suffix ending at s1[i-1] and s2[j-1]
                LCSubstring[i][j] = LCSubstring[i - 1][j - 1] + 1
                res = max(res, LCSubstring[i][j])
            # on a mismatch the entry stays 0: the common suffix is broken
    return res
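For example, LCS('powerful', 'flower') returns 4, the length of the shared substring 'ower'.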
Levenshtein Edit Distance:
The Levenshtein distance between two strings is the minimum number of single-character operations (insertions, deletions, or substitutions) required to change one string into the other.
Properties:
The Levenshtein distance is a metric: it is non-negative, zero only for identical strings, symmetric, and satisfies the triangle inequality. It is at least the difference of the lengths of the two strings and at most the length of the longer one.
Applications:
- in approximate string matching where the objective is to find matches for short
strings in longer texts (spelling correction, OCR correction systems, …),
- in bioinformatics to quantify the similarity of DNA sequences (which can be
viewed as strings of letters A, C, G and T),
- and it can also be used in music to measure the similarity of melodies.
Example: changing ‘hubert’ into ‘uber’ requires only 2 operations (deleting ‘h’ and deleting ‘t’), thus lev(‘hubert’, ‘uber’) = 2.
NB: different definitions of edit distance use different sets of operations. The longest common subsequence distance allows only insertion and deletion, not substitution. The Hamming distance allows only substitution and therefore only applies to strings of the same length. The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters.
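To make this concrete, here is a minimal dynamic-programming sketch of the Levenshtein distance (an illustrative implementation, one among many possible):

def levenshtein(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    n1, n2 = len(s1), len(s2)
    # dist[i][j] = Levenshtein distance between s1[:i] and s2[:j]
    dist = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        dist[i][0] = i  # turning s1[:i] into '' takes i deletions
    for j in range(n2 + 1):
        dist[0][j] = j  # turning '' into s2[:j] takes j insertions
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[n1][n2]

print(levenshtein('hubert', 'uber'))  # 2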
Hamming Distance:
The Hamming distance between two strings of equal length is the number of positions at which their symbols differ.
Properties:
The Hamming distance is a metric on the set of strings of a fixed length n: it is non-negative, symmetric, and satisfies the triangle inequality.
Application: mainly used in coding theory for error detection and correction (note that the following representation also illustrates that the Hamming distance between binary strings is equivalent to the Manhattan distance between the corresponding vertices of a hypercube).
3-bit binary cube (left) and 4-bit tesseract (right) for finding Hamming distances
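A minimal sketch of such an implementation, assuming the two strings have equal length:

def hamming_distance(s1, s2):
    """Number of positions at which s1 and s2 differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    # count mismatching symbol pairs, position by position
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance('karolin', 'kathrin'))  # 3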
Implementation Proposition of Hamming Distance
Cosine Similarity:
The cosine similarity measures the proximity between two non-zero vectors of a pre-Hilbert (inner-product) space. The cosine similarity of two text units is simply the cosine of the angle formed by the two vectors representing the text units, i.e. the Euclidean inner product of the normalized vectors. When it is close to 1, the two units are close in the chosen vector space; when it is close to -1, the two units are far apart.
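For two vectors a and b, it is given by cos(θ) = (a · b) / (‖a‖ ‖b‖). A minimal NumPy sketch, applied here to toy bag-of-words count vectors:

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # inner product of the vectors, normalized by their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# two word-count vectors over a shared 4-word vocabulary
print(cosine_similarity([1, 2, 0, 1], [2, 1, 1, 0]))  # ~0.67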
Jaccard Distance:
The Jaccard distance measures how dissimilar two multisets are: the lower the distance, the more similar the two multisets. It is computed using the Jaccard index (or Jaccard similarity coefficient), which is the ratio of the cardinality of the intersection of the multisets to the cardinality of their union. The distance is then obtained by subtracting the index from 1. Mathematically, it can be written as:
J(A, B) = |A ∩ B| / |A ∪ B| and d_J(A, B) = 1 − J(A, B)
These expressions are undefined if both A and B are empty sets, in which case we define the index and the distance to be 1 and 0 respectively.
The Jaccard distance works quite well in practice, especially for sparse data. For example, a streaming service like Netflix could represent customers as multisets of movies watched, and then use the Jaccard distance to measure the similarity between two customers, i.e. how close their tastes are. Then, based on the preferences of two users and their similarity, we could make recommendations to one or the other.
Similarly, if we represent documents as the multisets of words they contain, then the Jaccard distance between two documents is often a reasonable measure of their similarity. In this case we would represent the multisets A and B as vectors v_a and v_b, with the i-th coordinate of v_a equal to the number of times the i-th vocabulary element appears in A.
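A minimal sketch of this multiset view, using collections.Counter (the two toy documents below are illustrative):

from collections import Counter

def jaccard_distance(a, b):
    """Jaccard distance between two multisets given as iterables."""
    ca, cb = Counter(a), Counter(b)
    # multiset intersection keeps the minimum count of each element,
    # multiset union keeps the maximum count
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    if union == 0:
        return 0.0  # both multisets empty: distance defined as 0
    return 1 - intersection / union

doc1 = "the quick brown fox".split()
doc2 = "the lazy brown dog".split()
print(jaccard_distance(doc1, doc2))  # 1 - 2/6 ~ 0.67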
The Jaccard distance used to compare text units represented as bags of words (BoW) typically suffers from some flaws: as the size of the documents increases, the number of common words tends to increase even if the documents deal with different topics. Moreover, this metric cannot capture the similarity between text units that have the same meaning but are written differently (this is more an issue of text representation, but since the Jaccard distance is particularly well suited to the BoW strategy, it remains a concern). For example, these two text units have the same meaning, but their Jaccard distance will be close to 1 since they share almost no common words:
Text unit 1: President greets the press in Chicago
Text unit 2: Obama speaks to the media in Illinois
Euclidean Distance
Euclidean distance is a token-based similarity distance. Given two points x and y in R^n, the Euclidean distance is simply the length of the line segment between these two points. It is also often referred to as the l2-norm, the l2-distance, or the Pythagorean metric, and can be expressed as:
d(x, y) = √( (x_1 − y_1)² + … + (x_n − y_n)² )
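A minimal NumPy sketch:

import numpy as np

def euclidean_distance(x, y):
    """l2-distance between two points of R^n."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0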
Properties:
The Euclidean distance is symmetric, non-negative (and zero only for identical points), and obeys the triangle inequality.
One of the important properties of this norm, relative to other norms, is that it
remains unchanged under arbitrary rotations of space around the origin.
Applications:
- in Euclidean geometry, to find the shortest distance between two points in a
Euclidean space,
- in clustering algorithms such as K-means,
- in statistics, to measure the similarity between data points, or in methods such as least squares (where the squared Euclidean distance is used because it is convex, which makes the optimization tractable).
More generally, the Minkowski distance of order p, defined as d(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p), generalizes the Euclidean distance. It is equal to the Manhattan distance if p=1, to the Euclidean distance if p=2, and to the Chebyshev distance (maximum or L∞ metric) as p approaches infinity.
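A minimal NumPy sketch, assuming an order p ≥ 1:

import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p between two points of R^n."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(p):
        # Chebyshev distance: the limit of the Minkowski distance as p grows
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x, y = [0, 0], [3, 4]
print(minkowski_distance(x, y, p=1))       # 7.0 (Manhattan)
print(minkowski_distance(x, y, p=2))       # 5.0 (Euclidean)
print(minkowski_distance(x, y, p=np.inf))  # 4.0 (Chebyshev)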
Conclusion
Now that you have an overview of some of the most common distance metrics used in NLP, you can go on to explore other metrics or dive into text vectorization to build on this knowledge.
I hope you enjoyed this article! Please feel free to contact me if you have any
questions or if you feel that additional explanations should be added to this article.