How to Compute the Similarity Between Two Text Documents?
by Francesco Elia
1. Introduction
Computing the similarity between two text documents is a common task in NLP, with several practical applications. For example, it's commonly used to rank results in a search engine or to recommend similar content to readers.

Since text similarity is a loosely defined term, we'll first define it for the scope of this article. After that, we'll explore two different ways of computing similarity, along with their pros and cons.
Nevertheless, it's safe to say that we'd want an ideal similarity algorithm to return a high score for a pair of documents that express the same meaning with different words.
In the rest of the article, we won't cover advanced methods that can compute fine-grained semantic similarity (for example, handling negations). Instead, we'll first cover the foundational aspects of traditional document similarity algorithms. In the second part, we'll then introduce word embeddings, which we can use to integrate at least some semantic considerations.
3. Document Vectors
In the Vector Space Model, we represent each document as a vector with one component per word in the vocabulary, where each component holds the number of times that word occurs in the document. We can then measure the similarity between two documents as the cosine of the angle between their vectors:

$\text{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
As we can see, the first two documents have the highest similarity, since they share three words. Note that since the word "pizza" appears twice, it will contribute more to the similarity than "at", "ate", or "you".
4. TF-IDF Vectors
The simplified example we've just seen uses simple word counts to build document vectors. In practice, this approach has several problems. First of all, the most common words are usually the least significant ones. In fact, most documents will share many occurrences of words like "the", "is", and "of", which will negatively impact our similarity measurement.

These words are commonly referred to as stopwords and can be removed in a preprocessing step, as in the quick sketch below. Alternatively, a more sophisticated approach like TF-IDF can automatically give less weight to words that are frequent across a corpus.
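Stopword removal can be as simple as a set lookup. The list below is a tiny hand-picked sample for illustration; NLP libraries such as NLTK and spaCy ship much more complete lists:

```python
# A tiny, hand-picked stopword list for illustration only.
STOPWORDS = {"the", "is", "of", "a", "an", "and", "at"}

tokens = "the pizza at the restaurant is great".split()
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # ['pizza', 'restaurant', 'great']
```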
The idea behind TF-IDF is that we first compute the number of documents in which a word appears. If a word appears in many documents, it will be less relevant in the computation of the similarity, and vice versa. We call this value the inverse document frequency, or IDF, and we can compute it as:
$\text{idf}(t) = \log \frac{N}{\text{df}(t)}$

where $N$ is the total number of documents in the corpus and $\text{df}(t)$ is the number of documents that contain the term $t$. For example, in a corpus of 10 documents, a word that appears in all 10 gets $\text{idf} = \log(10/10) = 0$, while a word that appears in just one gets $\text{idf} = \log(10/1) \approx 2.3$.
Given that, the final score for each term $t$ in a document $d$ will be:

$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$

where $\text{tf}(t, d)$ is the term frequency, i.e., the number of times $t$ occurs in $d$.
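As a sketch of the whole computation, the snippet below uses raw counts as the term frequency and the unsmoothed IDF formula above. A production implementation would more likely use a library such as scikit-learn's TfidfVectorizer, which adds smoothing and normalization:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a TF-IDF vector for each tokenized document in a corpus."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    # tfidf(t, d) = tf(t, d) * idf(t)
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

corpus = [
    "the pizza was great".split(),
    "the pasta was great too".split(),
    "pizza and pasta are italian dishes".split(),
]
# Terms shared across documents get lower weights than rarer ones.
for vector in tfidf_vectors(corpus):
    print(vector)
```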
5. Word Embeddings
Word embeddings are dense vectors that represent words, typically with a few hundred dimensions. We can create them in an unsupervised way from a collection of documents, generally using neural networks, by analyzing all the contexts in which each word occurs.
This results in vectors that are similar (according to cosine similarity) for
words that appear in similar contexts, and thus have a similar meaning.
For example, since the words “teacher” and “professor” can sometimes
be used interchangeably, their embeddings will be close together.
For this reason, using word embeddings enables us to handle synonyms and words with similar meanings in the computation of similarity, which we couldn't do using word frequencies.

However, word embeddings are just vector representations of words, and there are several ways to integrate them into our text similarity computation. In the next section, we'll see a basic example of how we can do this.
A simple baseline is to represent each document with the average of the embeddings of its words, and then compare these document vectors with cosine similarity, just as before; the sketch below illustrates this.

Word embeddings are useful because they allow us to go beyond a basic lexical level, but we need to evaluate whether this is necessary for our use case, since they add extra complexity.
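Here's a minimal sketch of this averaging approach. The three-dimensional embeddings are made up for illustration; a real system would load pre-trained vectors (for example, word2vec or GloVe) with hundreds of dimensions:

```python
import math

# Toy 3-dimensional embeddings, invented for this example; real systems
# would load pre-trained vectors with hundreds of dimensions.
embeddings = {
    "teacher":   [0.9, 0.1, 0.3],
    "professor": [0.8, 0.2, 0.3],
    "explains":  [0.1, 0.9, 0.2],
    "lesson":    [0.2, 0.8, 0.1],
    "pizza":     [0.1, 0.1, 0.9],
}

def doc_vector(tokens):
    """Average the embeddings of the words we have vectors for."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = doc_vector("the teacher explains the lesson".split())
d2 = doc_vector("a professor explains".split())
d3 = doc_vector("pizza".split())

print(cosine(d1, d2))  # high: similar meaning despite different words
print(cosine(d1, d3))  # lower: unrelated topic
```

Note that, unlike the count-based approach, the first two documents score high even though they share only one content word, because "teacher" and "professor" have nearby embeddings.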
6. Conclusion
Text similarity is a very active research field, and techniques are
continuously evolving and improving. In this article, we’ve given an
overview of possible ways to implement text similarity, focusing on the
Vector Space Model and Word Embeddings.
We've seen how methods like TF-IDF can help weight terms appropriately, but without taking any semantic aspects into account. Word Embeddings, on the other hand, can help integrate semantics, but they have their own downsides.

For these reasons, when choosing which method to use, it's important to always consider our use case and requirements carefully.