TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It can be defined as a measure
of how relevant a word is to a document within a series or corpus of texts.
TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a
collection or corpus of documents. It's commonly used in text mining, natural language
processing (NLP), and information retrieval to rank documents based on the relevance of a
search term.
The TF-IDF value increases proportionally to the number of times a word appears in the
document and is offset by the number of documents in the corpus that contain the word,
which helps to adjust for the fact that some words appear more frequently in general.
TF-IDF is generally preferred over Bag-of-Words. In Bag-of-Words, every word is represented as
1 or 0 each time it appears in a sentence, whereas TF-IDF gives a separate weight to each word,
which in turn reflects the importance of each word relative to the others.
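As a minimal sketch of this difference, assuming scikit-learn is available (CountVectorizer and
TfidfVectorizer are its standard vectorizers; the three-sentence corpus matches the worked
example later in this text):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ["good boy", "good girl", "boy girl good"]

    # Bag-of-Words: 1 if the word occurs in the sentence, 0 otherwise.
    bow = CountVectorizer(binary=True)
    bow_matrix = bow.fit_transform(corpus).toarray()
    print(bow.get_feature_names_out())  # ['boy' 'girl' 'good']
    print(bow_matrix)                   # every occurring word gets the same value, 1

    # TF-IDF: each word gets a real-valued weight instead of 1/0.
    # Note: scikit-learn uses a smoothed IDF and L2 normalization, so its
    # numbers differ from the hand calculation later in this text (in
    # particular, "good" does not drop exactly to zero here).
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(corpus).toarray()
    print(tfidf_matrix.round(3))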
Components of TF-IDF:
TF-IDF is composed of two parts:
•Term Frequency (TF)
•Inverse Document Frequency (IDF)
Term Frequency (TF):
TF measures how frequently a term appears in a document. The assumption is that the more a word
occurs in a document, the more important it is to that document.
Since every document differs in length, a term may appear many more times in long documents than
in short ones. Thus, the term frequency is often divided by the document length (i.e., the total
number of terms in the document) as a way of normalization:
The formula for Term Frequency (TF) is:

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

Inverse Document Frequency (IDF):
IDF measures how rare a term is across the whole corpus. The formula for Inverse Document
Frequency (IDF) is:

IDF(t) = log(Total number of sentences / Number of sentences containing the term t)

Where:
•t is the term or word.
•Total number of sentences is the total count of sentences in the corpus.
•Number of sentences containing the term t is how many sentences include the term t.
High IDF value means the term is rare across documents, making it more important.
Low IDF value means the term is common, making it less valuable for distinguishing documents.
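As a minimal sketch, the two formulas can be implemented directly in plain Python (the
three-sentence corpus is the one used in the worked example below; the logarithm base is a
convention, natural log here):

    import math

    corpus = ["good boy", "good girl", "boy girl good"]
    tokenized = [sentence.split() for sentence in corpus]

    def tf(term, tokens):
        # Term Frequency: occurrences of the term divided by the document length.
        return tokens.count(term) / len(tokens)

    def idf(term, docs):
        # Inverse Document Frequency: log(total documents / documents containing the term).
        containing = sum(1 for tokens in docs if term in tokens)
        return math.log(len(docs) / containing)

    print(tf("good", tokenized[0]))  # 1/2 = 0.5
    print(idf("good", tokenized))    # log(3/3) = 0.0, "good" occurs in every sentence
    print(idf("boy", tokenized))     # log(3/2), "boy" occurs in 2 of 3 sentences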
Working procedure of TF-IDF
The working procedure of TF-IDF involves calculating Term Frequency (TF) and Inverse Document
Frequency (IDF) for each term in a document or corpus, and then multiplying these two values together
to get the TF-IDF score. Here’s a step-by-step breakdown of how it works:
Step-by-Step Procedure of TF-IDF Calculation
Step 1: Collect the Documents (Corpus)
•Start with a collection of documents (or sentences) that form the corpus.
•Each document is treated as a bag of words (i.e., no particular order is assumed for the words).
Step 2: Calculate Term Frequency (TF)
•For each term in each document, divide the number of times the term appears by the total number
of terms in that document.
Step 3: Calculate Inverse Document Frequency (IDF)
•For each term, take the logarithm of the total number of documents divided by the number of
documents containing the term.
Step 4: Multiply TF and IDF
•The TF-IDF score of a term in a document is the product TF × IDF.
For example, consider a corpus of three sentences:
Sentence 1: "good boy"
Sentence 2: "good girl"
Sentence 3: "boy girl good"
The word "Good" appears 1 time in Sentence 1, which has 2 words in total, so
TF("Good", Sentence 1) = 1/2 = 0.5. Likewise, in Sentence 3 it appears 1 time out of 3 words,
so TF("Good", Sentence 3) = 1/3 ≈ 0.333.
Now, let's consider the TF value of each word with reference to each sentence, in a tabular
format:

Word   Sentence 1   Sentence 2   Sentence 3
good   1/2          1/2          1/3
boy    1/2          0            1/3
girl   0            1/2          1/3
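A short plain-Python sketch that reproduces this table:

    corpus = ["good boy", "good girl", "boy girl good"]
    tokenized = [sentence.split() for sentence in corpus]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    # Print the TF of every word in every sentence.
    for word in vocabulary:
        row = [tokens.count(word) / len(tokens) for tokens in tokenized]
        print(word, [round(value, 3) for value in row])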
Now, let's consider the second part of TF-IDF, that is, the IDF (Inverse Document Frequency) of
each word with respect to the corpus:

Word   Sentences containing it   IDF
good   3 of 3                    log(3/3) = 0
boy    2 of 3                    log(3/2)
girl   2 of 3                    log(3/2)

As we know, the final score multiplies the two parts: TF-IDF(t, d) = TF(t, d) × IDF(t).
A high TF-IDF score indicates that the term is frequent in a particular document but rare across
the corpus, making it significant for that document.
A low TF-IDF score implies that the term is either not very frequent in the document or common
across the entire corpus, reducing its relevance.
So, multiplying TF and IDF gives the final TF-IDF value of each word in each sentence:

Word   Sentence 1         Sentence 2         Sentence 3
good   (1/2) × 0 = 0      (1/2) × 0 = 0      (1/3) × 0 = 0
boy    (1/2) × log(3/2)   0                  (1/3) × log(3/2)
girl   0                  (1/2) × log(3/2)   (1/3) × log(3/2)

In conclusion, the word "Good" appears in each of the 3 sentences, so its IDF, and hence its
TF-IDF value, is zero everywhere. The word "Boy" appears in only 2 of the 3 sentences, so it
keeps a non-zero weight. As a result, in Sentence 1 the value (importance) of the word "Boy" is
higher than that of the word "Good".
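A compact plain-Python sketch that computes the full TF-IDF table and confirms this conclusion:

    import math

    corpus = ["good boy", "good girl", "boy girl good"]
    tokenized = [sentence.split() for sentence in corpus]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    def tf_idf(word, tokens, docs):
        tf = tokens.count(word) / len(tokens)               # term frequency
        containing = sum(1 for doc in docs if word in doc)  # document frequency
        idf = math.log(len(docs) / containing)              # inverse document frequency
        return tf * idf

    for i, tokens in enumerate(tokenized, start=1):
        scores = {word: round(tf_idf(word, tokens, tokenized), 3) for word in vocabulary}
        print(f"Sentence {i}: {scores}")
    # "good" scores 0.0 everywhere; in Sentence 1, "boy" scores higher than "good".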
As a result, we can see that TF-IDF gives a specific value, or importance, to each word in a
paragraph, and terms with higher weight scores are considered more important. This is why TF-IDF
has largely replaced the Bag-of-Words approach, which gives the same value to every word that
occurs in any sentence of a paragraph, a major disadvantage of Bag-of-Words.
TF-IDF was invented for document search and can be used to deliver results that are most relevant
to what you're searching for. Imagine you have a search engine and somebody looks for "Dog". The
results will be displayed in order of relevance: the articles most relevant to dogs will be
ranked higher because TF-IDF gives the word "Dog" a higher score in them.
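A minimal ranking sketch, assuming scikit-learn; the documents and the query are made-up
illustrations, and cosine similarity over TF-IDF vectors stands in for a real search engine's
scoring:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "A dog is a loyal pet and a great companion.",
        "Cats are independent animals.",
        "My dog enjoys daily walks and training.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)

    # Represent the query in the same TF-IDF space and rank by cosine similarity.
    query_vector = vectorizer.transform(["dog"])
    scores = cosine_similarity(query_vector, doc_vectors).ravel()

    for score, doc in sorted(zip(scores, documents), reverse=True):
        print(f"{score:.3f}  {doc}")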
Applications of TF-IDF:
•Document retrieval systems: Search engines use TF-IDF to rank documents based on the relevance
of the search query.
•Text classification: TF-IDF is used as a feature for training machine learning models in text
classification tasks.
•Keyword extraction: TF-IDF helps identify key terms in documents for summarization or keyword
extraction (see the sketch after this list).
•Content filtering: TF-IDF can help recommend articles or documents by comparing the importance of
terms within different pieces of text.
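As one example, here is a minimal keyword-extraction sketch, assuming scikit-learn (the documents
are made up): each document's top-scoring TF-IDF terms are taken as its keywords.

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "Machine learning models learn patterns from training data.",
        "Search engines rank documents by relevance to a query.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents).toarray()
    vocab = vectorizer.get_feature_names_out()

    # For each document, take the 3 terms with the highest TF-IDF scores.
    for row, doc in zip(matrix, documents):
        top = row.argsort()[::-1][:3]
        print([vocab[i] for i in top], "<-", doc)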
Limitations of TF-IDF:
•Not context-aware: TF-IDF does not account for the meaning or context of the words, as it relies
solely on frequency.
•Synonym handling: It doesn't recognize synonyms (e.g., "car" and "automobile" will be treated as
separate terms; see the sketch after this list).
•Ignoring semantic relationships: TF-IDF doesn't understand relationships between words (e.g., "cat"
and "animal" will be treated independently).
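A small demonstration of the synonym limitation, assuming scikit-learn (the two sentences are
made up): "car" and "automobile" end up as unrelated columns in the vocabulary.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["I bought a car", "I bought an automobile"]
    vectorizer = TfidfVectorizer()
    vectorizer.fit(docs)
    print(vectorizer.get_feature_names_out())
    # ['an' 'automobile' 'bought' 'car'] -> "car" and "automobile" are
    # separate features, even though the sentences mean the same thing.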