TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a corpus, commonly applied in text mining and information retrieval. It consists of two components: Term Frequency (TF), which measures how often a term appears in a document, and Inverse Document Frequency (IDF), which assesses the rarity of a term across the entire corpus. TF-IDF is preferred over the Bag-Of-Words model as it assigns different weights to words based on their significance, and is utilized in applications like document retrieval, text classification, and keyword extraction.

Uploaded by Ha Yanga

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It can be defined as a calculation of how relevant a word in a series or corpus is to a text.
TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection or corpus of documents. It is commonly used in text mining, natural language processing (NLP), and information retrieval to rank documents based on the relevance of a search term.
The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words appear more frequently in general.

TF-IDF is generally preferred over Bag-of-Words. In Bag-of-Words, every word is represented as a 1 or 0 depending on whether it appears in a sentence, whereas TF-IDF assigns each word its own weight, which reflects how important that word is relative to the others.
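To make the contrast concrete, here is a minimal sketch (using a made-up two-document corpus) of the binary Bag-of-Words representation that TF-IDF improves on:

```python
# Bag-of-Words: mark each vocabulary word as 1 (present) or 0 (absent).
docs = [["good", "boy"], ["good", "girl"]]
vocab = sorted({term for doc in docs for term in doc})  # ['boy', 'girl', 'good']
bow = [[1 if term in doc else 0 for term in vocab] for doc in docs]
# Every present word gets the same weight (1), regardless of its importance.
```

Every document becomes a vector of identical 1s and 0s; TF-IDF replaces those 1s with per-word weights.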
 Components of TF-IDF:
TF-IDF is composed of two parts:
•Term Frequency (TF)
•Inverse Document Frequency (IDF)
Term Frequency (TF):
TF measures how frequently a term appears in a document. The assumption is that the more a word
occurs in a document, the more important it is to that document.
Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization:
The formula for Term Frequency (TF) is:

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

 High TF value means the term appears frequently in the document.


 Low TF value means the term is less frequent.
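As a minimal sketch (assuming documents are already tokenized into plain Python lists), the TF formula above can be computed like this:

```python
def term_frequency(term, doc_tokens):
    # TF = count of the term in the document / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

# "good" appears once in a 2-term document, so its TF is 0.5.
tf_good = term_frequency("good", ["good", "boy"])
```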
Inverse Document Frequency (IDF):
Inverse Document Frequency measures the importance of a term across all documents in the corpus. When computing TF, all terms are treated as equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times yet carry little importance. We therefore need to weigh down the frequent terms and scale up the rare ones, by computing the following:
The formula for Inverse Document Frequency (IDF) is:

IDF(t) = log(Total number of sentences / Number of sentences containing the term t)

Where:
•t is the term or word.
•Total number of sentences is the total count of sentences in the corpus.
•Number of sentences containing the term t is how many sentences include the term t.

 High IDF value means the term is rare across documents, making it more important.
 Low IDF value means the term is common, making it less valuable for distinguishing documents.
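Using the sentence-level IDF formula above, a minimal sketch (with the corpus represented as a list of token lists) looks like this:

```python
import math

def inverse_document_frequency(term, corpus):
    """corpus: list of tokenized documents (token lists)."""
    containing = sum(1 for doc in corpus if term in doc)
    # IDF = log(total documents / documents containing the term)
    return math.log(len(corpus) / containing)

corpus = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
idf_good = inverse_document_frequency("good", corpus)  # log(3/3) = 0.0
```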
Working procedure of TF-IDF

The working procedure of TF-IDF involves calculating Term Frequency (TF) and Inverse Document
Frequency (IDF) for each term in a document or corpus, and then multiplying these two values together
to get the TF-IDF score. Here’s a step-by-step breakdown of how it works:
Step-by-Step Procedure of TF-IDF Calculation
Step 1: Collect the Documents (Corpus)
•Start with a collection of documents (or sentences) that form the corpus.
•Each document is treated as a bag of words (i.e., no particular order is assumed for the words).

Step 2: Preprocess the Text (Optional)


•Clean and preprocess the text, including:
• Tokenization: Breaking down text into individual words (tokens).
• Lowercasing: Converting all words to lowercase to avoid case-sensitive issues.
• Removing stop words: Exclude common words like "and", "is", "the" (not always required).
• Stemming or Lemmatization: Reducing words to their root forms.
Step 3: Calculate Term Frequency (TF)
For each document, compute the Term Frequency (TF) for each term (word).
The TF of a term t in a document d is:

TF(t, d) = (Number of times t appears in d) / (Total number of terms in d)

Step 4: Calculate Inverse Document Frequency (IDF)


Next, compute the IDF for each term across the entire corpus.
The IDF of a term t is:

IDF(t) = log(Total number of documents in the corpus / Number of documents containing the term t)

Step 5: Calculate TF-IDF


Finally, for each term in each document, calculate the TF-IDF score by multiplying its TF and IDF values:

TF-IDF(t, d) = TF(t, d) × IDF(t)
Step 6: Ranking Terms by TF-IDF
•After calculating the TF-IDF score for all terms in all documents, you can rank the terms in each
document based on their TF-IDF score.
•Terms with higher TF-IDF scores are considered more important for that specific document
compared to the overall corpus.
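Step 6 can be sketched as a simple sort over a document's score dictionary (the scores below are illustrative placeholders, not computed values):

```python
# Rank a document's terms by TF-IDF score, highest first.
scores = {"good": 0.0, "boy": 0.203}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
# The highest-ranked term is the most distinctive for this document.
```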

Step 7: Apply TF-IDF to Specific Tasks


•Document Retrieval: Find documents most relevant to a search query by comparing TF-IDF values
for query terms.
•Text Classification: Use TF-IDF values as features in a machine learning model to classify documents
into different categories.
•Keyword Extraction: Identify important terms in a document by ranking words based on their TF-IDF
scores.
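The steps above can be sketched end to end in a few lines (a minimal implementation assuming pre-tokenized documents; real systems would add the preprocessing from Step 2):

```python
import math
from collections import Counter

def tfidf_scores(corpus):
    """corpus: list of tokenized documents. Returns one {term: score} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    all_scores = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        # TF-IDF(t, d) = TF(t, d) * IDF(t)
        all_scores.append({term: (count / total) * math.log(n_docs / df[term])
                           for term, count in counts.items()})
    return all_scores

docs = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
scores = tfidf_scores(docs)
# "good" occurs in every document, so its IDF (and TF-IDF) is 0 everywhere.
```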
Let’s consider these three sentences:
1. He is a Good Boy
2. She is a Good Girl
3. Both are Good Boy and Girl respectively
After applying regular expressions, stop-word removal, and other functions from the NLTK library, we get the purified versions of these three sentences:
1. Good Boy
2. Good Girl
3. Good Boy Girl
Now, let’s consider the TF (Term Frequency) calculation.
Take the word “Good” in sentence 1. As we know,
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

The word “Good” appears once in sentence 1, and sentence 1 (“Good Boy”) contains 2 terms in total, so the TF value of “Good” in sentence 1 is TF(“Good”) = 1/2 = 0.5.
Now, let’s consider the TF value of each word with reference to each sentence, in a tabular format:

Word | Sentence 1 | Sentence 2 | Sentence 3
Good | 1/2        | 1/2        | 1/3
Boy  | 1/2        | 0          | 1/3
Girl | 0          | 1/2        | 1/3
Now, let’s consider the second component of TF-IDF, the IDF (Inverse Document Frequency) of each word with respect to each sentence. As we know,

IDF(t) = log(Total number of sentences / Number of sentences containing the term t)
Again, take the word “Good”. The total number of sentences (documents) is 3, and “Good” appears in all 3 of them, so the number of sentences containing the term “Good” is 3. The IDF value of “Good” is therefore log(3/3) = 0.
Now, the IDF value of each word, in tabular form:

Word | IDF
Good | log(3/3) = 0
Boy  | log(3/2) ≈ 0.405
Girl | log(3/2) ≈ 0.405
TF-IDF Calculation:
We now have both the TF (Term Frequency) and IDF (Inverse Document Frequency) values for each word in each sentence.
So, finally, the TF-IDF value for each word is TF(value) × IDF(value).
The TF-IDF value of each word, in tabular form:

Word | Sentence 1            | Sentence 2            | Sentence 3
Good | 0                     | 0                     | 0
Boy  | (1/2) × 0.405 ≈ 0.203 | 0                     | (1/3) × 0.405 ≈ 0.135
Girl | 0                     | (1/2) × 0.405 ≈ 0.203 | (1/3) × 0.405 ≈ 0.135

 High TF-IDF score indicates that the term is frequent in a particular document but rare across the
corpus, making it significant for that document.
 Low TF-IDF score implies that the term is either not very frequent in the document or common
across the entire corpus, reducing its relevance.
So, as a conclusion: the word “Good” appears in all 3 sentences, and as a result its TF-IDF value is zero everywhere, while the word “Boy” appears in only 2 of the 3 sentences. Consequently, in sentence 1 the value (importance) of the word “Boy” is greater than that of the word “Good”.

TF-IDF thus gives a specific value, or importance, to each word in a paragraph, and terms with higher weight scores are considered more important. This is why TF-IDF has largely replaced the Bag-of-Words approach, whose major disadvantage is that it gives the same value to every word occurring in a sentence or paragraph.

TF-IDF was invented for document search and can be used to deliver results that are most relevant to what you’re searching for. Imagine you have a search engine and somebody looks for “Dog”. The results will be displayed in order of relevance: the most relevant dog-related articles will be ranked higher because TF-IDF gives the word “Dog” a higher score in those documents.
 Applications of TF-IDF:
•Document retrieval systems: Search engines use TF-IDF to rank documents based on the relevance
of the search query.
•Text classification: TF-IDF is used as a feature for training machine learning models in text
classification tasks.
•Keyword extraction: TF-IDF helps identify key terms in documents for summarization or keyword
extraction.
•Content filtering: TF-IDF can help recommend articles or documents by comparing the importance of
terms within different pieces of text.

 Limitations of TF-IDF:
•Not context-aware: TF-IDF does not account for the meaning or context of the words, as it relies
solely on frequency.
•Synonym handling: It doesn't recognize synonyms (e.g., "car" and "automobile" will be treated as
separate terms).
•Ignoring semantic relationships: TF-IDF doesn't understand relationships between words (e.g., "cat"
and "animal" will be treated independently).
