2 Text Operations
Lexical Analysis / Tokenization of Text
Convert the text of the documents into the words that will be adopted as index terms.
Objective: identify the words in the text, which means deciding how to treat digits, hyphens, punctuation marks, and the case of letters.
Numbers on their own are usually not good index terms (e.g. 1910, 1999); but a number in context, such as 510 B.C., can be a unique, meaningful term.
Hyphens: usually broken apart (e.g. state-of-the-art = state of the art), but some terms, e.g. gilt-edged, B-49, are unique words that require the hyphen.
Punctuation marks: removed entirely unless significant, e.g. in program code, x.exe and xexe are different tokens.
Case of letters: usually not important; convert everything to upper or lower case.
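A minimal sketch of these four decisions in Python; the exception list (terms whose hyphen or internal punctuation is significant) is an illustrative assumption, not part of the slides.

```python
import re

# Illustrative exception list: tokens whose hyphens or internal punctuation
# are significant and must be kept as-is (assumed for this sketch).
KEEP_AS_IS = {"gilt-edged", "b-49", "x.exe"}

def normalize(token):
    """Apply the decisions above: case folding, hyphen splitting,
    punctuation stripping, and dropping bare numbers."""
    t = token.lower()                      # case of letters: fold to lower case
    if t in KEEP_AS_IS:                    # significant hyphen / punctuation
        return t
    t = t.replace("-", " ")                # hyphens: break the word apart
    t = re.sub(r"[^\w\s]", "", t)          # punctuation marks: remove
    if t.isdigit():                        # bare numbers: poor index terms
        return None
    return t or None

print(normalize("State-of-the-art"))   # -> 'state of the art'
print(normalize("gilt-edged"))         # -> 'gilt-edged' (hyphen preserved)
print(normalize("1910"))               # -> None (dropped)
```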
Tokenization
Example sentences whose periods, apostrophes, and possessives make splitting into tokens non-trivial:
The cat slept peacefully in the living room. It’s a very old cat.
Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
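A rough regex-based tokenizer for the second sentence above, just to make the difficulties concrete; the pattern is an illustrative sketch, and its remaining mistakes show why tokenization is not trivial.

```python
import re

text = ("Mr. O'Neill thinks that the boys' stories about "
        "Chile's capital aren't amusing.")

# Naive pattern: a run of letters, optionally continued by an apostrophe or
# period plus more letters (to keep "O'Neill", "aren't"), plus an optional
# trailing period (to keep "Mr."). Purely illustrative.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['.][A-Za-z]+)*\.?")

print(TOKEN_RE.findall(text))
# -> ['Mr.', "O'Neill", 'thinks', 'that', 'the', 'boys', 'stories',
#     'about', "Chile's", 'capital', "aren't", 'amusing.']
# It keeps "Mr." and "O'Neill" intact but drops the possessive apostrophe
# from "boys'" and wrongly keeps the sentence-final period on "amusing."
```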
Elimination of Stopwords
Intuition:
Stopwords carry little semantic content, so it is typical to remove such high-frequency words.
Stopwords can take up around 50% of the text, so removing them reduces the size of the document representation by roughly 30-50%.
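One way to make the high-frequency intuition concrete is to count token frequencies over the collection and treat the most common tokens as stop-word candidates; the toy corpus and the cut-off of three words below are illustrative assumptions.

```python
from collections import Counter

# Toy corpus; a real system would count over the whole collection.
docs = [
    "the cat slept in a corner of the room",
    "a dog slept in the kitchen",
    "the cat and a dog are in the house",
]

counts = Counter(token for doc in docs for token in doc.split())

# Treat the k most frequent tokens as stop-word candidates (k = 3 is arbitrary).
stopword_candidates = [tok for tok, _ in counts.most_common(3)]
print(stopword_candidates)   # -> ['the', 'in', 'a']
```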
Another approach: build a stop-word list containing a set of articles, pronouns, etc.
Why we need stop lists: with an explicit stop list we can compare tokens against it and exclude the commonest words from the index terms entirely.
But the trend is moving away from doing this. Most web search engines index stop words:
Good query-optimization techniques mean you pay little at query time for including stop words.
You need stop words for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to London”
Elimination of stop words can, however, reduce recall (e.g. for “To be or not to be”, every term except “be” is eliminated, so retrieval returns nothing or only irrelevant documents); a small sketch follows.
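A small sketch of applying a stop list and of the recall problem just described; the stop list here is a tiny illustrative subset, not any standard list (note that "be" is deliberately left off it, mirroring the example above).

```python
# Tiny illustrative stop list; real lists hold a few hundred words.
STOPWORDS = {"a", "an", "the", "of", "to", "or", "not", "it", "in", "and",
             "is", "that", "this", "for", "on"}

def remove_stopwords(tokens):
    """Drop every token that appears on the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("King of Denmark".split()))
# -> ['King', 'Denmark']   (the phrase query loses its structure)

print(remove_stopwords("To be or not to be".split()))
# -> ['be', 'be']          (only "be" survives; the query is effectively lost)
```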
Stemming/Morphological Analysis
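As a minimal illustration of stemming (reducing inflected word forms to a common stem), the sketch below uses NLTK's Porter stemmer; that is one common choice and an assumption here, since the slides do not prescribe a particular stemmer, and the word list is just an example.

```python
# Assumes the NLTK package is installed; the Porter stemmer needs no extra data.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))
# All five variants reduce to the stem 'connect', so they map to one index term.
```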
Thank You !!!