Text Operations
Adama Science and Technology University
School of Electrical Engineering and
Computing
Department of CSE
Kibrom T
Statistical Properties of Text
• This means that the search results may not always be relevant to
the user's query, as the search engine does not take into account
synonyms, stemming, or other language processing techniques.
Text Operations …
Text operations are the process of transforming text into
logical representations.
Text operations refer to the various methods used to process,
analyse, and manipulate textual data in order to extract relevant
information and improve the effectiveness of IR systems.
The main operations for selecting index terms, i.e. choosing the
words/stems (or groups of words) to be used as indexing terms,
are:
• Lexical analysis/Tokenization of the text:- handling digits, hyphens,
punctuation marks, and the case of letters.
• Elimination of stop words:- filtering out words which are not useful
in the retrieval process.
• Stemming words:- removing affixes (prefixes and suffixes).
• Construction of term categorization structures such as a thesaurus,
to capture relationships that allow the original query to be
expanded with related terms.
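As a rough illustration of the last operation, the sketch below expands a query with related terms taken from a thesaurus. The `THESAURUS` contents and the `expand_query` helper are hypothetical and exist only for this example; a real system would use a curated or automatically built thesaurus.

```python
# Hypothetical toy thesaurus: each term maps to a set of related terms.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "film": {"movie"},
}


def expand_query(terms, thesaurus):
    """Return the original terms followed by related terms, without duplicates."""
    expanded = list(terms)
    seen = set(terms)
    for term in terms:
        # sorted() keeps the output order deterministic
        for related in sorted(thesaurus.get(term, ())):
            if related not in seen:
                seen.add(related)
                expanded.append(related)
    return expanded


print(expand_query(["cheap", "car"], THESAURUS))
# ['cheap', 'car', 'automobile', 'vehicle']
```

Terms that have no thesaurus entry ("cheap" here) pass through unchanged, so expansion can only add terms to the query.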
Generating Document
Representatives
Text Processing System:
Input text:- full text, abstract or title.
Output:- a document representative adequate for use in an
automatic retrieval system.
The document representative consists of a list of class names,
each name representing a class of words occurring in the total
input text.
A document will be indexed by a name if one of its significant
words occurs as a member of that class.
documents → tokenization → stop words → stemming → thesaurus → index terms
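The processing chain can be sketched end to end in a few lines. This is a minimal illustration, not a production indexer: the stop list, the suffix list, and the `WORD_CLASSES` table are all hypothetical toy data, and the suffix stripping is far cruder than a real stemmer.

```python
import re

# Hypothetical toy resources, for illustration only.
STOP_WORDS = {"the", "a", "in", "it", "s", "very"}
SUFFIXES = ("ing", "ly", "ed", "s")                # crude suffix list
WORD_CLASSES = {"slept": "sleep", "kitten": "cat"}  # word -> class name


def tokenize(text):
    """Lowercase the text and split it into runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def document_representative(text):
    """Return the sorted list of class names that index the document."""
    stems = [stem(t) for t in remove_stop_words(tokenize(text))]
    return sorted({WORD_CLASSES.get(s, s) for s in stems})


doc = "The cat slept peacefully in the living room. It's a very old cat."
print(document_representative(doc))
# ['cat', 'liv', 'old', 'peaceful', 'room', 'sleep']
```

Note that "living" becomes the non-word "liv": a stem only has to be a consistent class name, not a dictionary word.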
Generating Document
Representatives
The cat slept peacefully in the living room. It’s a very old cat.
Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
# Example Python code for tokenization:
import nltk
# Newer NLTK releases may need nltk.download('punkt_tab') instead
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
text = "This is an example sentence. This is another example sentence!"
# Tokenize into words
words = word_tokenize(text)
# Tokenize into sentences
sentences = sent_tokenize(text)
# Print the tokens
print(words)
print(sentences)
OUTPUT of print(words):
['This', 'is', 'an', 'example', 'sentence', '.', 'This', 'is', 'another',
'example', 'sentence', '!']
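Tokenization decisions about punctuation, hyphens, and case can also be illustrated without NLTK. The sketch below uses a hypothetical regular expression (`TOKEN_RE`) that keeps apostrophes and hyphens only when they sit inside a word; it is one possible policy among many, not the rule NLTK applies.

```python
import re

# Hypothetical pattern: a token is a run of letters/digits, optionally joined
# by internal apostrophes or hyphens (O'Neill, aren't, state-of-the-art).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def simple_tokenize(text, lowercase=True):
    """Split text into tokens, dropping punctuation and optionally folding case."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens


sentence = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")
print(simple_tokenize(sentence))
# ['mr', "o'neill", 'thinks', 'that', 'the', 'boys', 'stories', 'about',
#  "chile's", 'capital', "aren't", 'amusing']
```

Under this policy "boys'" loses its trailing apostrophe while "Chile's" keeps its internal one; a different pattern would make different choices, which is exactly why tokenization rules have to be decided explicitly.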
Elimination of Stop Words
Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
Why do we need stop lists: With a stop list, we can compare candidate
terms against the list and exclude the most common words from the index
terms entirely.
With stop words removed, we get a better approximation of term
importance for classification, summarization, etc.
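A stop list check is a simple set-membership test. The sketch below assumes a small hypothetical `STOP_LIST`; real lists are longer and language specific.

```python
# Hypothetical stop list of articles, pronouns, and other function words.
STOP_LIST = {"a", "an", "the", "it", "he", "she", "they",
             "is", "in", "of", "to", "and"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_list]


tokens = "The cat slept peacefully in the living room".split()
index_terms = remove_stop_words(tokens)
print(index_terms)
# ['cat', 'slept', 'peacefully', 'living', 'room']
```

Storing the list as a set keeps each lookup constant time on average, which matters when every token of every document is checked.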
Stop words
But the trend is moving away from stop word removal. Most web search
engines index stop words:
Good query optimization techniques mean you pay little at query
time for including stop words.
You need stop words for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to London”
Elimination of stop words might reduce recall (e.g. for “To be or not
to be”, every word except “be” is eliminated, so retrieval returns
nothing or irrelevant results).
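The effect on queries is easy to reproduce. The sketch below applies a hypothetical stop list (chosen so that "be" survives, as in the example above) to the sample queries; the `content_terms` helper is an illustrative name, not part of any library.

```python
# Hypothetical stop list; "be" is deliberately left out.
STOP_LIST = {"to", "from", "or", "not", "of", "it", "the", "a"}


def content_terms(query, stop_list=STOP_LIST):
    """Lowercase the query and drop every word found in the stop list."""
    return [w for w in query.lower().split() if w not in stop_list]


for q in ["King of Denmark", "Let it be", "To be or not to be",
          "flights to London", "flights from London"]:
    print(q, "->", content_terms(q))
# King of Denmark -> ['king', 'denmark']
# Let it be -> ['let', 'be']
# To be or not to be -> ['be', 'be']
# flights to London -> ['flights', 'london']
# flights from London -> ['flights', 'london']
```

After stop word removal, "flights to London" and "flights from London" become the same query, so the relation the user asked about is lost.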
Stemming/Morphological
analysis
Stemming reduces tokens to their “root” form so that morphological
variants of a word are recognized as the same term.
The process involves removal of affixes (i.e. prefixes and suffixes) with
the aim of reducing variants to the same stem.
It often removes both the inflectional and the derivational morphology of a word.
Inflectional morphology: varies the form of words in order to express grammatical
features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
Derivational morphology: makes new words from old ones. E.g. creation is formed
from create, but they are two separate words. Likewise, destroy → destruction.
Stemming is language dependent:
Correct stemming is language specific and can be complex.
• The first approach is to create a big dictionary that maps words to their
stems.
– Dictionary-based stemming: This method uses a pre-defined dictionary or lookup
table to map each word to its base form. This approach is useful for irregular
languages where rule-based stemming may not work effectively.
• The second approach is to use a set of rules that extract stems from
words.
– Rule-based stemming: This method involves creating a set of rules to remove
suffixes from words to reduce them to their base form. This approach is effective
for languages with regular grammatical rules, such as English.
• The third approach is to combine the two methods.
– Hybrid approach: This method combines both rule-based and dictionary-based
stemming. It uses a set of rules to apply stemming to most words and a lookup
table to handle irregular cases.
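The three approaches can be compared side by side in a short sketch. The lookup table `IRREGULAR` and the rule list `SUFFIX_RULES` are hypothetical toy data, much smaller and cruder than anything a real stemmer would use.

```python
# Hypothetical lookup table for irregular forms (dictionary-based stemming).
IRREGULAR = {"went": "go", "mice": "mouse", "children": "child"}

# Hypothetical suffix rules as (suffix, replacement), tried in order.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def dictionary_stem(word):
    """Dictionary-based: look the word up; unknown words are returned unchanged."""
    return IRREGULAR.get(word, word)


def rule_stem(word):
    """Rule-based: apply the first suffix rule that leaves at least two characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


def hybrid_stem(word):
    """Hybrid: the lookup table handles irregular forms, rules handle the rest."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    return rule_stem(word)


for w in ["boys", "cutting", "ponies", "mice", "went"]:
    print(w, rule_stem(w), dictionary_stem(w), hybrid_stem(w))
# boys boy boys boy
# cutting cutt cutting cutt
# ponies poni ponies poni
# mice mice mouse mouse
# went went go go
```

The rules alone miss "mice" and "went", the table alone misses every regular plural, and the hybrid covers both. The rule-based output "cutt" also shows that a stem need not be an actual word.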
Stemming
Ways to implement stemming
Approach | Advantages | Disadvantages
Rule-based | - Simple to implement; - Works well for languages with regular grammatical rules | - Not effective for languages with complex grammatical rules or irregularities; - Can produce stem words that are not actual words
Dictionary-based | - Effective for languages with irregular grammatical rules; - Produces more accurate stem words | - Requires a large pre-defined dictionary, which can be time-consuming to create and maintain; - May not be suitable for languages with constantly evolving vocabularies
Hybrid | - Combines strengths of rule-based and dictionary-based stemming; - Produces more accurate stem words; - Effective for a wider range of languages | - Can be more complex to implement; - May require additional computational resources compared to the other two approaches
Porter Stemmer
04/01/25
Thank You !!!