
Chapter 2: Text Operations
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Kibrom T
Statistical Properties of Text

 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.

 A few words are very common.
 The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
 Most words are very rare.
 Half the words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”).
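Both properties can be checked on any small corpus with a few lines of Python. The sketch below uses only the standard library; the toy text is invented for illustration, and on a realistic corpus the skew is far stronger.

# Inspecting word-frequency skew on a toy corpus (illustrative only).
from collections import Counter

text = ("the cat sat on the mat the dog sat on the log "
        "a cat and a dog met near the mat")
tokens = text.split()
freq = Counter(tokens)
total = sum(freq.values())

# A few words are very common: share taken by the 2 most frequent words.
top2 = freq.most_common(2)
print(top2)                              # [('the', 5), ('cat', 2)]
print(sum(c for _, c in top2) / total)   # fraction of all word occurrences

# Most words are rare: fraction of the vocabulary occurring only once.
hapax = [w for w, c in freq.items() if c == 1]
print(len(hapax) / len(freq))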
Sample Word Frequency Data
Text Operations
 Not all words in a document are equally significant to represent the
contents/meanings of a document.
 Some words carry more meaning than others.
 Noun words are the most representative of a document's content.
 Therefore, one needs to preprocess the text of a document in a collection
to be used as index terms.
 Using the set of all words in a collection to index documents creates too
much noise for the retrieval task.
 Reducing noise means reducing the number of words that can be used to refer to the document.
 Preprocessing is the process of controlling the size of the vocabulary or
the number of distinct words used as index terms.
 Preprocessing will lead to an improvement in the information retrieval
performance.
 However, some search engines on the Web omit preprocessing.
 Every word in the document is an index term.
Text Operations

• One example of a search engine that omits preprocessing and treats every word in a document as an index term is the Simple Search Service (SSS) provided by Amazon Web Services (AWS).
• SSS allows users to search unstructured text data stored in Amazon S3 buckets without requiring any preprocessing or schema definition. The search engine indexes every word in the document and returns results based on exact keyword matches.

• This means that the search results may not always be relevant to
the user's query, as the search engine does not take into account
synonyms, stemming, or other language processing techniques.
Text Operations …
 Text operations are the process of transforming text into logical representations.
 Text operations refer to the various methods used to process,
analyse, and manipulate textual data in order to extract relevant
information and improve the effectiveness of IR systems.
 The main operations for selecting index terms, i.e. to choose
words/stems (or groups of words) to be used as indexing terms
are:
 Lexical analysis/tokenization of the text:- handling digits, hyphens, punctuation marks, and the case of letters.
 Elimination of stop words:- filter out words which are not useful
in the retrieval process.
 Stemming words:- remove affixes (prefixes and suffixes)
 Construction of term categorization structures, such as a thesaurus, to capture relationships that allow the expansion of the original query with related terms.
Generating Document Representatives
 Text Processing System:
 Input text:- full text, abstract or title.
 Output:- a document representative adequate for use in an
automatic retrieval system.
 The document representative consists of a list of class names,
each name representing a class of words occurring in the total
input text.
 A document will be indexed by a name if one of its significant
words occurs as a member of that class.
documents → tokenization → stop word elimination → stemming → thesaurus → index terms
Generating Document Representatives

• For instance, if a document contains significant words related to sports, it will be indexed under the sports category. Similarly, if a document contains words related to politics, it will be indexed under the politics category.
• For example, suppose you have a document that talks about
various types of flowers and their properties. If the document
contains the word "rose," which is a significant word related to
the "flowers" category, the document will be indexed under the
"flowers" category.
• Similarly, if a document contains the word "election" or "voting," it will be indexed under the "politics" category.
Lexical Analysis/Tokenization of Text
 Change text of the documents into words to be adopted as index
terms.
 Objective - identify words in the text:
 Digits, hyphens, punctuation marks, case of letters.
 Numbers alone are usually not good index terms (like 1910, 1999); but some, such as 510 B.C., are unique and worth keeping.
 Hyphens – break up hyphenated words (e.g. state-of-the-art → state of the art); but some words, e.g. gilt-edged, B-49, are unique and require their hyphens.
 Punctuation marks – remove entirely unless significant, e.g. in program code: x.exe vs. xexe.
 Case of letters – usually not important; convert all letters to upper or lower case.
Tokenization

 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: Tokens (a token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing):
 Friends
 Romans
 and
 Countrymen
 Each such token is now a candidate for an index entry, after further processing.
 But which tokens are valid, and which should be omitted?
Issues in Tokenization

 One word or multiple: How do you decide whether it is one token or two or more?
 Hewlett-Packard → Hewlett and Packard as two tokens?
 State-of-the-art: break up the hyphenated sequence?
 San Francisco, Los Angeles
 Addis Ababa, Arba Minch
 lowercase, lower-case, lower case?
 Data base, database, data-base
 Numbers:
 Dates (3/12/19 vs. Mar. 12, 2019);
 Phone numbers,
 IP addresses (100.2.86.144)
Issues in Tokenization

 How to handle special cases involving apostrophes, hyphens, etc.? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
 However, frequently they are not.
 The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
 Generally, don't index numbers as text, but they are often very useful.
 Systems will often index “meta-data”, including creation date, format, etc., separately.
 Issues of tokenization are language specific.
 Requires the language to be known.
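To make these trade-offs concrete, here is a small illustrative sketch; the regular expressions are assumptions for demonstration, not a standard, and they show how the chosen token definition changes what happens to hyphens and apostrophes.

# How the token definition changes the result (illustrative patterns).
import re

text = "Hewlett-Packard's state-of-the-art C++ compiler, released 3/12/19."

# Simplest approach: case-insensitive unbroken alphabetic strings only.
print(re.findall(r"[a-z]+", text.lower()))

# A richer (still simplistic) pattern that keeps internal hyphens and
# apostrophes, so "Hewlett-Packard's" survives as one token.
print(re.findall(r"[a-zA-Z]+(?:['-][a-zA-Z]+)*", text))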
Exercise: Tokenization

 The cat slept peacefully in the living room. It’s a very old cat.
 Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
//example Python code for tokenization:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "This is an example sentence. This is another example sentence!"

# Tokenize into words
words = word_tokenize(text)

# Tokenize into sentences
sentences = sent_tokenize(text)

# Print the tokens
print(words)
# ['This', 'is', 'an', 'example', 'sentence', '.', 'This', 'is', 'another', 'example', 'sentence', '!']
print(sentences)
# ['This is an example sentence.', 'This is another example sentence!']
Elimination of Stopwords

 Stopwords are extremely common words across document collections that have no discriminatory power.
 They may occur in 80% of the documents in a collection.
 They appear to be of little value in helping select documents matching a user need, and they need to be filtered out from potential index terms.
 Examples of stop-words are articles, pronouns, prepositions,
conjunctions, etc.:
 Articles (a, an, the); pronouns: (I, he, she, it, their, his)
 Some prepositions (on, of, in, about, besides, against, over),
 Conjunctions/connectors (and, but, for, nor, or, so, yet),
 Verbs (is, are, was, were),
 Adverbs (here, there, out, because, soon, after) and
 Adjectives (all, any, each, every, few, many, some) can also be treated as stopwords.
 Stop-words are language dependent.
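A minimal stop-word filter can be written with NLTK's built-in English stop-word list; the sketch below assumes the nltk package is installed and reuses the sentence from the tokenization exercise above.

# Filtering stop words with NLTK's English stop-word list.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_set = set(stopwords.words('english'))
text = "the cat slept peacefully in the living room"
content = [w for w in text.split() if w not in stop_set]
print(content)   # ['cat', 'slept', 'peacefully', 'living', 'room']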
Stop words
 Intuition:
 Stop-words have little semantic content; It is typical to remove such
high-frequency words.

 In general, stop-words can account for a significant portion of the text, ranging from 35% to 50% or more, depending on the text corpus.
 For example, in a typical English text corpus, stop-words can make up around 40% of the total words. This means that out of every 100 words in the text, around 40 of them are likely to be stop-words.
 Because stopwords take up such a large share of the text, removing them reduces document (and index) size by 30–50%.
 Smaller indices for information retrieval.
 Good compression techniques for indices: The 30 most common words
account for 30% of the tokens in written text.
 Better approximation of importance for classification, summarization, etc.
How to determine a list of stop words?
 One method: Sort terms (in decreasing order) by collection frequency
and take the most frequent ones.
 Problem: In a collection about insurance practices, “insurance”
would be a stop word.

 Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
 Why do we need stop lists: with a stop list, we can exclude the commonest words from the index terms entirely.
 With the removal of stopwords, we get a better approximation of term importance for classification, summarization, etc.
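The first method is easy to sketch in a few lines. The toy collection below is invented, and it deliberately reproduces the problem noted above: the domain term "insurance" tops the frequency list.

# Frequency-based stop list: take the k most frequent collection terms.
from collections import Counter

collection = [
    "insurance claims must be filed with the insurance office",
    "the insurance premium depends on the insurance policy",
]
freq = Counter(w for doc in collection for w in doc.split())
stoplist = [w for w, _ in freq.most_common(3)]
print(stoplist)   # ['insurance', 'the', 'claims'] -- the domain term wins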
Stop words

 Stop word elimination used to be standard in older IR systems.

 But the trend is moving away from doing this. Most web search engines index stop words:
 Good query optimization techniques mean you pay little at query
time for including stop words.
 You need stop words for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stop words might reduce recall (e.g. “To be or not
to be” – all eliminated except “be” – no or irrelevant retrieval)
Stemming/Morphological Analysis
 Stemming reduces tokens to their “root” form of words to recognize
morphological variation.
 The process involves removal of affixes (i.e. prefixes and suffixes) with
the aim of reducing variants to the same stem.
 Often removes inflectional and derivational morphology of a word.
 Inflectional morphology: vary the form of words in order to express grammatical
features, such as singular/plural or past/present tense. E.g. Boy → boys, cut → cutting.
 Derivational morphology: makes new words from old ones. E.g. creation is formed
from create , but they are two separate words. And also, destruction → destroy.
 Stemming is language dependent:
 Correct stemming is language specific and can be complex.

For example, “compressed and compression are both accepted” becomes, after stemming, “compress and compress are both accept”.
Stemming/Morphological Analysis
 "Removing inflectional and derivational morphology of a word" means reducing a word to its simplest form by stripping away any suffixes, prefixes, or other affixes that modify the base form of the word to create new forms or to change its grammatical function.
 Inflectional morphology example: The verb "run" can be inflected to reflect
different tenses, such as "ran" for past tense or "running" for present participle.
 Derivational morphology example: The verb "run" can be derived into the noun
"runner" by adding the suffix "-er", which indicates a person who performs the
action of the verb.
 To remove inflectional and derivational morphology from the above examples:
 Inflectional morphology example: "Running" can be reduced to its base form "run" by removing the inflectional suffix "-ing"; "ran" is an irregular form, so it requires a dictionary lookup rather than suffix removal.
 Derivational morphology example: "Runner" can be reduced to its base form "run" by removing the derivational suffix "-er".
 Removing inflectional and derivational morphology is a common step in processing tasks in
information retrieval systems.
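The point about irregular forms can be seen directly with NLTK's PorterStemmer (assuming nltk is installed): a purely rule-based stemmer handles "running" but cannot map the irregular "ran" back to "run".

# Rule-based stemming handles regular suffixes but not irregular forms.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["run", "running", "runner", "ran"]:
    print(w, "->", stemmer.stem(w))
# run -> run, running -> run, runner -> runner, ran -> ran

Note that Porter also leaves "runner" untouched: its rules only strip "-er" when the remaining stem is long enough.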
Stemming

 Stemming is the process of reducing inflected (or sometimes derived) words to their word stem.
 A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
 Example: ‘connect’ is the stem of {connected, connecting, connection, connections}.
 Thus, [automate, automatic, automation] all reduce to automat.
 A class name is assigned to a document if and only if one of its
members occurs as a significant word in the text of the document.
 A document representative then becomes a list of class names,
which are often referred as the documents index terms/keywords.
 Queries are handled in the same way.
Ways to implement stemming

• The first approach is to create a big dictionary that maps words to their
stems.
– Dictionary-based stemming: This method uses a pre-defined dictionary or lookup
table to map each word to its base form. This approach is useful for irregular
languages where rule-based stemming may not work effectively.
• The second approach is to use a set of rules that extract stems from
words.
– Rule-based stemming: This method involves creating a set of rules to remove
suffixes from words to reduce them to their base form. This approach is effective
for languages with regular grammatical rules, such as English.
• The third approach is to combine the two methods.
– Hybrid approach: This method combines both rule-based and dictionary-based
stemming. It uses a set of rules to apply stemming to most words and a lookup
table to handle irregular cases.
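As a concrete illustration of the hybrid approach, here is a hypothetical sketch; the irregular-form dictionary and suffix rules below are invented for illustration and are far smaller than anything usable in practice.

# Hypothetical hybrid stemmer: dictionary for irregulars, rules otherwise.
IRREGULAR = {"ran": "run", "mice": "mouse", "went": "go"}
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def hybrid_stem(word):
    word = word.lower()
    if word in IRREGULAR:               # dictionary-based step
        return IRREGULAR[word]
    for suffix, repl in SUFFIX_RULES:   # rule-based step
        if word.endswith(suffix) and len(word) - len(suffix) > 1:
            return word[: len(word) - len(suffix)] + repl
    return word

print([hybrid_stem(w) for w in ["caresses", "ponies", "ran", "cats", "walking"]])
# ['caress', 'poni', 'run', 'cat', 'walk']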
Ways to implement stemming

Rule-based stemming:
- Advantages: simple to implement; works well for languages with regular grammatical rules.
- Disadvantages: not effective for languages with complex grammatical rules or irregularities; can produce stem words that are not actual words.

Dictionary-based stemming:
- Advantages: effective for languages with irregular grammatical rules; produces more accurate stem words.
- Disadvantages: requires a large pre-defined dictionary, which can be time-consuming to create and maintain; may not be suitable for languages with constantly evolving vocabularies.

Hybrid stemming:
- Advantages: combines the strengths of rule-based and dictionary-based stemming; produces more accurate stem words; effective for a wider range of languages.
- Disadvantages: can be more complex to implement; may require additional computational resources compared to the other two approaches.
Porter Stemmer

 Stemming is the operation of stripping the suffixes from a word, leaving its stem.
 Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
 In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right.
 Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer

 It is the most common algorithm for stemming English words to their common grammatical root.
 It uses a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
 SSES → SS      caresses → caress
 IES → I        ponies → poni
 SS → SS        caress → caress
 S → (nil)      cats → cat
 EMENT → (nil)  (delete the final "ement" if what remains is longer than 1 character)
 replacement → replac
 cement → cement
Porter stemmer

 While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:
 agreed → agree
 disabled → disable
 matting → mat
 mating → mate
 meeting → meet
 milling → mill
 messing → mess
 meetings → meet
 feed → feed
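These rules can be checked against NLTK's implementation of the Porter stemmer (assuming nltk is installed). Note that the full algorithm runs every step, so some words are trimmed further than the single-step examples above; for instance, "agreed" ends up as "agre" once later steps remove the final "e".

# Verifying several examples with NLTK's Porter stemmer (full algorithm).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["caresses", "ponies", "cats", "matting", "mating",
          "meetings", "milling", "messing", "feed"]:
    print(w, "->", stemmer.stem(w))
# caresses -> caress, ponies -> poni, cats -> cat, matting -> mat,
# mating -> mate, meetings -> meet, milling -> mill, messing -> mess,
# feed -> feed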
Stemming: challenges

 May produce unusual stems that are not English words:
 e.g. removing ‘UAL’ from FACTUAL and EQUAL.
 May conflate (reduce to the same token) words that are actually distinct:
 “computer”, “computational”, “computation” are all reduced to the same token “comput”.
 May not recognize all morphological derivations.

Assignment

 In groups of three, study in detail and implement Porter's stemming algorithm in Python.
• I expect a detailed explanation of Porter's stemming algorithm, along with a Python implementation. Porter's stemming algorithm is a widely used algorithm for English-language stemming. It involves a set of rules for removing suffixes from words to obtain their root form.
• The algorithm consists of different phases, each of which applies a set of rules to the word in order to remove a particular type of suffix.
Thesaurus/Thesauri

 A thesaurus is a type of vocabulary that lists words and their synonyms, antonyms, and related words.
 Mostly, full-text searching cannot be accurate, since different authors may select different words to represent the same concept.
 Problem: the same meaning can be expressed using different terms that are synonyms.
 Synonym: a word or phrase which has the same or nearly the same meaning as another word or phrase in the same language.
 Homonym: a word that sounds the same or is spelled the same as another word but has a different meaning.
 How can it be achieved that, for the same meaning, identical terms are used in the index and the query?
Thesaurus/Thesauri

 A thesaurus is a type of vocabulary that lists words and their synonyms, antonyms, and related words.
– Synonyms - words that have the same or similar meaning: Example:
Happy, Joyful, Cheerful
– Antonyms - words that have opposite meanings:
– Example: Hot, Cold
– Homonyms - words that have the same spelling or pronunciation as
another word but have a different meaning:
– Example: Bat (the animal) and bat (the sports equipment)
Thesaurus/Thesauri

 Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example “broader” and “related”) are made explicit.

 A thesaurus contains terms and relationships between terms.
 IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to demonstrate inter-term relationships.
 e.g., car = automobile, truck, bus, taxi, motor vehicle
 color = colour, paint
Aim of Thesaurus

 A thesaurus tries to control the use of the vocabulary by showing a set of related words to handle synonyms and homonyms.
 The aims of a thesaurus are therefore:
 To provide a standard vocabulary for indexing and searching.
 The thesaurus rewrites terms into equivalence classes, and we index such equivalence classes.
 When the document contains automobile, index it under car as
well (usually, also vice-versa)
 To assist users with locating terms for proper query formulation:
When the query contains automobile, look under car as well for
expanding query.
 To provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs.
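A hypothetical sketch of this kind of query expansion is shown below; the equivalence classes are invented for illustration, following the car/automobile example above.

# Thesaurus-based query expansion (illustrative equivalence classes).
THESAURUS = {
    "car": {"automobile", "motor vehicle"},
    "automobile": {"car", "motor vehicle"},
}

def expand_query(terms):
    """Add every thesaurus equivalent of each query term."""
    expanded = set(terms)
    for t in terms:
        expanded |= THESAURUS.get(t, set())
    return expanded

print(expand_query(["automobile", "rental"]))
# {'automobile', 'car', 'motor vehicle', 'rental'} (set order may vary)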
Thesaurus Construction

 Example: a thesaurus built to assist IR for searching cars and vehicles:
 Term: Motor vehicles:
 UF : Automobiles (UF - Use For)
 Cars
 Trucks
 BT: Vehicles (BT – Broader Term)
 RT: Road Engineering (RT - Related Term)
 Road Transport
Thesaurus Construction in IR

 Here's an example of how a thesaurus can be constructed in IR:


 Suppose we have a collection of documents related to gardening, and we
want to construct a thesaurus to help users find information more easily.
We can start by manually identifying a set of keywords related to
gardening, such as "plant", "flower", "soil", "water", "fertilizer",
"garden", "greenhouse", etc.
 For each keyword, we can then identify its synonyms, antonyms, and
related terms, as follows:
 "plant": synonyms: "seedling", "sapling", "shrub", "tree"; antonyms: "weed";
related terms: "transplant", "prune", "graft", "propagate".
 "flower": synonyms: "blossom", "bloom", "petal", "bud"; antonyms: "weed";
related terms: "arrangement", "perfume", "color", "pollination".
 "soil": synonyms: "earth", "dirt", "ground", "loam"; antonyms: "rock"; related
terms: "fertility", "texture", "acidity", "composition".
 "water": synonyms: "moisture", "rain", "irrigation"; antonyms: "drought"; related terms: "drainage",
"absorption", "evaporation", "condensation".
Thesaurus Construction in IR

 "fertilizer": synonyms: "manure", "compost", "mulch"; antonyms: "pesticide";


related terms: "nutrients", "application", "organic", "inorganic".
 "garden": synonyms: "yard", "lawn", "park"; antonyms: "desert"; related terms:
"landscape", "design", "maintenance", "harvest".
 "greenhouse": synonyms: "hothouse", "conservatory", "nursery"; antonyms:
"outdoor"; related terms: "temperature", "humidity", "ventilation", "lighting".
Once we have identified the synonyms, antonyms, and related terms for
each keyword,
we can organize them into a hierarchical structure, where broader terms
(e.g., "plant") are linked to narrower terms (e.g., "seedling") and
related terms (e.g., "transplant").
This hierarchical structure can be used to expand or narrow a search
based on the user's query, for example, by including synonyms or
related terms in the search to improve recall or by using narrower
terms to improve precision.
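A fragment of this gardening thesaurus can be represented as a small data structure supporting broadening and narrowing; the sketch below is illustrative and covers only the "plant" branch.

# A fragment of the gardening thesaurus as BT/NT/RT links (illustrative).
thesaurus = {
    "plant":    {"NT": ["seedling", "shrub", "tree"], "RT": ["transplant", "prune"]},
    "seedling": {"BT": ["plant"]},
    "shrub":    {"BT": ["plant"]},
    "tree":     {"BT": ["plant"]},
}

def narrow(term):
    """Narrower terms: useful for improving precision."""
    return thesaurus.get(term, {}).get("NT", [])

def broaden(term):
    """Broader terms: useful for improving recall."""
    return thesaurus.get(term, {}).get("BT", [])

print(narrow("plant"))      # ['seedling', 'shrub', 'tree']
print(broaden("seedling"))  # ['plant']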
More Example

 Example: a thesaurus built to assist IR in the field of computer science:
 TERM: natural languages
 UF: natural language processing (UF = used for)
 BT: languages (BT = broader term)
 NT: languages (NT = narrower term)
 TT: languages (TT = top term)
 RT: artificial intelligence (RT = related term(s))
 computational linguistic
 formal languages
 query languages
 speech recognition
Language-specificity

 Many of the above features embody transformations that are:
 Language-specific and
 Often, application-specific.
 These are “plug-in” addenda to the indexing process.
 Both open source and commercial plug-ins are available for handling these.
Index Term Selection

 The index language is the language used to describe documents and requests.
 Elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.
 If a full-text representation of the text is adopted, then all words in the text are used as index terms (full-text indexing).
 Otherwise, we need to select the words to be used as index terms in order to reduce the size of the index file, which is basic to designing an efficient IR search system.
Question & Answer

Thank You !!!
