
Chapter 2: Text Operations

Adama Science and Technology University


School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Statistical Properties of Text

 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system and can be used
to select suitable term weights and other aspects of the system.

 A few words are very common.
 The 2 most frequent words (e.g. “the”, “of”) can account for about
10% of word occurrences.
 Most words are very rare.
 Half the words in a corpus appear only once; such words are called
hapax legomena (Greek for “read only once”).
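As a quick illustration, here is a minimal Python sketch (standard library only; the file name corpus.txt is a hypothetical stand-in for any plain-text collection) that measures both properties:

    from collections import Counter
    import re

    # Hypothetical corpus file; any plain-text collection would do.
    text = open("corpus.txt", encoding="utf-8").read().lower()
    words = re.findall(r"[a-z]+", text)

    freq = Counter(words)
    total = sum(freq.values())

    # Share of all word occurrences covered by the 2 most frequent words.
    top2 = sum(count for _, count in freq.most_common(2))
    print(f"Top 2 words cover {100 * top2 / total:.1f}% of occurrences")

    # Fraction of the vocabulary that occurs exactly once (hapax legomena).
    hapax = sum(1 for count in freq.values() if count == 1)
    print(f"{100 * hapax / len(freq):.1f}% of distinct words appear only once")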
Sample Word Frequency Data
Text Operations
 Not all words in a document are equally significant in representing the
contents/meaning of the document.
 Some words carry more meaning than others.
 Nouns tend to be the most representative of a document’s content.
 Therefore, one needs to preprocess the text of the documents in a collection
to select the index terms.
 Using the set of all words in a collection to index documents creates too
much noise for the retrieval task.
 Reducing noise means reducing the number of words which can be used to
refer to the document.
 Preprocessing is the process of controlling the size of the vocabulary, i.e.
the number of distinct words used as index terms.
 Preprocessing usually leads to an improvement in information retrieval
performance.
 However, some search engines on the Web omit preprocessing:
 Every word in the document is then an index term.
Text Operations …

 Text operations are the process of transforming text into
logical representations.
 The main operations for selecting index terms, i.e. for choosing the
words/stems (or groups of words) to be used as indexing terms, are:
 Lexical analysis/tokenization of the text:- handling digits, hyphens,
punctuation marks, and the case of letters.
 Elimination of stop words:- filtering out words which are not useful
in the retrieval process.
 Stemming words:- removing affixes (prefixes and suffixes).
 Construction of term categorization structures such as a thesaurus,
to capture relationships that allow expanding the original
query with related terms.
Generating Document Representatives
 Text Processing System:
 Input text:- full text, abstract or title.
 Output:- a document representative adequate for use in an
automatic retrieval system.
 The document representative consists of a list of class names,
each name representing a class of words occurring in the total
input text.
 A document will be indexed by a name if one of its significant
words occurs as a member of that class.
documents → Tokenization → stop word removal → stemming → Thesaurus → Index terms
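A minimal sketch of this pipeline in Python (the stopword list and the one-line stemmer are toy stand-ins for the real stages described in the following slides):

    import re

    STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}  # toy stop list

    def tokenize(text):
        # Lexical analysis: lowercase, keep unbroken alphabetic strings.
        return re.findall(r"[a-z]+", text.lower())

    def stem(word):
        # Toy stand-in for a real stemmer: strip a final 's' from longer words.
        return word[:-1] if word.endswith("s") and len(word) > 3 else word

    def document_representative(text):
        tokens = tokenize(text)                             # tokenization
        tokens = [t for t in tokens if t not in STOPWORDS]  # stop word removal
        stems = [stem(t) for t in tokens]                   # stemming
        return sorted(set(stems))                           # index terms (class names)

    print(document_representative("The cats sat in the living room."))
    # -> ['cat', 'living', 'room', 'sat']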
Lexical Analysis/Tokenization of Text
 Change the text of the documents into words to be adopted as index
terms.
 Objective - identify words in the text and decide how to handle:
 Digits, hyphens, punctuation marks, and the case of letters.
 Numbers are usually not good index terms (like 1910, 1999); but some,
e.g. 510 B.C., are unique and worth keeping.
 Hyphens – break up hyphenated words (e.g. state-of-the-art = state of the
art); but some words, e.g. gilt-edged, B-49, are unique terms which
require their hyphens.
 Punctuation marks – remove totally unless significant; e.g. in program
code, x.exe and xexe are different identifiers.
 Case of letters – usually not important, so convert all letters to upper or
lower case.
Tokenization

 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: Tokens (a token is an instance of a sequence of characters that
are grouped together as a useful semantic unit for processing):
 Friends
 Romans
 and
 Countrymen
 Each such token is now a candidate for an index entry, after
further processing.
 But what are valid tokens to emit?
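A minimal tokenizer along these lines in Python (a single regular expression; real tokenizers handle far more cases, as the next slides show):

    import re

    def tokenize(text):
        # Keep maximal runs of letters; everything else is a separator.
        return re.findall(r"[A-Za-z]+", text)

    print(tokenize("Friends, Romans and Countrymen"))
    # -> ['Friends', 'Romans', 'and', 'Countrymen']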
Issues in Tokenization

 One word or multiple: How do you decide whether it is one token or
two or more?
 Hewlett-Packard → Hewlett and Packard as two tokens?
 State-of-the-art: break up the hyphenated sequence?
 San Francisco, Los Angeles
 Addis Ababa, Arba Minch
 lowercase, lower-case, lower case?
 Data base, database, data-base
 Numbers:
 Dates (3/12/19 vs. Mar. 12, 2019);
 Phone numbers;
 IP addresses (100.2.86.144)
Issues in Tokenization

 How to handle special cases involving apostrophes, hyphens,
etc.? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a token.
 However, frequently they are not.
 The simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
 Generally, don’t index numbers as text, but they are often very useful.
 Systems will often index “meta-data”, including creation date, format, etc.,
separately.
 Issues of tokenization are language-specific.
 They require the language of the document to be known.
Exercise: Tokenization

 The cat slept peacefully in the living room. It’s a very old cat.

 Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
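One possible tokenization policy for the exercise, sketched in Python: keep internal apostrophes so that “It’s”, “O’Neill” and “aren’t” survive as single tokens (one policy among several defensible ones):

    import re

    # Letters, optionally joined by internal apostrophes (It's, O'Neill, aren't).
    PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

    sentences = [
        "The cat slept peacefully in the living room. It's a very old cat.",
        "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
    ]
    for sentence in sentences:
        # Note: the possessive in "boys'" loses its trailing apostrophe.
        print(PATTERN.findall(sentence))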
Elimination of Stopwords

 Stopwords are extremely common words across document collections that
have no discriminatory power.
 They may occur in 80% of the documents in a collection.
 They appear to be of little value in helping select documents matching a user
need, and need to be filtered out from the potential index terms.
 Examples of stopwords are articles, pronouns, prepositions, conjunctions,
etc.:
 Articles (a, an, the); pronouns (I, he, she, it, their, his);
 Some prepositions (on, of, in, about, besides, against, over);
 Conjunctions/connectors (and, but, for, nor, or, so, yet);
 Verbs (is, are, was, were);
 Adverbs (here, there, out, because, soon, after); and
 Adjectives (all, any, each, every, few, many, some) can also be treated as stopwords.
 Stopwords are language dependent.
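A minimal stopword filter in Python (the stop list here is a tiny illustrative subset; production systems use curated lists of a few hundred entries):

    STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
                 "and", "but", "or", "is", "are", "was", "were"}

    def remove_stopwords(tokens):
        # Compare case-insensitively against the stop list.
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords(["The", "cat", "slept", "in", "the", "room"]))
    # -> ['cat', 'slept', 'room']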
Stop words

 Intuition:
 Stopwords have little semantic content; it is typical to remove
such high-frequency words.
 Stopwords can take up about 50% of the text; hence, removing them
reduces the indexed document size by 30-50%.
 Smaller indices for information retrieval.
 Good compression techniques for indices: the 30 most common
words account for 30% of the tokens in written text.
 Better approximation of importance for classification, summarization,
etc.
How to determine a list of Stop words?

 One method: Sort terms (in decreasing order) by collection frequency
and take the most frequent ones (see the sketch below).
 Problem: In a collection about insurance practices, “insurance”
would become a stop word.
 Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
 Why do we need stop lists? With a stop list, we can exclude
the commonest words from the index terms entirely.
 With the removal of stopwords, we get a better
approximation of importance for classification, summarization, etc.
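A sketch of the first method in Python (toy documents; note how the domain word “insurance” lands on the list, which is exactly the problem mentioned above):

    from collections import Counter

    def stopword_candidates(documents, n=30):
        # Count term occurrences over the whole collection, then take the
        # n most frequent terms as stopword candidates.
        counts = Counter(term for doc in documents for term in doc.lower().split())
        return [term for term, _ in counts.most_common(n)]

    docs = ["the insurance policy covers the car",
            "the insurance claim was filed on time"]
    print(stopword_candidates(docs, n=3))
    # 'insurance' appears right after 'the'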
Stop words

 Stop word elimination used to be standard in older IR systems.
 But the trend is moving away from doing this. Most web search
engines index stop words:
 Good query optimization techniques mean you pay little at query
time for including stop words.
 You need stop words for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stop words might reduce recall (e.g. for “To be or not
to be”, everything is eliminated except “be”, giving no retrieval or irrelevant retrieval).
Stemming/Morphological Analysis

 Stemming reduces tokens to their “root” form to handle
morphological variation.
 The process involves removal of affixes (i.e. prefixes and suffixes) with
the aim of reducing variants to the same stem.
 It often removes the inflectional and derivational morphology of a word.
 Inflectional morphology: varies the form of words in order to express grammatical
features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
 Derivational morphology: makes new words from old ones. E.g. creation is formed
from create, but they are two separate words. And also, destruction → destroy.
 Stemming is language dependent:
 Correct stemming is language specific and can be complex.
 For example, “compressed and compression are both accepted”
stems to “compress and compress are both accept”.
Stemming

 Stemming is the process of reducing inflected (or sometimes derived)
words to their word stem.
 A stem: the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
 Example: ‘connect’ is the stem for {connected, connecting,
connection, connections}.
 Thus, [automate, automatic, automation] all reduce to →
automat
 A class name is assigned to a document if and only if one of its
members occurs as a significant word in the text of the document.
 A document representative then becomes a list of class names,
which are often referred to as the document’s index terms/keywords.
 Queries: Queries are handled in the same way.
Ways to implement stemming

 There are basically two ways to implement stemming.
 The first approach is to create a big dictionary that maps words to
their stems.
 The advantage of this approach is that it works perfectly (insofar as the
stem of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
 The second approach is to use a set of rules that extract stems from
words.
 The advantages of this approach are that the code is typically small, and
it can gracefully handle new words; the disadvantage is that it
occasionally makes mistakes.
 But, since stemming is imperfectly defined anyway, occasional mistakes
are tolerable, and the rule-based approach is the one that is generally
chosen.
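A sketch combining the two approaches in Python (the dictionary entries and the crude suffix rule are purely illustrative, not a real stemmer):

    # An exceptions dictionary handles irregular forms the rules would get wrong.
    STEM_DICT = {"ponies": "pony", "ran": "run"}

    def stem(word):
        # First approach: exact dictionary lookup (precise, but needs maintenance).
        if word in STEM_DICT:
            return STEM_DICT[word]
        # Second approach: suffix-stripping rules (compact, handles unseen words).
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) > 2:
                return word[:-len(suffix)]
        return word

    print(stem("ponies"), stem("connected"), stem("cats"))
    # -> pony connect cat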
Porter Stemmer

 Stemming is the operation of stripping the suffixes from a word, leaving
its stem.
 Google, for instance, uses stemming to search for web pages
containing the words connected, connecting, connection and
connections when users ask for a web page that contains the word
connect.
 In 1979, Martin Porter developed a stemming algorithm that uses a set
of rules to extract stems from words, and though it makes some
mistakes, most common words seem to work out right.
 Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer

 It is the most common algorithm for stemming English words to
their common grammatical root.
 It uses a simple procedure for removing known affixes in English
without using a dictionary. To get rid of plurals, the following
rules (step 1a) are used (a sketch follows below):
 SSES → SS     caresses → caress
 IES → I       ponies → poni
 SS → SS       caress → caress
 S → (nil)     cats → cat
 EMENT → (nil) (delete the final EMENT if what remains is longer than
1 character):
 replacement → replac
 cement → cement
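These plural rules transcribe almost directly into Python (a sketch of step 1a only; the full algorithm has several more steps and conditions):

    def porter_step_1a(word):
        # Rules are tried so that the longest matching suffix wins.
        if word.endswith("sses"):
            return word[:-2]      # SSES -> SS  (caresses -> caress)
        if word.endswith("ies"):
            return word[:-2]      # IES  -> I   (ponies -> poni)
        if word.endswith("ss"):
            return word           # SS   -> SS  (caress -> caress)
        if word.endswith("s"):
            return word[:-1]      # S    -> nil (cats -> cat)
        return word

    for w in ("caresses", "ponies", "caress", "cats"):
        print(w, "->", porter_step_1a(w))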
Porter stemmer

 While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:
 agreed → agree
 disabled → disable
 matting → mat
 mating → mate
 meeting → meet
 milling → mill
 messing → mess
 meetings → meet
 feed → feed
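The full algorithm is also available off the shelf; for example, NLTK ships a Porter implementation (install with pip install nltk):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["caresses", "ponies", "meeting", "feed", "connections"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress, ponies -> poni, meeting -> meet,
    # feed -> feed, connections -> connect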
Stemming: challenges

 May produce unusual stems that are not English words:
 Removing ‘UAL’ from FACTUAL and EQUAL.
 May conflate (reduce to the same token) words that are
actually distinct:
 “computer”, “computational”, “computation” are all reduced to the same
token “comput”.
 May not recognize all morphological derivations.
Assignment

 In a group of three, study Porter’s stemming algorithm in
detail and implement it using Python.
Thesauri

 Full-text searching mostly cannot be accurate, since different
authors may select different words to represent the same
concept.
 Problem: The same meaning can be expressed using different
terms that are synonyms.
 Synonym: a word or phrase which has the same or nearly the same
meaning as another word or phrase in the same language.
 Homonym: a word that sounds the same or is spelled the same as
another word but has a different meaning.
 How can we ensure that for the same meaning
identical terms are used in the index and the query?
Thesauri

 Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as “broader” and “related”) are made
explicit.
 A thesaurus contains terms and relationships between terms.
 IR thesauri typically rely upon the use of symbols such as:
USE/UF (UF = used for), BT (Broader Term), and RT (Related
Term) to demonstrate inter-term relationships.
 e.g., car = automobile, truck, bus, taxi, motor vehicle
 color = colour, paint
Aim of Thesaurus

 A thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms.
 The aim of a thesaurus is therefore:
 To provide a standard vocabulary for indexing and searching.
 Thesaurus rewriting forms equivalence classes, and we index such
equivalences:
 When the document contains automobile, index it under car as
well (usually, also vice-versa).
 To assist users with locating terms for proper query formulation:
when the query contains automobile, look under car as well when
expanding the query (see the sketch below).
 To provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs.
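A minimal sketch of this kind of query expansion in Python (the two-entry thesaurus is purely illustrative):

    # Each term maps to the other members of its equivalence class.
    THESAURUS = {
        "automobile": ["car", "motor vehicle"],
        "car": ["automobile", "motor vehicle"],
    }

    def expand_query(terms):
        expanded = list(terms)
        for term in terms:
            # When the query contains 'automobile', look under 'car' as well.
            expanded.extend(THESAURUS.get(term, []))
        return sorted(set(expanded))

    print(expand_query(["automobile", "rental"]))
    # -> ['automobile', 'car', 'motor vehicle', 'rental']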
Thesaurus Construction

 Example: a thesaurus built to assist IR for searching cars and
vehicles:
 Term: Motor vehicles:
 UF : Automobiles (UF - Use For)
 Cars
 Trucks
 BT: Vehicles (BT – Broader Term)
 RT: Road Engineering (RT - Related Term)
 Road Transport
More Example

 Example: a thesaurus built to assist IR in the field of computer
science:
 TERM: natural languages
 UF natural language processing (UF = used for)
 BT languages (BT = broader term is languages)
 NT languages (NT = narrower related term)
 TT languages (TT = top term is languages)
 RT artificial intelligence (RT = related term/s)
 computational linguistics
 formal languages
 query languages
 speech recognition
Language-specificity

 Many of the above features embody transformations that are:
 Language-specific and
 Often, application-specific.
 These are “plug-in” addenda to the indexing process.
 Both open source and commercial plug-ins are available for
handling these.
Index Term Selection

 An index language is the language used to describe documents
and requests.
 Elements of the index language are index terms, which may
be derived from the text of the document to be described, or
may be arrived at independently.
 If a full-text representation of the text is adopted, then all words
in the text are used as index terms = full-text indexing.
 Otherwise, one needs to select the words to be used as index terms
in order to reduce the size of the index file, which is basic to designing an
efficient IR search system.
Question & Answer

Thank You !!!
