Chapter 2 : Text Operations
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system & can be used
to select suitable term weights & other aspects of the system.
 A few words are very common.
 The 2 most frequent words (e.g. “the”, “of”) can account for about
10% of word occurrences.
 Most words are very rare.
 About half the words in a corpus appear only once; these are called
“hapax legomena” (Greek for “read only once”).
Statistical Properties of Text
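Both questions above can be explored with a simple frequency count. The following is a minimal sketch, not part of the original slides: the helper name `frequency_profile` and the tiny `sample` string are assumptions made for illustration, and a realistic corpus is needed before the 10% and 50% figures become visible.

```python
from collections import Counter
import re

def frequency_profile(text):
    """Return (counts, share of the top-2 words, fraction of words seen once)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Occurrences covered by the two most frequent words.
    top2 = sum(n for _, n in counts.most_common(2))
    # Words that occur exactly once (hapax legomena).
    hapax = [w for w, n in counts.items() if n == 1]
    return counts, top2 / len(tokens), len(hapax) / len(counts)

# Hypothetical toy text, far too small to be representative.
sample = "the cat sat on the mat and the dog sat on the log"
counts, top2_share, hapax_fraction = frequency_profile(sample)
print(counts.most_common(3))
print(round(top2_share, 2), round(hapax_fraction, 2))
```

Running the same function on growing prefixes of a corpus and recording `len(counts)` gives a rough picture of how vocabulary size grows with corpus size.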
Sample Word Frequency Data
 Not all words in a document are equally significant to represent the
contents/meanings of a document.
 Some words carry more meaning than others.
 Nouns are the most representative of a document’s content.
 Therefore, one needs to preprocess the text of the documents in a collection
to obtain the words to be used as index terms.
 Using the set of all words in a collection to index documents creates too
much noise for the retrieval task.
 Reducing noise means reducing the number of words which can be used to refer to the
document.
 Preprocessing is the process of controlling the size of the vocabulary or
the number of distinct words used as index terms.
 Preprocessing generally leads to an improvement in information retrieval
performance.
 However, some search engines on the Web omit preprocessing:
 every word in the document is an index term.
Text Operations
 Text operations are the process of transforming text into
logical representations.
 The main operations for selecting index terms, i.e. for choosing
words/stems (or groups of words) to be used as indexing terms,
are:
 Lexical analysis/tokenization of the text: handling digits, hyphens,
punctuation marks, and the case of letters.
 Elimination of stop words: filter out words which are not useful
in the retrieval process.
 Stemming words: remove affixes (prefixes and suffixes).
 Construction of term categorization structures such as a thesaurus,
to capture relationships that allow the expansion of the original
query with related terms.
Text Operations …
 Text Processing System:
 Input text:- full text, abstract or title.
 Output:- a document representative adequate for use in an
automatic retrieval system.
 The document representative consists of a list of class names,
each name representing a class of words occurring in the total
input text.
 A document will be indexed by a name if one of its significant
words occurs as a member of that class.
Generating Document Representatives
[Figure: documents → Tokenization → stop words → stemming → Thesaurus → Index terms]
 Change text of the documents into words to be adopted as index
terms.
 Objective - identify words in the text, deciding how to treat:
 Digits, hyphens, punctuation marks, case of letters.
 Numbers are usually not good index terms (like 1910, 1999), but some, like 510 B.C.,
are unique.
 Hyphens – break up the words (e.g. state-of-the-art = state of the
art), but some words, e.g. gilt-edged, B-49, are unique words which
require hyphens.
 Punctuation marks – remove totally unless significant, e.g. in program
code, x.exe and xexe are different.
 Case of letters – usually not important, so all letters can be converted to upper or
lower case.
Lexical Analysis/Tokenization of
Text
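The hyphen, punctuation and case decisions above can be sketched as a small tokenizer. This is only an illustration under stated assumptions: the exception set `KEEP_HYPHENATED` and the function name `tokenize` are hypothetical, and numbers are simply kept as they are.

```python
# Hypothetical exception list: hyphenated forms that stay single tokens.
KEEP_HYPHENATED = {"gilt-edged", "b-49"}

def tokenize(text):
    """Lowercase, trim punctuation, and split hyphens except protected words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")  # trim surrounding punctuation only
        if not word:
            continue
        if word in KEEP_HYPHENATED:
            tokens.append(word)          # unique word that requires its hyphen
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens

print(tokenize("State-of-the-art B-49 bombers, gilt-edged bonds."))
# ['state', 'of', 'the', 'art', 'b-49', 'bombers', 'gilt-edged', 'bonds']
```

Punctuation inside a word (as in x.exe) is left untouched here, which matches the rule of removing punctuation only when it is not significant.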
 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
 Friends
 Romans
 and
 Countrymen
 Each such token is now a candidate for an index entry, after
further processing.
 But what are valid tokens to emit?
Tokenization
 One word or multiple: How do you decide it is one token or
two or more?
 Hewlett-Packard → Hewlett and Packard as two tokens?
 State-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
 Addis Ababa, Arba Minch
 lowercase, lower-case, lower case ?
 Data base, database, data-base
 Numbers:
 Dates (3/12/19 vs. Mar. 12, 2019);
 Phone numbers,
 IP addresses (100.2.86.144)
Issues in Tokenization
 How to handle special cases involving apostrophes, hyphens
etc? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a token.
 However, frequently they are not.
 Simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
 Generally, numbers are not indexed as text, but they are often very useful.
 Systems will often index “meta-data”, including creation date, format, etc.,
separately.
 Issues of tokenization are language specific.
 Requires the language to be known.
Issues in Tokenization
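The “simplest approach” described above fits in one regular expression. A minimal sketch follows; the function name `simple_tokens` is an assumption, and the pattern only covers unaccented Latin letters, which is one concrete way in which tokenization is language specific.

```python
import re

def simple_tokens(text):
    """Keep only unbroken alphabetic strings, ignoring case."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("E-mail sent in 1999 about C++ to 100.2.86.144"))
# ['e', 'mail', 'sent', 'in', 'about', 'c', 'to']
```

The output shows the cost of the shortcut: “e-mail” is split in two, “C++” collapses to “c”, and the year and the IP address disappear entirely.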
 The cat slept peacefully in the living room. It’s a very old cat.
 Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
Exercise: Tokenization
 Stopwords are extremely common words across document collections that
have no discriminatory power.
 They may occur in 80% of the documents in a collection.
 They would appear to be of little value in helping select documents matching a user
need and need to be filtered out from potential index terms.
 Examples of stopwords are articles, pronouns, prepositions, conjunctions,
etc.:
 Articles (a, an, the); pronouns: (I, he, she, it, their, his)
 Some prepositions (on, of, in, about, besides, against, over),
 Conjunctions/connectors (and, but, for, nor, or, so, yet),
 Verbs (is, are, was, were),
 Adverbs (here, there, out, because, soon, after) and
 Adjectives (all, any, each, every, few, many, some) can also be treated as stopwords.
 Stopwords are language dependent.
Elimination of STOPWORD
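Filtering against a stop list is a single pass over the tokens. The sketch below assumes a very short, hand-picked `STOPWORDS` set for illustration; real English stop lists are much longer.

```python
# Hypothetical, deliberately tiny stop list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to", "it"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "the cat slept peacefully in the living room".split()
print(remove_stopwords(tokens))
# ['cat', 'slept', 'peacefully', 'living', 'room']
```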
 Intuition:
 Stopwords have little semantic content; it is typical to remove
such high-frequency words.
 Stopwords can take up as much as 50% of the text. Hence, removing them reduces document size
by 30-50%.
 Smaller indices for information retrieval.
 Good compression techniques for indices: The 30 most common
words account for 30% of the tokens in written text.
 Better approximation of importance for classification, summarization,
etc.
Stop words
 One method: Sort terms (in decreasing order) by collection frequency
and take the most frequent ones.
 Problem: In a collection about insurance practices, “insurance”
would be a stop word.
 Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
 Why do we need stop lists: With a stop list, we can exclude
the commonest words from the index terms entirely.
 With the removal of stopwords, we get a better
approximation of importance for classification, summarization, etc.
How to determine a list of Stop words?
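The frequency-based method, and its weakness, can be seen on a toy collection. This is a sketch under assumptions: the three `docs` strings are invented, and `frequency_stop_list` is a hypothetical helper name.

```python
from collections import Counter

def frequency_stop_list(documents, k):
    """Return the k terms with the highest collection frequency."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    return [term for term, _ in counts.most_common(k)]

# Invented mini-collection about insurance practices.
docs = [
    "the insurance policy covers the insurance claim",
    "the insurance premium is due",
    "the claim is for the insurance company",
]
print(frequency_stop_list(docs, 2))
# ['the', 'insurance']
```

Taking the top two terms puts “insurance” on the stop list next to “the”, which is exactly the problem noted above for domain-specific collections.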
 Stop word elimination used to be standard in older IR systems.
 But the trend is getting away from doing this. Most web search
engines index stop words:
 Good query optimization techniques mean you pay little at query
time for including stop words.
 You need stop words for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stop words might reduce recall (e.g. “To be or not
to be” – all eliminated except “be” – no or irrelevant retrieval)
Stop words
 Stemming reduces tokens to the “root” form of words, so that
morphological variants are recognized as the same term.
 The process involves removal of affixes (i.e. prefixes and suffixes)
with the aim of reducing variants to the same stem.
 Often removes inflectional and derivational morphology of a word.
 Inflectional morphology: vary the form of words in order to express grammatical
features, such as singular/plural or past/present tense. E.g. Boy → boys, cut → cutting.
 Derivational morphology: makes new words from old ones. E.g. creation is formed
from create, but they are two separate words. Likewise, destroy → destruction.
 Stemming is language dependent:
 Correct stemming is language specific and can be complex.
Stemming/Morphological Analysis
 Example, before stemming: “for example compressed and
compression are both accepted”
 After stemming: “for example compress and
compress are both accept”
 Stemming is the process of reducing inflected (or sometimes derived)
words to their word stem.
 A Stem: the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
 Example: ‘connect’ is the stem for {connected, connecting,
connection, connections}
 Thus, [automate, automatic, automation] all reduce to
automat
 A class name is assigned to a document if and only if one of its
members occurs as a significant word in the text of the document.
 A document representative then becomes a list of class names,
which are often referred to as the document’s index terms/keywords.
 Queries : Queries are handled in the same way.
Stemming
 There are basically two ways to implement stemming.
 The first approach is to create a big dictionary that maps words to
their stems.
 The advantage of this approach is that it works perfectly (insofar as the
stem of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
 The second approach is to use a set of rules that extract stems from
words.
 The advantages of this approach are that the code is typically small, and it
can gracefully handle new words; the disadvantage is that it occasionally
makes mistakes.
 But, since stemming is imperfectly defined, anyway, occasional mistakes
are tolerable, and the rule-based approach is the one that is generally
chosen.
Ways to implement stemming
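The two approaches can be combined in a few lines: look the word up first, then fall back to rules. The sketch below is illustrative only. `STEM_DICTIONARY` is a hypothetical hand-made table, and `SUFFIX_RULES` is a toy rule set that is not Porter's algorithm.

```python
# Hypothetical lookup table; a real one would hold many thousands of entries.
STEM_DICTIONARY = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
}

# Toy suffix rules, longest first. These are NOT Porter's rules.
SUFFIX_RULES = ["ions", "ion", "ing", "ed", "s"]

def stem(word):
    word = word.lower()
    if word in STEM_DICTIONARY:              # approach 1: dictionary lookup
        return STEM_DICTIONARY[word]
    for suffix in SUFFIX_RULES:              # approach 2: rule-based fallback
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("connections"), stem("selected"), stem("news"))
# connect select new
```

The rules handle the unseen word “selected” without any dictionary entry, but they also turn “news” into “new”, an instance of the occasional mistakes mentioned above.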
 Stemming is the operation of stripping the suffixes from a word, leaving
its stem.
 Google, for instance, uses stemming to search for web pages
containing the words connected, connecting, connection and
connections when users ask for a web page that contains the word
connect.
 In 1979, Martin Porter developed a stemming algorithm that uses a set
of rules to extract stems from words, and though it makes some
mistakes, most common words seem to work out right.
 Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter Stemmer
 It is the most common algorithm for stemming English words to
their common grammatical root.
 It uses a simple procedure for removing known affixes in English
without using a dictionary. To get rid of plurals the following
rules are used:
 SSES → SS (caresses → caress)
 IES → I (ponies → poni)
 SS → SS (caress → caress)
 S → (nil) (cats → cat)
 EMENT → (nil) (delete final “ement” if what remains is long enough;
Porter’s exact condition is a measure m > 1)
 replacement → replac
 cement → cement
Porter stemmer
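The four plural rules translate directly into an ordered chain of suffix tests in which the first matching rule wins. A minimal sketch of this one step follows; the function name `step_1a` is an assumption, and the EMENT rule is left out because it belongs to a later step.

```python
def step_1a(word):
    """Porter step 1a: plural endings, longest matching suffix first."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES  -> I
    if word.endswith("ss"):
        return word           # SS   -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]      # S    -> (nil)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

The order matters: testing for the final “s” before “ss” would wrongly turn “caress” into “cares”.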
 While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:
 agreed → agree
 disabled → disable
 matting → mat
 mating → mate
 meeting → meet
 milling → mill
 messing → mess
 meetings → meet
 feed → feed
Porter stemmer
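Step 1b depends on Porter's notion of the measure m (the number of vowel-consonant sequences in a stem) and on a short clean-up after the suffix is removed. The sketch below follows the published rules for this one step as far as the author of this note understands them; all function names are assumptions, and it is not a complete stemmer. “meetings” needs step 1a first to become “meeting”.

```python
def is_consonant(word, i):
    """Not a, e, i, o, u; 'y' is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: the number of vowel-consonant sequences in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(w):
    return len(w) >= 2 and w[-1] == w[-2] and is_consonant(w, len(w) - 1)

def ends_cvc(w):
    """Consonant-vowel-consonant ending whose last letter is not w, x or y."""
    n = len(w)
    return (n >= 3 and is_consonant(w, n - 3) and not is_consonant(w, n - 2)
            and is_consonant(w, n - 1) and w[-1] not in "wxy")

def step_1b(word):
    if word.endswith("eed"):                      # (m>0) EED -> EE
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):                  # (*v*) ED / ING -> (nil)
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not has_vowel(stem):
                return word
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"                 # mating -> mat -> mate
            if ends_double_consonant(stem) and stem[-1] not in "lsz":
                return stem[:-1]                  # matting -> matt -> mat
            if measure(stem) == 1 and ends_cvc(stem):
                return stem + "e"                 # filing -> fil -> file
            return stem
    return word

for w in ["agreed", "disabled", "matting", "mating", "meeting", "feed"]:
    print(w, "->", step_1b(w))
```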
 May produce unusual stems that are not English words:
 Removing ‘UAL’ from FACTUAL and EQUAL
 May conflate (reduce to the same token) words that are actually
distinct.
 “computer”, “computational”, “computation” all reduced to same
token “comput”
 May not recognize all morphological derivations.
Stemming: challenges
 In groups of three, study Porter’s stemming algorithm in
detail and implement it using Python.
Assignment
 Full-text searching often cannot be accurate, since different
authors may select different words to represent the same
concept.
 Problem: The same meaning can be expressed using different
terms that are synonyms, and the same term can carry different meanings (homonyms).
 Synonym: a word or phrase which has the same or nearly the same
meaning as another word or phrase in the same language.
 Homonym: a word that sounds the same or is spelled the same as
another word but has a different meaning.
 How can it be achieved that, for the same meaning,
identical terms are used in the index and the query?
Thesauri
 Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as “broader” and “related”) are made
explicit.
 A thesaurus contains terms and relationships between terms.
 IR thesauri typically rely upon the use of symbols such as:
USE/UF (UF = Used For), BT (Broader Term), and RT (Related
Term) to demonstrate inter-term relationships.
 e.g., car = automobile, truck, bus, taxi, motor vehicle
 color = colour, paint
Thesauri
 A thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms.
 The aim of a thesaurus is therefore:
 To provide a standard vocabulary for indexing and searching.
 Terms are rewritten to form equivalence classes, and we index such
equivalences.
 When the document contains automobile, index it under car as
well (usually, also vice-versa)
 To assist users with locating terms for proper query formulation:
When the query contains automobile, look under car as well for
expanding query.
 To provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs.
Aim of Thesaurus
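Query expansion with equivalence classes can be sketched as a set union. The classes below reuse the car and colour examples from the slides, but the data structure `EQUIVALENCE_CLASSES` and the function `expand_query` are assumptions made for illustration.

```python
# Hypothetical equivalence classes built from the slide examples.
EQUIVALENCE_CLASSES = [
    {"car", "automobile", "motor vehicle"},
    {"color", "colour"},
]

def expand_query(terms):
    """Add every term that shares an equivalence class with a query term."""
    expanded = set(terms)
    for term in terms:
        for cls in EQUIVALENCE_CLASSES:
            if term in cls:
                expanded |= cls
    return sorted(expanded)

print(expand_query(["automobile", "insurance"]))
# ['automobile', 'car', 'insurance', 'motor vehicle']
```

The same lookup can be applied at indexing time instead, so that a document containing “automobile” is also indexed under “car”.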
 Example: thesaurus built to assist IR for searching cars and
vehicles :
 Term: Motor vehicles
 UF: Automobiles (UF – Used For)
 Cars
 Trucks
 BT: Vehicles (BT – Broader Term)
 RT: Road Engineering (RT – Related Term)
 Road Transport
Thesaurus Construction
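The motor vehicles entry above maps naturally onto a small record type. This is one possible representation, not a standard format: the class `ThesaurusEntry` and the helpers `preferred_term` and `broaden` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    term: str
    used_for: list = field(default_factory=list)   # UF: non-preferred terms
    broader: list = field(default_factory=list)    # BT: broader terms
    related: list = field(default_factory=list)    # RT: related terms

# The entry from the slide, lowercased for matching.
THESAURUS = {
    "motor vehicles": ThesaurusEntry(
        term="motor vehicles",
        used_for=["automobiles", "cars", "trucks"],
        broader=["vehicles"],
        related=["road engineering", "road transport"],
    )
}

def preferred_term(word):
    """Map a non-preferred (UF) word to its preferred term, if any."""
    for entry in THESAURUS.values():
        if word == entry.term or word in entry.used_for:
            return entry.term
    return word

def broaden(word):
    """Return the broader terms (BT) for a word, or an empty list."""
    entry = THESAURUS.get(preferred_term(word))
    return entry.broader if entry else []

print(preferred_term("cars"), broaden("cars"))
# motor vehicles ['vehicles']
```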
 Example: thesaurus built to assist IR in the fields of computer
science:
 TERM: natural languages
 UF natural language processing (UF = used for; NLP)
 BT languages (BT = broader term)
 NT languages (NT = narrower term)
 TT languages (TT = top term)
 RT artificial intelligence (RT = related term/s)
 computational linguistics
 formal languages
 query languages
 speech recognition
More Example
 Many of the above features embody transformations that are:
 Language-specific and
 Often, application-specific.
 These are “plug-in” addenda to the indexing process.
 Both open source and commercial plug-ins are available for
handling these.
Language-specificity
 Index language is the language used to describe documents
and requests.
 Elements of the index language are index terms which may be
derived from the text of the document to be described, or may
be arrived at independently.
 If a full text representation of the text is adopted, then all words
in the text are used as index terms = full text indexing.
 Otherwise, one needs to select the words to be used as index terms,
which reduces the size of the index file and is basic to designing an
efficient IR system.
Index Term Selection
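The effect of term selection on index size can be seen with a tiny inverted index. This is a sketch under assumptions: the two documents and the short `STOPWORDS` set are invented, and stop word removal stands in for the full selection process.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "in", "a", "is", "and", "to"}  # hypothetical short list

def build_index(docs, select_terms=False):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if select_terms and token in STOPWORDS:
                continue                  # skip words not chosen as index terms
            index[token].add(doc_id)
    return index

docs = {1: "the growth of the index", 2: "the size of a vocabulary"}
full = build_index(docs)                       # full text indexing
selected = build_index(docs, select_terms=True)
print(len(full), len(selected))
# 7 4
```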
Question & Answer
4/9/2024 31
Thank You !!!
Phoenix – A Collaborative Renewal of Children’s and Young People’s Services C...
Library Association of Ireland
 
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsepulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
sushreesangita003
 
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 AccountingHow to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
How to Customize Your Financial Reports & Tax Reports With Odoo 17 Accounting
Celine George
 
apa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdfapa-style-referencing-visual-guide-2025.pdf
apa-style-referencing-visual-guide-2025.pdf
Ishika Ghosh
 
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - WorksheetCBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
CBSE - Grade 8 - Science - Chemistry - Metals and Non Metals - Worksheet
Sritoma Majumder
 
Exploring-Substances-Acidic-Basic-and-Neutral.pdf
Exploring-Substances-Acidic-Basic-and-Neutral.pdfExploring-Substances-Acidic-Basic-and-Neutral.pdf
Exploring-Substances-Acidic-Basic-and-Neutral.pdf
Sandeep Swamy
 
Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025Stein, Hunt, Green letter to Congress April 2025
Stein, Hunt, Green letter to Congress April 2025
Mebane Rash
 
Introduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe EngineeringIntroduction to Vibe Coding and Vibe Engineering
Introduction to Vibe Coding and Vibe Engineering
Damian T. Gordon
 
One Hot encoding a revolution in Machine learning
One Hot encoding a revolution in Machine learningOne Hot encoding a revolution in Machine learning
One Hot encoding a revolution in Machine learning
momer9505
 
Social Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy StudentsSocial Problem-Unemployment .pptx notes for Physiotherapy Students
Social Problem-Unemployment .pptx notes for Physiotherapy Students
DrNidhiAgarwal
 
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Niamh Lucey, Mary Dunne. Health Sciences Libraries Group (LAI). Lighting the ...
Library Association of Ireland
 
New Microsoft PowerPoint Presentation.pptx
New Microsoft PowerPoint Presentation.pptxNew Microsoft PowerPoint Presentation.pptx
New Microsoft PowerPoint Presentation.pptx
milanasargsyan5
 
How to Subscribe Newsletter From Odoo 18 Website
How to Subscribe Newsletter From Odoo 18 WebsiteHow to Subscribe Newsletter From Odoo 18 Website
How to Subscribe Newsletter From Odoo 18 Website
Celine George
 
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools
K12 Tableau Tuesday  - Algebra Equity and Access in Atlanta Public SchoolsK12 Tableau Tuesday  - Algebra Equity and Access in Atlanta Public Schools
K12 Tableau Tuesday - Algebra Equity and Access in Atlanta Public Schools
dogden2
 
2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx2541William_McCollough_DigitalDetox.docx
2541William_McCollough_DigitalDetox.docx
contactwilliamm2546
 
Ad

Information retrieval chapter 2-Text Operations.ppt

  • 1. Chapter 2: Text Operations
      Adama Science and Technology University
      School of Electrical Engineering and Computing
      Department of CSE
      Dr. Mesfin Abebe Haile (2024)
  • 2. Statistical Properties of Text
      • How is the frequency of different words distributed?
      • How fast does vocabulary size grow with the size of a corpus?
      • Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
      • A few words are very common.
          • The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
      • Most words are very rare.
          • About half the words in a corpus appear only once; these are called hapax legomena (“read only once”).
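The frequency claims on this slide can be checked on any corpus by counting tokens. The following is a minimal sketch, assuming a toy sentence and a crude letters-only tokenizer; the function name `frequency_profile` and the sample text are illustrative and do not come from the slides.

```python
from collections import Counter
import re

def frequency_profile(text):
    """Return (counts, share of occurrences taken by the top-2 words,
    share of the vocabulary that occurs exactly once)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    top2 = sum(n for _, n in counts.most_common(2))
    once = sum(1 for n in counts.values() if n == 1)
    return counts, top2 / len(tokens), once / len(counts)

# Toy corpus (an assumption for illustration only).
sample = "the cat sat on the mat and the dog sat by the door of the house"
counts, top2_share, once_share = frequency_profile(sample)
print(counts.most_common(3))
print(f"top-2 share of occurrences: {top2_share:.2f}")
print(f"share of vocabulary seen once: {once_share:.2f}")
```

On a text this small the shares are far from the corpus-level figures quoted above; the point is only the counting procedure, which can be rerun on a real collection.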
  • 4. Text Operations
      • Not all words in a document are equally significant for representing the contents/meaning of a document.
          • Some words carry more meaning than others.
          • Nouns are the most representative of a document's content.
      • Therefore, one needs to preprocess the text of the documents in a collection before its words are used as index terms.
      • Using the set of all words in a collection to index documents creates too much noise for the retrieval task.
          • Reducing noise means reducing the number of words that can be used to refer to the document.
      • Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms.
          • Preprocessing will lead to an improvement in information retrieval performance.
      • However, some search engines on the Web omit preprocessing.
          • In that case every word in the document is an index term.
  • 5. Text Operations …
      • Text operations are the transformations of text into logical representations.
      • The main operations for selecting index terms, i.e. for choosing the words/stems (or groups of words) to be used as indexing terms, are:
          • Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters.
          • Elimination of stop words: filter out words which are not useful in the retrieval process.
          • Stemming words: remove affixes (prefixes and suffixes).
          • Construction of term categorization structures such as a thesaurus, to capture relationships that allow the original query to be expanded with related terms.
  • 6. Generating Document Representatives
      • Text processing system:
          • Input text: full text, abstract or title.
          • Output: a document representative adequate for use in an automatic retrieval system.
      • The document representative consists of a list of class names, each name representing a class of words occurring in the total input text.
      • A document will be indexed by a name if one of its significant words occurs as a member of that class.
      [Figure: processing pipeline with stages labelled documents, tokenization, stop words, stemming, thesaurus, index terms]
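The text processing system described on this slide can be sketched as a chain of small functions. This is a minimal sketch, assuming the stages run in the order tokenization, stop-word removal, stemming; the stop list is a toy list and `crude_stem` is a hypothetical placeholder that strips a few suffixes, not a real stemming algorithm.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}  # assumed toy list

def tokenize(text):
    """Lowercase the text and keep unbroken runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Placeholder stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, drop stop words, stem, and return the sorted distinct terms."""
    stems = (crude_stem(t) for t in remove_stop_words(tokenize(text)))
    return sorted(set(stems))

print(index_terms("The indexing of documents is connected to retrieval"))
```

Each stage takes and returns plain Python lists, so a stage can be swapped out (for example, replacing `crude_stem` with a proper stemmer) without touching the others.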
  • 7. Lexical Analysis/Tokenization of Text
      • Change the text of the documents into words to be adopted as index terms.
      • Objective: identify words in the text, deciding how to treat digits, hyphens, punctuation marks and the case of letters.
          • Numbers are usually not good index terms (like 1910, 1999), but some, such as 510 B.C., are unique.
          • Hyphens: break up the words (e.g. state-of-the-art = state of the art), but some words (e.g. gilt-edged, B-49) are unique words which require their hyphens.
          • Punctuation marks: remove totally unless significant, e.g. in program code x.exe and xexe are different.
          • Case of letters: usually not important, so all text can be converted to upper or lower case.
  • 8. Tokenization
      • Analyze text into a sequence of discrete tokens (words).
      • Input: “Friends, Romans and Countrymen”
      • Output: tokens (a token is an instance of a sequence of characters grouped together as a useful semantic unit for processing):
          • Friends
          • Romans
          • and
          • Countrymen
      • Each such token is now a candidate for an index entry, after further processing.
      • But what are valid tokens to emit?
  • 9. Issues in Tokenization
      • One word or multiple: how do you decide whether it is one token or two or more?
          • Hewlett-Packard: Hewlett and Packard as two tokens?
          • State-of-the-art: break up the hyphenated sequence.
          • San Francisco, Los Angeles
          • Addis Ababa, Arba Minch
          • lowercase, lower-case, lower case?
          • Data base, database, data-base
      • Numbers:
          • Dates (3/12/19 vs. Mar. 12, 2019)
          • Phone numbers
          • IP addresses (100.2.86.144)
  • 10. Issues in Tokenization
      • How should special cases involving apostrophes, hyphens, etc. be handled? C++, C#, URLs, emails, …
      • Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
          • However, frequently they are not.
      • The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
          • Generally, numbers are not indexed as text, although they are often very useful.
          • “Meta-data”, including creation date, format, etc., is often indexed separately.
      • Issues of tokenization are language specific.
          • They require the language to be known.
  • 11. Exercise: Tokenization
      • The cat slept peacefully in the living room. It’s a very old cat.
      • Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
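The “simplest approach” from the tokenization slides (case-insensitive unbroken strings of alphabetic characters) can be compared with a variant that keeps apostrophes and hyphens inside a word. Both functions below are illustrative sketches, not an answer key for the exercise; the function names are assumptions, and the sample sentence uses straight apostrophes for simplicity.

```python
import re

def simple_tokens(text):
    """Case-insensitive runs of alphabetic characters only."""
    return re.findall(r"[a-z]+", text.lower())

def apostrophe_aware_tokens(text):
    """Like simple_tokens, but keeps apostrophes and hyphens that sit
    between letters, so "aren't" and "state-of-the-art" stay whole."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

sentence = ("Mr. O'Neill thinks that the boys' stories about "
            "Chile's capital aren't amusing.")
print(simple_tokens(sentence))
print(apostrophe_aware_tokens(sentence))
```

The first tokenizer splits `O'Neill` into `o` and `neill` and `aren't` into `aren` and `t`; the second keeps them whole but still drops the trailing apostrophe of `boys'`. Neither choice is right in general, which is the point of the exercise.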
  • 12. Elimination of Stopwords
      • Stopwords are extremely common words across document collections that have no discriminatory power.
          • They may occur in 80% of the documents in a collection.
          • They are of little value in helping to select documents matching a user need and should be filtered out from the potential index terms.
      • Examples of stopwords are articles, pronouns, prepositions, conjunctions, etc.:
          • Articles (a, an, the); pronouns (I, he, she, it, their, his)
          • Some prepositions (on, of, in, about, besides, against, over)
          • Conjunctions/connectors (and, but, for, nor, or, so, yet)
          • Verbs (is, are, was, were)
          • Adverbs (here, there, out, because, soon, after) and
          • Adjectives (all, any, each, every, few, many, some) can also be treated as stopwords.
      • Stopwords are language dependent.
  • 13. Stop words
      • Intuition:
          • Stopwords have little semantic content; it is typical to remove such high-frequency words.
          • Stopwords can take up as much as 50% of the text, so removing them reduces document size by 30-50%.
      • Smaller indices for information retrieval.
          • Good compression techniques for indices: the 30 most common words account for 30% of the tokens in written text.
      • Better approximation of importance for classification, summarization, etc.
  • 14. How to determine a list of stop words?
      • One method: sort terms (in decreasing order) by collection frequency and take the most frequent ones.
          • Problem: in a collection about insurance practices, “insurance” would be a stop word.
      • Another method: build a stop word list that contains a set of articles, pronouns, etc.
          • Why do we need stop lists? With a stop list, we can exclude the commonest words from the index terms entirely.
      • With stopwords removed, we get a better approximation of importance for classification, summarization, etc.
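The first method, and its problem, can be reproduced in a few lines. This is a minimal sketch, assuming whitespace tokenization and a tiny made-up collection about insurance; `frequency_stop_list` is a hypothetical helper name.

```python
from collections import Counter

def frequency_stop_list(documents, k):
    """Return the k terms with the highest collection frequency."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [term for term, _ in counts.most_common(k)]

# Toy collection about insurance (an assumption for illustration only).
docs = [
    "the insurance policy covers the insurance claim",
    "the insurance premium of the policy for the year",
    "insurance fraud and the insurance market",
]
print(frequency_stop_list(docs, 2))
```

With `k = 2` the list contains “the” and also “insurance”, the domain term the slide warns about. A hand-built stop list avoids this, at the cost of maintaining the list per language.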
  • 15. Stop words
      • Stop word elimination used to be standard in older IR systems.
      • But the trend is moving away from this. Most web search engines index stop words:
          • Good query optimization techniques mean you pay little at query time for including stop words.
      • You need stop words for:
          • Phrase queries: “King of Denmark”
          • Various song titles, etc.: “Let it be”, “To be or not to be”
          • “Relational” queries: “flights to London”
      • Elimination of stop words might reduce recall (e.g. in “To be or not to be” everything is eliminated except “be”, so retrieval returns nothing or irrelevant results).
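The loss of information in such queries is easy to see by running them through a stop-word filter. A minimal sketch, assuming a small hand-picked stop list (chosen here so that “be” survives, as in the slide's example) and whitespace tokenization:

```python
# Assumed toy stop list; real lists are longer and language specific.
STOP_WORDS = {"a", "an", "the", "to", "from", "or", "not", "of", "it", "is"}

def strip_stop_words(query):
    """Lowercase a query, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

for q in ["To be or not to be", "King of Denmark",
          "flights to London", "flights from London"]:
    print(q, "->", strip_stop_words(q))
```

After filtering, “flights to London” and “flights from London” become the same query, and “To be or not to be” collapses to two copies of “be”.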
  • 16. Stemming/Morphological Analysis
      • Stemming reduces tokens to the “root” form of words in order to recognize morphological variation.
          • The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem.
          • It often removes the inflectional and derivational morphology of a word.
      • Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
      • Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words; likewise destruction → destroy.
      • Stemming is language dependent:
          • Correct stemming is language specific and can be complex.
      • Example: the text “for example compressed and compression are both accepted” becomes “for example compress and compress are both accept” after stemming.
  • 17. Stemming
      • Stemming is the process of reducing inflected (or sometimes derived) words to their word stem.
      • A stem: the portion of a word which is left after the removal of its affixes (i.e. prefixes and/or suffixes).
      • Example: ‘connect’ is the stem for {connected, connecting, connection, connections}.
      • Thus, [automate, automatic, automation] all reduce to automat.
      • A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
      • A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
      • Queries: queries are handled in the same way.
  • 18. Ways to implement stemming
      • There are basically two ways to implement stemming.
      • The first approach is to create a big dictionary that maps words to their stems.
          • The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear.
      • The second approach is to use a set of rules that extract stems from words.
          • The advantages of this approach are that the code is typically small and that it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes.
          • But since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
  • 19. Porter Stemmer
      • Stemming is the operation of stripping the suffixes from a word, leaving its stem.
          • Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
      • In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words; though it makes some mistakes, most common words seem to work out right.
          • Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
  • 20. Porter stemmer
      • It is the most common algorithm for stemming English words to their common grammatical root.
      • It uses a simple procedure for removing known affixes in English without using a dictionary.
      • To get rid of plurals, the following rules are used (step 1a):
          SSES → SS       caresses → caress
          IES  → I        ponies → poni
          SS   → SS       caress → caress
          S    → (nil)    cats → cat
      • A rule from a later step deletes a final EMENT if what remains is longer than 1 character:
          EMENT → (nil)   replacement → replac
                          cement → cement
  • 21. Porter stemmer
      • While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:
          agreed → agree
          disabled → disable
          matting → mat
          mating → mate
          meeting → meet
          milling → mill
          messing → mess
          meetings → meet
          feed → feed
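Steps 1a and 1b are small enough to write out in full. The sketch below follows the rules as given in Porter's published description of the algorithm, but it covers only these two steps (so it is not a complete Porter stemmer, and rules such as EMENT are left out); all function names are illustrative.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(word):
    return (len(word) >= 2 and word[-1] == word[-2]
            and is_consonant(word, len(word) - 1))

def ends_cvc(word):
    """Consonant-vowel-consonant ending whose last letter is not w, x or y."""
    n = len(word)
    return (n >= 3 and is_consonant(word, n - 3) and not is_consonant(word, n - 2)
            and is_consonant(word, n - 1) and word[-1] not in "wxy")

def step1a(word):
    """Plurals: SSES -> SS, IES -> I, SS -> SS, S -> (nil)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def step1b(word):
    """Remove -eed/-ed/-ing and tidy up the remaining stem."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and has_vowel(stem):
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"            # mat(ing) -> mate
            if ends_double_consonant(stem) and stem[-1] not in "lsz":
                return stem[:-1]             # matt(ing) -> mat
            if measure(stem) == 1 and ends_cvc(stem):
                return stem + "e"            # fil(ing) -> file
            return stem
    return word

def stem_1ab(word):
    return step1b(step1a(word.lower()))

for w in ["caresses", "ponies", "cats", "agreed", "disabled", "matting",
          "mating", "meeting", "milling", "messing", "meetings", "feed"]:
    print(w, "->", stem_1ab(w))
```

Run on the words from the two slides above, the sketch reproduces the listed stems; for instance `meetings` first loses its plural `s` in step 1a and then its `-ing` in step 1b.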
  • 22. Stemming: challenges
      • May produce unusual stems that are not English words:
          • Removing ‘UAL’ from FACTUAL and EQUAL
      • May conflate (reduce to the same token) words that are actually distinct.
          • “computer”, “computational”, “computation” are all reduced to the same token “comput”.
      • May not recognize all morphological derivations.
  • 23. Assignment
      • In groups of three, study Porter's stemming algorithm in detail and implement it in Python.
  • 24. Thesauri
      • Full-text searching often cannot be accurate, since different authors may select different words to represent the same concept.
          • Problem: the same meaning can be expressed using different terms that are synonyms, and the same term can carry different meanings (homonyms).
          • Synonym: a word or phrase which has the same or nearly the same meaning as another word or phrase in the same language.
          • Homonym: a word that sounds the same or is spelled the same as another word but has a different meaning.
      • How can we ensure that, for the same meaning, identical terms are used in the index and in the query?
  • 25. Thesauri
      • Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example “broader” and “related”) are made explicit.
      • A thesaurus contains terms and relationships between terms.
          • IR thesauri typically rely on symbols such as USE/UF (UF = Used For), BT (Broader Term), and RT (Related Term) to show inter-term relationships.
          • e.g. car = automobile, truck, bus, taxi, motor vehicle
          • e.g. color = colour, paint
  • 26. Aim of Thesaurus
      • A thesaurus tries to control the use of the vocabulary by showing a set of related words, in order to handle synonyms and homonyms.
      • The aims of a thesaurus are therefore:
          • To provide a standard vocabulary for indexing and searching.
              • Terms are rewritten to form equivalence classes, and these equivalence classes are indexed.
              • When the document contains automobile, index it under car as well (and usually vice versa).
          • To assist users with locating terms for proper query formulation: when the query contains automobile, look under car as well in order to expand the query.
          • To provide classified hierarchies that allow the current request to be broadened or narrowed according to user needs.
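Equivalence classes and query expansion can be sketched with a plain dictionary that maps each variant to a preferred term. The mapping below is a hypothetical toy built from the car/automobile and color/colour examples; the names `PREFERRED`, `normalize` and `expand_query` are assumptions.

```python
# Hypothetical equivalence classes: each variant maps to one preferred term.
PREFERRED = {
    "automobile": "car",
    "automobiles": "car",
    "cars": "car",
    "colour": "color",
}

def normalize(term):
    """Map a term to its preferred form, or return it unchanged (lowercased)."""
    term = term.lower()
    return PREFERRED.get(term, term)

def expand_query(terms):
    """Return each query term's preferred form plus all variants in its class."""
    expanded = set()
    for term in terms:
        preferred = normalize(term)
        expanded.add(preferred)
        expanded.update(v for v, p in PREFERRED.items() if p == preferred)
    return sorted(expanded)

print(normalize("Automobile"))
print(expand_query(["automobile", "price"]))
```

`normalize` is what an indexer would apply so that a document containing “automobile” is filed under “car”; `expand_query` does the complementary job at query time.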
  • 27. Thesaurus Construction
      • Example: a thesaurus built to assist IR when searching for cars and vehicles:
          Term: Motor vehicles
              UF: Automobiles (UF = Used For)
                  Cars
                  Trucks
              BT: Vehicles (BT = Broader Term)
              RT: Road Engineering (RT = Related Term)
                  Road Transport
  • 28. More Example
      • Example: a thesaurus built to assist IR in the field of computer science:
          Term: natural languages
              UF: natural language processing (UF = Used For; NLP)
              BT: languages (BT = Broader Term)
              NT: languages (NT = Narrower Term)
              TT: languages (TT = Top Term)
              RT: artificial intelligence (RT = Related Term/s)
                  computational linguistics
                  formal languages
                  query languages
                  speech recognition
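A thesaurus entry with UF, BT and RT relations maps naturally onto a small record type. The sketch below encodes the “Motor vehicles” example from the slides; the class `ThesaurusEntry`, the `ENTRIES` dictionary and the helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    term: str
    used_for: list = field(default_factory=list)  # UF: non-preferred synonyms
    broader: list = field(default_factory=list)   # BT: broader terms
    related: list = field(default_factory=list)   # RT: related terms

# The "Motor vehicles" entry from the slides, lowercased for matching.
ENTRIES = {
    "motor vehicles": ThesaurusEntry(
        term="motor vehicles",
        used_for=["automobiles", "cars", "trucks"],
        broader=["vehicles"],
        related=["road engineering", "road transport"],
    ),
}

def preferred_term(word):
    """Resolve a word to the preferred term whose entry covers it, if any."""
    word = word.lower()
    for entry in ENTRIES.values():
        if word == entry.term or word in entry.used_for:
            return entry.term
    return None

def broaden(word):
    """Return the broader terms (BT) of the entry that covers the word."""
    term = preferred_term(word)
    return ENTRIES[term].broader if term else []

print(preferred_term("Cars"))
print(broaden("trucks"))
```

Here `preferred_term` plays the role of the USE/UF relation (a query for “cars” is directed to “motor vehicles”), and `broaden` follows BT to widen a request that returns too little.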
  • 29. Language-specificity
      • Many of the above features embody transformations that are:
          • Language-specific and
          • Often, application-specific.
      • These are “plug-in” addenda to the indexing process.
      • Both open source and commercial plug-ins are available for handling these.
  • 30. Index Term Selection
      • The index language is the language used to describe documents and requests.
      • The elements of the index language are index terms, which may be derived from the text of the document to be described or may be arrived at independently.
          • If a full-text representation of the text is adopted, then all words in the text are used as index terms (full-text indexing).
          • Otherwise, the words to be used as index terms must be selected in order to reduce the size of the index file, which is basic to designing an efficient searching IR system.