
Chapter 2: Text Operations

Adama Science and Technology University


School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Statistical Properties of Text

 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system and can be used
to select suitable term weights and other aspects of the system.

 A few words are very common.
 The 2 most frequent words (e.g. “the”, “of”) can account for about
10% of word occurrences.
 Most words are very rare.
 Half the words in a corpus appear only once; such words are called
hapax legomena (Greek for “read only once”).
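As a quick illustration, here is a minimal Python sketch (standard library only; the file name corpus.txt is a hypothetical stand-in for any plain-text collection) that measures both properties:

    from collections import Counter
    import re

    # Hypothetical corpus file; any plain-text collection would do.
    text = open("corpus.txt", encoding="utf-8").read().lower()
    words = re.findall(r"[a-z]+", text)

    freq = Counter(words)
    total = sum(freq.values())

    # Share of all word occurrences covered by the 2 most frequent words.
    top2 = sum(count for _, count in freq.most_common(2))
    print(f"Top 2 words cover {100 * top2 / total:.1f}% of occurrences")

    # Fraction of the vocabulary that occurs exactly once (hapax legomena).
    hapax = sum(1 for count in freq.values() if count == 1)
    print(f"{100 * hapax / len(freq):.1f}% of distinct words appear only once")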
Sample Word Frequency Data
Text Operations
 Not all words in a document are equally significant in representing the
contents/meaning of the document.
 Some words carry more meaning than others.
 Nouns tend to be the most representative of a document’s content.
 Therefore, one needs to preprocess the text of the documents in a collection
to select the index terms.
 Using the set of all words in a collection to index documents creates too
much noise for the retrieval task.
 Reducing noise means reducing the number of words which can be used to
refer to the document.
 Preprocessing is the process of controlling the size of the vocabulary, i.e.
the number of distinct words used as index terms.
 Preprocessing usually leads to an improvement in information retrieval
performance.
 However, some search engines on the Web omit preprocessing:
 Every word in the document is then an index term.
Text Operations …

 Text operations are the process of transforming text into
logical representations.
 The main operations for selecting index terms, i.e. for choosing the
words/stems (or groups of words) to be used as indexing terms, are:
 Lexical analysis/tokenization of the text:- handling digits, hyphens,
punctuation marks, and the case of letters.
 Elimination of stop words:- filtering out words which are not useful
in the retrieval process.
 Stemming words:- removing affixes (prefixes and suffixes).
 Construction of term categorization structures such as a thesaurus,
to capture relationships that allow expanding the original
query with related terms.
Generating Document Representatives
 Text Processing System:
 Input text:- full text, abstract or title.
 Output:- a document representative adequate for use in an
automatic retrieval system.
 The document representative consists of a list of class names,
each name representing a class of words occurring in the total
input text.
 A document will be indexed by a name if one of its significant
words occurs as a member of that class.
documents → Tokenization → stop word removal → stemming → Thesaurus → Index terms
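A minimal sketch of this pipeline in Python (the stopword list and the one-line stemmer are toy stand-ins for the real stages described in the following slides):

    import re

    STOPWORDS = {"the", "of", "and", "a", "an", "in", "to", "is"}  # toy stop list

    def tokenize(text):
        # Lexical analysis: lowercase, keep unbroken alphabetic strings.
        return re.findall(r"[a-z]+", text.lower())

    def stem(word):
        # Toy stand-in for a real stemmer: strip a final 's' from longer words.
        return word[:-1] if word.endswith("s") and len(word) > 3 else word

    def document_representative(text):
        tokens = tokenize(text)                             # tokenization
        tokens = [t for t in tokens if t not in STOPWORDS]  # stop word removal
        stems = [stem(t) for t in tokens]                   # stemming
        return sorted(set(stems))                           # index terms (class names)

    print(document_representative("The cats sat in the living room."))
    # -> ['cat', 'living', 'room', 'sat']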
Lexical Analysis/Tokenization of Text
 Change the text of the documents into words to be adopted as index
terms.
 Objective - identify words in the text and decide how to handle:
 Digits, hyphens, punctuation marks, and the case of letters.
 Numbers are usually not good index terms (like 1910, 1999); but some,
e.g. 510 B.C., are unique and worth keeping.
 Hyphens – break up hyphenated words (e.g. state-of-the-art = state of the
art); but some words, e.g. gilt-edged, B-49, are unique terms which
require their hyphens.
 Punctuation marks – remove totally unless significant; e.g. in program
code, x.exe and xexe are different identifiers.
 Case of letters – usually not important, so convert all letters to upper or
lower case.
Tokenization

 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: Tokens (a token is an instance of a sequence of characters that
are grouped together as a useful semantic unit for processing):
 Friends
 Romans
 and
 Countrymen
 Each such token is now a candidate for an index entry, after
further processing.
 But what are valid tokens to emit?
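A minimal tokenizer along these lines in Python (a single regular expression; real tokenizers handle far more cases, as the next slides show):

    import re

    def tokenize(text):
        # Keep maximal runs of letters; everything else is a separator.
        return re.findall(r"[A-Za-z]+", text)

    print(tokenize("Friends, Romans and Countrymen"))
    # -> ['Friends', 'Romans', 'and', 'Countrymen']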
Issues in Tokenization

 One word or multiple: How do you decide whether it is one token or
two or more?
 Hewlett-Packard → Hewlett and Packard as two tokens?
 State-of-the-art: break up the hyphenated sequence?
 San Francisco, Los Angeles
 Addis Ababa, Arba Minch
 lowercase, lower-case, lower case?
 Data base, database, data-base
 Numbers:
 Dates (3/12/19 vs. Mar. 12, 2019);
 Phone numbers;
 IP addresses (100.2.86.144)
Issues in Tokenization

 How to handle special cases involving apostrophes, hyphens,
etc.? C++, C#, URLs, emails, …
 Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a token.
 However, frequently they are not.
 The simplest approach is to ignore all numbers and punctuation and
use only case-insensitive unbroken strings of alphabetic
characters as tokens.
 Generally, don’t index numbers as text, but they are often very useful.
 Systems will often index “meta-data”, including creation date, format, etc.,
separately.
 Issues of tokenization are language-specific.
 They require the language of the document to be known.
Exercise: Tokenization

 The cat slept peacefully in the living room. It’s a very old cat.

 Mr. O’Neill thinks that the boys’ stories about Chile’s capital
aren’t amusing.
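One possible tokenization policy for the exercise, sketched in Python: keep internal apostrophes so that “It’s”, “O’Neill” and “aren’t” survive as single tokens (one policy among several defensible ones):

    import re

    # Letters, optionally joined by internal apostrophes (It's, O'Neill, aren't).
    PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

    sentences = [
        "The cat slept peacefully in the living room. It's a very old cat.",
        "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.",
    ]
    for sentence in sentences:
        # Note: the possessive in "boys'" loses its trailing apostrophe.
        print(PATTERN.findall(sentence))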
Elimination of Stopwords

 Stopwords are extremely common words across document collections that
have no discriminatory power.
 They may occur in 80% of the documents in a collection.
 They appear to be of little value in helping select documents matching a user
need, and need to be filtered out from the potential index terms.
 Examples of stopwords are articles, pronouns, prepositions, conjunctions,
etc.:
 Articles (a, an, the); pronouns (I, he, she, it, their, his);
 Some prepositions (on, of, in, about, besides, against, over);
 Conjunctions/connectors (and, but, for, nor, or, so, yet);
 Verbs (is, are, was, were);
 Adverbs (here, there, out, because, soon, after); and
 Adjectives (all, any, each, every, few, many, some) can also be treated as stopwords.
 Stopwords are language dependent.
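A minimal stopword filter in Python (the stop list here is a tiny illustrative subset; production systems use curated lists of a few hundred entries):

    STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
                 "and", "but", "or", "is", "are", "was", "were"}

    def remove_stopwords(tokens):
        # Compare case-insensitively against the stop list.
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords(["The", "cat", "slept", "in", "the", "room"]))
    # -> ['cat', 'slept', 'room']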
Stop words

 Intuition:
 Stopwords have little semantic content; it is typical to remove
such high-frequency words.
 Stopwords can take up about 50% of the text; hence, removing them
reduces the indexed document size by 30-50%.
 Smaller indices for information retrieval.
 Good compression techniques for indices: the 30 most common
words account for 30% of the tokens in written text.
 Better approximation of importance for classification, summarization,
etc.
How to determine a list of Stop words?

 One method: Sort terms (in decreasing order) by collection frequency
and take the most frequent ones (see the sketch below).
 Problem: In a collection about insurance practices, “insurance”
would become a stop word.
 Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
 Why do we need stop lists? With a stop list, we can exclude
the commonest words from the index terms entirely.
 With the removal of stopwords, we get a better
approximation of importance for classification, summarization, etc.
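A sketch of the first method in Python (toy documents; note how the domain word “insurance” lands on the list, which is exactly the problem mentioned above):

    from collections import Counter

    def stopword_candidates(documents, n=30):
        # Count term occurrences over the whole collection, then take the
        # n most frequent terms as stopword candidates.
        counts = Counter(term for doc in documents for term in doc.lower().split())
        return [term for term, _ in counts.most_common(n)]

    docs = ["the insurance policy covers the car",
            "the insurance claim was filed on time"]
    print(stopword_candidates(docs, n=3))
    # 'insurance' appears right after 'the'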
Stop words

 Stop word elimination used to be standard in older IR systems.
 But the trend is moving away from doing this. Most web search
engines index stop words:
 Good query optimization techniques mean you pay little at query
time for including stop words.
 You need stop words for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stop words might reduce recall (e.g. for “To be or not
to be”, everything is eliminated except “be”, giving no retrieval or irrelevant retrieval).
Stemming/Morphological Analysis

 Stemming reduces tokens to their “root” form to handle
morphological variation.
 The process involves removal of affixes (i.e. prefixes and suffixes) with
the aim of reducing variants to the same stem.
 It often removes the inflectional and derivational morphology of a word.
 Inflectional morphology: varies the form of words in order to express grammatical
features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
 Derivational morphology: makes new words from old ones. E.g. creation is formed
from create, but they are two separate words. And also, destruction → destroy.
 Stemming is language dependent:
 Correct stemming is language specific and can be complex.
 For example, “compressed and compression are both accepted”
stems to “compress and compress are both accept”.
Stemming

 Stemming is the process of reducing inflected (or sometimes derived)
words to their word stem.
 A stem: the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
 Example: ‘connect’ is the stem for {connected, connecting,
connection, connections}.
 Thus, [automate, automatic, automation] all reduce to →
automat
 A class name is assigned to a document if and only if one of its
members occurs as a significant word in the text of the document.
 A document representative then becomes a list of class names,
which are often referred to as the document’s index terms/keywords.
 Queries: Queries are handled in the same way.
Ways to implement stemming

 There are basically two ways to implement stemming.
 The first approach is to create a big dictionary that maps words to
their stems.
 The advantage of this approach is that it works perfectly (insofar as the
stem of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
 The second approach is to use a set of rules that extract stems from
words.
 The advantages of this approach are that the code is typically small, and
it can gracefully handle new words; the disadvantage is that it
occasionally makes mistakes.
 But, since stemming is imperfectly defined anyway, occasional mistakes
are tolerable, and the rule-based approach is the one that is generally
chosen.
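A sketch combining the two approaches in Python (the dictionary entries and the crude suffix rule are purely illustrative, not a real stemmer):

    # An exceptions dictionary handles irregular forms the rules would get wrong.
    STEM_DICT = {"ponies": "pony", "ran": "run"}

    def stem(word):
        # First approach: exact dictionary lookup (precise, but needs maintenance).
        if word in STEM_DICT:
            return STEM_DICT[word]
        # Second approach: suffix-stripping rules (compact, handles unseen words).
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) > 2:
                return word[:-len(suffix)]
        return word

    print(stem("ponies"), stem("connected"), stem("cats"))
    # -> pony connect cat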
Porter Stemmer

 Stemming is the operation of stripping the suffixes from a word, leaving
its stem.
 Google, for instance, uses stemming to search for web pages
containing the words connected, connecting, connection and
connections when users ask for a web page that contains the word
connect.
 In 1979, Martin Porter developed a stemming algorithm that uses a set
of rules to extract stems from words, and though it makes some
mistakes, most common words seem to work out right.
 Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer

 It is the most common algorithm for stemming English words to
their common grammatical root.
 It uses a simple procedure for removing known affixes in English
without using a dictionary. To get rid of plurals, the following
rules (step 1a) are used (a sketch follows below):
 SSES → SS     caresses → caress
 IES → I       ponies → poni
 SS → SS       caress → caress
 S → (nil)     cats → cat
 EMENT → (nil) (delete the final EMENT if what remains is longer than
1 character):
 replacement → replac
 cement → cement
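These plural rules transcribe almost directly into Python (a sketch of step 1a only; the full algorithm has several more steps and conditions):

    def porter_step_1a(word):
        # Rules are tried so that the longest matching suffix wins.
        if word.endswith("sses"):
            return word[:-2]      # SSES -> SS  (caresses -> caress)
        if word.endswith("ies"):
            return word[:-2]      # IES  -> I   (ponies -> poni)
        if word.endswith("ss"):
            return word           # SS   -> SS  (caress -> caress)
        if word.endswith("s"):
            return word[:-1]      # S    -> nil (cats -> cat)
        return word

    for w in ("caresses", "ponies", "caress", "cats"):
        print(w, "->", porter_step_1a(w))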
Porter stemmer

 While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:
 agreed → agree
 disabled → disable
 matting → mat
 mating → mate
 meeting → meet
 milling → mill
 messing → mess
 meetings → meet
 feed → feed
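The full algorithm is also available off the shelf; for example, NLTK ships a Porter implementation (install with pip install nltk):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["caresses", "ponies", "meeting", "feed", "connections"]:
        print(word, "->", stemmer.stem(word))
    # caresses -> caress, ponies -> poni, meeting -> meet,
    # feed -> feed, connections -> connect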
Stemming: challenges

 May produce unusual stems that are not English words:
 Removing ‘UAL’ from FACTUAL and EQUAL.
 May conflate (reduce to the same token) words that are
actually distinct:
 “computer”, “computational”, “computation” are all reduced to the same
token “comput”.
 May not recognize all morphological derivations.
Assignment

 In a group of three, study Porter’s stemming algorithm in
detail and implement it using Python.
Thesauri

 Full-text searching mostly cannot be accurate, since different
authors may select different words to represent the same
concept.
 Problem: The same meaning can be expressed using different
terms that are synonyms.
 Synonym: a word or phrase which has the same or nearly the same
meaning as another word or phrase in the same language.
 Homonym: a word that sounds the same or is spelled the same as
another word but has a different meaning.
 How can we ensure that for the same meaning
identical terms are used in the index and the query?
Thesauri

 Thesaurus: The vocabulary of a controlled indexing language,
formally organized so that a priori relationships between
concepts (for example as “broader” and “related”) are made
explicit.
 A thesaurus contains terms and relationships between terms.
 IR thesauri typically rely upon the use of symbols such as:
USE/UF (UF = used for), BT (Broader Term), and RT (Related
Term) to demonstrate inter-term relationships.
 e.g., car = automobile, truck, bus, taxi, motor vehicle
 color = colour, paint
Aim of Thesaurus

 A thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms.
 The aim of a thesaurus is therefore:
 To provide a standard vocabulary for indexing and searching.
 Thesaurus rewriting forms equivalence classes, and we index such
equivalences:
 When the document contains automobile, index it under car as
well (usually, also vice-versa).
 To assist users with locating terms for proper query formulation:
when the query contains automobile, look under car as well when
expanding the query (see the sketch below).
 To provide classified hierarchies that allow the broadening and
narrowing of the current request according to user needs.
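A minimal sketch of this kind of query expansion in Python (the two-entry thesaurus is purely illustrative):

    # Each term maps to the other members of its equivalence class.
    THESAURUS = {
        "automobile": ["car", "motor vehicle"],
        "car": ["automobile", "motor vehicle"],
    }

    def expand_query(terms):
        expanded = list(terms)
        for term in terms:
            # When the query contains 'automobile', look under 'car' as well.
            expanded.extend(THESAURUS.get(term, []))
        return sorted(set(expanded))

    print(expand_query(["automobile", "rental"]))
    # -> ['automobile', 'car', 'motor vehicle', 'rental']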
Thesaurus Construction

 Example: a thesaurus built to assist IR for searching cars and
vehicles:
 Term: Motor vehicles:
 UF : Automobiles (UF - Use For)
 Cars
 Trucks
 BT: Vehicles (BT – Broader Term)
 RT: Road Engineering (RT - Related Term)
 Road Transport
More Example

 Example: a thesaurus built to assist IR in the field of computer
science:
 TERM: natural languages
 UF natural language processing (UF = used for)
 BT languages (BT = broader term is languages)
 NT languages (NT = narrower related term)
 TT languages (TT = top term is languages)
 RT artificial intelligence (RT = related term/s)
 computational linguistics
 formal languages
 query languages
 speech recognition
Language-specificity

 Many of the above features embody transformations that are:
 Language-specific and
 Often, application-specific.
 These are “plug-in” addenda to the indexing process.
 Both open source and commercial plug-ins are available for
handling these.
Index Term Selection

 An index language is the language used to describe documents
and requests.
 Elements of the index language are index terms, which may
be derived from the text of the document to be described, or
may be arrived at independently.
 If a full-text representation of the text is adopted, then all words
in the text are used as index terms = full-text indexing.
 Otherwise, one needs to select the words to be used as index terms
in order to reduce the size of the index file, which is basic to designing an
efficient IR search system.
Question & Answer

Thank You !!!
