Introduction To Information Storage and Retrieval: Chapter Four: Indexing Structure

The document discusses indexing and inverted files. Indexing is used to speed up document retrieval by organizing index terms. An inverted file consists of a vocabulary list of index terms, each paired with the documents that contain that term. For each term-document pair, additional information may be stored like term frequency and location within the document. Inverted files allow fast retrieval of relevant documents for a query term.

Uploaded by

Aaron Melendez

Introduction to Information Storage and Retrieval
Chapter Four: Indexing Structure

Indexing: Basic Concepts
• Indexing is an arrangement of index terms that permits fast searching and reduces the memory space requirement.
• It is used to speed up access to the desired information in a document collection in response to a user's query.
• It improves efficiency in terms of retrieval time: relevant documents are searched and retrieved quickly.
• An index file usually keeps its index terms in sorted order. Which list is easier to search?
    fox pig zebra hen ant cat dog lion ox
    ant cat dog fox hen lion ox pig zebra
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
• Remember Heaps' Law: in a 1 GB text collection the vocabulary has a size of only about 5 MB. This size may be further reduced by linguistic pre-processing (or text operations).
• The usual unit for indexing is the word.
• Index terms are used to look up records in a file.
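The question of which list is easier to search can be made concrete with a minimal sketch. It assumes the terms are held in an in-memory Python list; the helper names `linear_search` and `binary_search` are illustrative and not part of the slides.

```python
from bisect import bisect_left

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)

def linear_search(terms, key):
    """Scan terms one by one; return (position or -1, number of comparisons)."""
    for comparisons, term in enumerate(terms, start=1):
        if term == key:
            return comparisons - 1, comparisons
    return -1, len(terms)

def binary_search(terms, key):
    """Binary search on a sorted list; return the position of key or -1."""
    pos = bisect_left(terms, key)
    return pos if pos < len(terms) and terms[pos] == key else -1

print(linear_search(unsorted_terms, "lion"))  # position and comparisons in the unsorted list
print(binary_search(sorted_terms, "lion"))    # position in the sorted list
```

On the unsorted list every term may have to be inspected, while the sorted list needs only about log2(n) comparisons.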
Major Steps in Index Construction
• Source file: a collection of text documents.
  – A document can be described by a set of representative keywords called index terms.
• Index term selection: apply text operations (preprocessing).
  – Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes.
  – Stop word removal: very frequent words carry little content and are removed from the text collection.
  – Stemming: reduce variant forms of a word to their common stem/root.
  – Term relevance weighting: different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document. There are different term weighting methods, including TF, IDF, TF*IDF, …
• Indexing structure: the set of index terms (the vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.
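The text operations listed above can be sketched as a small pipeline. This is only an illustration: the stop word list is a made-up sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, so its output will differ from a proper linguistic preprocessor.

```python
import re
from collections import Counter

# Assumption: a tiny sample stop list, not a standard one.
STOP_WORDS = {"the", "of", "in", "was", "and", "to", "a", "is"}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(word):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def term_frequencies(text):
    """Raw TF weights: how often each index term occurs in the text."""
    return Counter(index_terms(text))

print(index_terms("Friends, Romans, countrymen."))
print(term_frequencies("The cats chased the cat"))
```

Raw term frequency is used here as the simplest weight; IDF or TF*IDF would need statistics from the whole collection.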
Basic Indexing Process
Documents to be indexed:   Friends, Romans, countrymen.
   → Tokenizer
Token stream:              Friends  Romans  countrymen
   → Linguistic preprocessor
Modified tokens:           friend  roman  countryman
   → Indexer
Index file (inverted file):
   friend       2   4
   roman        1   2
   countryman   13  16
Index File Evaluation Metrics
• Running time of the main operations
  – Access/search time: how long does it take to find the required search key in the list?
  – Update time (insertion time, deletion time): how long does it take to update existing records when adding new terms or deleting unnecessary ones? Does the indexing structure allow incremental update, or does it require re-indexing?
• Space overhead
  – Computer storage space consumed for keeping the list.
Building the Index File
• An index file of a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain the term.
• An index file is a list of search terms organized for associative look-up, i.e., to answer the user's queries:
  – In which documents does a specified search term appear?
  – Where within each document does each term appear? (There may be several occurrences.)
• There are various options for organizing the index file of a document collection:
  – Decide which data structure and/or file structure to use: sequential file, inverted file, suffix tree, etc.
Sequential File
• The sequential file is the most primitive file structure.
• It has neither a vocabulary nor linking pointers.
• The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
  – A particular attribute is chosen as the primary key, whose value determines the order of the records.
  – When the first key fails to discriminate among records, a second key is chosen to give an order.
Example:
Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
After all documents have been tokenized, stop words are removed, and normalization and stemming are applied to generate the index terms. In a sequential file these index terms are sorted in alphabetical order.

Tokens in order of occurrence, as Term (Doc #):
I (1), did (1), enact (1), julius (1), caesar (1), I (1), was (1), killed (1), I (1), the (1), capitol (1), brutus (1), killed (1), me (1), so (2), let (2), it (2), be (2), with (2), caesar (2), the (2), noble (2), brutus (2), hath (2), told (2), you (2), caesar (2), was (2), ambitious (2)

Sequential file:
No.   Term       Doc #
1     ambition   2
2     brutus     1
3     brutus     2
4     caesar     1
5     caesar     2
6     caesar     2
7     capitol    1
8     enact      1
9     julius     1
10    kill       1
11    kill       1
12    noble      2
Sequential File
• To access records, search serially: starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
• Its main advantages:
  – It is easy to implement.
  – It provides fast access to the next record using lexicographic order.
  – It can be searched quickly using binary search, in O(log n).
• Update options: does the index need to be rebuilt, or is incremental update supported?
• Its disadvantages:
  – No weights are attached to terms.
  – Random access is slow: since every occurrence of a term is indexed as an individual record, we need to find all the records that match the query term.
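A minimal sketch of a sequential file and a serial search over it, using the index terms of the two-document example. The records are modelled as in-memory `(term, doc_id)` tuples, which is an assumption made for illustration; a real sequential file would live on disk.

```python
# (term, doc_id) records after text operations, in order of occurrence
records = [
    ("enact", 1), ("julius", 1), ("caesar", 1), ("kill", 1), ("capitol", 1),
    ("brutus", 1), ("kill", 1), ("caesar", 2), ("noble", 2), ("brutus", 2),
    ("caesar", 2), ("ambition", 2),
]

# The sequential file keeps records sorted on the primary key (term);
# the document ID acts as the second key when terms are equal.
sequential_file = sorted(records)

def serial_search(file_records, term):
    """Read records from the first one onward and collect every match."""
    matches = []
    for key, doc_id in file_records:
        if key == term:
            matches.append(doc_id)
        elif key > term:
            break  # sorted order: no later record can match
    return matches

print(serial_search(sequential_file, "caesar"))
```

Because each occurrence is a separate record, a query term such as "caesar" yields one hit per record, including repeated hits for the same document.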
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
  – Building and maintaining an inverted index is a relatively low-cost task. On a text of n words an inverted index can be built in O(n) time.
  – The list is inverted from a list of terms in location order to a list of terms in alphabetical order.

[Figure: original documents → word extraction → inverted files. Each word ID is paired with the IDs of the documents that contain it:
  W1: d1, d2, d3
  W2: d2, d4, d7, d9
  …
  Wn: di, …, dn]
Inverted file
Data to be held in the inverted file includes:
• The vocabulary (list of terms): the set of all distinct words (index terms) in the text collection.
  – Having information about the vocabulary (list of terms) speeds up searching for relevant documents.
• For each term, information related to:
  – Location: all the text locations/positions where the word occurs.
  – Frequency of occurrence of terms in the document collection:
      $TF_{ij}$, number of occurrences of term $t_j$ in document $d_i$
      $DF_j$, number of documents containing $t_j$
      $CF_j$, total frequency of $t_j$ in the corpus
      $m_i$, maximum frequency of any term in $d_i$
      $n$, total number of documents in the collection
      ….
Inverted file
• Information about the location of each term within the document helps with:
  – user interface design: highlighting the location of search terms
  – proximity-based ranking: adjacency and near operators (in Boolean searching)
• Information about frequency is used for:
  – calculating term weights (like TF, TF*IDF, …)
  – optimizing query processing
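To show why stored positions matter, here is a small sketch of a positional index and a NEAR operator. The two sample sentences, the whitespace tokenizer, and the function names are assumptions made for the example, not something prescribed by the slides.

```python
def positional_index(docs):
    """Map each word to {doc_id: [word positions]}, positions counted from 1."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index.setdefault(word, {}).setdefault(doc_id, []).append(pos)
    return index

def near(index, term_a, term_b, k=1):
    """Documents in which term_a and term_b occur within k word positions."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id] for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical sample texts
sample_docs = {
    1: "brutus killed caesar in the capitol",
    2: "caesar was ambitious and noble brutus told you",
}
index = positional_index(sample_docs)
print(near(index, "brutus", "caesar", k=2))
```

Without positions, the index could only report that both documents contain both terms; with them, the query can require the terms to be close together.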
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

Is it possible to keep all this information during searching?

Construction of Inverted file
• An inverted index consists of two files: the vocabulary file and the postings file.
• The vocabulary file (word list) stores:
  – all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order, and
  – for each word, a pointer to the postings file.
• The record kept for each term j in the word list contains the following:
  – term j
  – the number of documents in which term j occurs ($DF_j$)
  – the collection frequency of term j
  – a pointer to the inverted (postings) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, it stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
• The postings are stored as separate inverted lists, i.e., one list for each term in the index file.
• Each list consists of one or many individual postings.
• Advantage of dividing the inverted file: keeping a pointer in the vocabulary to the list in the postings file allows
  – the vocabulary to be kept in memory at search time, even for a large text collection, and
  – the postings file to be kept on disk for access to the documents.
General structure of Inverted File
The following figure shows the general structure of an inverted index file.

Organization of Index File
Vocabulary (word list)                      Postings (inverted lists)      Documents
Term     DF   CF   Pointer to posting
term 1   3    3    →                        inverted list for term 1
term 2   3    4    →                        inverted list for term 2
term 3   1    1    →                        inverted list for term 3
term 4   2    3    →                        inverted list for term 4
Example:
Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
After all documents have been tokenized, the inverted file is sorted by terms.

Tokens in order of occurrence, as Term (Doc #):
I (1), did (1), enact (1), julius (1), caesar (1), I (1), was (1), killed (1), I (1), the (1), capitol (1), brutus (1), killed (1), me (1), so (2), let (2), it (2), be (2), with (2), caesar (2), the (2), noble (2), brutus (2), hath (2), told (2), you (2), caesar (2), was (2), ambitious (2)

Sorted by term, as Term (Doc #):
ambitious (2), be (2), brutus (1), brutus (2), caesar (1), caesar (2), caesar (2), capitol (1), did (1), enact (1), hath (2), I (1), I (1), I (1), it (2), julius (1), killed (1), killed (1), let (2), me (1), noble (2), so (2), the (1), the (2), told (2), was (1), was (2), with (2), you (2)
Remove stop words, stemming & compute frequency
• Multiple entries of a term in a single document are merged, and frequency information is added.
• Counting the number of occurrences of terms in the collection helps to compute TF.

Before merging:            After merging:
Term       Doc #           Term       Doc #   TF
ambition   2               ambition   2       1
brutus     1               brutus     1       1
brutus     2               brutus     2       1
caesar     1               caesar     1       1
caesar     2               caesar     2       2
caesar     2               capitol    1       1
capitol    1               enact      1       1
enact      1               julius     1       1
julius     1               kill       1       2
kill       1               noble      2       1
kill       1
noble      2
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) file and a postings file. Each vocabulary entry holds a pointer (→) to its postings list.

Vocabulary                         Postings
Term       DF   CF   Pointer       (Doc #, TF)
ambition   1    1    →             (2, 1)
brutus     2    2    →             (1, 1), (2, 1)
caesar     2    3    →             (1, 1), (2, 2)
capitol    1    1    →             (1, 1)
enact      1    1    →             (1, 1)
julius     1    1    →             (1, 1)
kill       1    2    →             (1, 2)
noble      1    1    →             (2, 1)
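The split into a vocabulary and a postings file can be sketched as follows. Both "files" are modelled as in-memory Python lists, and the pointer is simply an offset into the postings list; these representation choices and the function name `build_inverted_file` are assumptions for illustration.

```python
from collections import Counter

# Index terms per document after stop word removal and stemming
docs = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

def build_inverted_file(docs):
    """Return (vocabulary, postings_file) built from {doc_id: [terms]}."""
    lists = {}
    for doc_id in sorted(docs):
        # Merge repeated terms within a document into one (doc_id, tf) posting
        for term, tf in Counter(docs[doc_id]).items():
            lists.setdefault(term, []).append((doc_id, tf))
    vocabulary, postings_file = [], []
    for term in sorted(lists):
        plist = lists[term]
        vocabulary.append({
            "term": term,
            "df": len(plist),                   # documents containing the term
            "cf": sum(tf for _, tf in plist),   # occurrences in the collection
            "pointer": len(postings_file),      # offset of the first posting
        })
        postings_file.extend(plist)
    return vocabulary, postings_file

vocabulary, postings_file = build_inverted_file(docs)
for entry in vocabulary:
    start = entry["pointer"]
    print(entry, postings_file[start:start + entry["df"]])
```

Storing only an offset and DF in the vocabulary keeps it small enough to hold in memory, while the postings list can be read from its position on demand.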
Searching on Inverted File
• Since the whole index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes little memory even for a large document collection.
• Using binary search, the search takes logarithmic time.
  – The search is performed on the vocabulary list.
• Updating an inverted file is very complex.
  – We need to update both the vocabulary and the postings file.
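A sketch of the look-up path, assuming the vocabulary of the two-document example is already loaded in memory as a sorted term list with a parallel list of (DF, pointer) entries. The data layout and the names `lookup` and `search_and` are illustrative assumptions.

```python
from bisect import bisect_left

# Vocabulary kept in memory: sorted terms plus (df, pointer) per term
terms = ["ambition", "brutus", "caesar", "capitol", "enact", "julius", "kill", "noble"]
entries = [(1, 0), (2, 1), (2, 3), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9)]

# Postings "file": a flat sequence of (doc_id, tf) pairs
postings_file = [(2, 1), (1, 1), (2, 1), (1, 1), (2, 2),
                 (1, 1), (1, 1), (1, 1), (1, 2), (2, 1)]

def lookup(term):
    """Binary-search the vocabulary, then follow the pointer to the postings."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []
    df, pointer = entries[i]
    return postings_file[pointer:pointer + df]

def search_and(term_a, term_b):
    """Boolean AND: documents that contain both terms."""
    docs_a = {doc for doc, _ in lookup(term_a)}
    docs_b = {doc for doc, _ in lookup(term_b)}
    return sorted(docs_a & docs_b)

print(lookup("caesar"))
print(search_and("brutus", "noble"))
```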
Example: Create Inverted file
Map the file names to file IDs. Consider the following original documents:

D1: The Department of Computer Science was established in 1984.
D2: The Department launched its first BSc in Computer Studies in 1987.
D3: Followed by the MSc in Computer Science which was started in 1991.
D4: The Department also produced its first PhD graduate in 1994.
D5: Our staff have contributed intellectually and professionally to the advancements in these fields.
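One possible way to work this example is sketched here. It maps the file names D1 to D5 to integer IDs and records word positions, but deliberately skips stop word removal and stemming, so the resulting vocabulary is larger than a fully preprocessed one would be. The tokenizer and names are assumptions.

```python
import re

documents = {
    "D1": "The Department of Computer Science was established in 1984.",
    "D2": "The Department launched its first BSc in Computer Studies in 1987.",
    "D3": "Followed by the MSc in Computer Science which was started in 1991.",
    "D4": "The Department also produced its first PhD graduate in 1994.",
    "D5": "Our staff have contributed intellectually and professionally "
          "to the advancements in these fields.",
}

# Map the file names to file IDs: D1 -> 1, ..., D5 -> 5
file_ids = {name: i for i, name in enumerate(sorted(documents), start=1)}

def build_index(documents, file_ids):
    """Return {term: {file_id: [word positions]}} for lower-cased tokens."""
    index = {}
    for name, text in documents.items():
        doc_id = file_ids[name]
        for pos, token in enumerate(re.findall(r"\w+", text.lower()), start=1):
            index.setdefault(token, {}).setdefault(doc_id, []).append(pos)
    return index

index = build_index(documents, file_ids)
print(index["computer"])            # file IDs and positions for "computer"
print(sorted(index["department"]))  # file IDs containing "department"
```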
Suffix tree
• What is a suffix? A suffix is a substring that occurs at the end of the given string.
• Each position in the text is considered as the start of a text suffix.
• If $txt = t_1 t_2 \ldots t_i \ldots t_n$ is a string, then $T_i = t_i t_{i+1} \ldots t_n$ is the suffix of txt that starts at position i, where $1 \le i \le n$.

Example:
txt = mississippi            txt = GOOGOL
T1  = mississippi            T1 = GOOGOL
T2  = ississippi             T2 = OOGOL
T3  = ssissippi              T3 = OGOL
T4  = sissippi               T4 = GOL
T5  = issippi                T5 = OL
T6  = ssippi                 T6 = L
T7  = sippi
T8  = ippi
T9  = ppi
T10 = pi
T11 = i

Exercise: generate the suffixes of "technology".
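The definition of $T_i$ translates almost directly into code. A minimal sketch (the function name `suffixes` is illustrative); positions are reported 1-based to match the notation above.

```python
def suffixes(txt):
    """Return the list of (i, T_i) pairs, with positions counted from 1."""
    return [(i + 1, txt[i:]) for i in range(len(txt))]

for position, suffix in suffixes("GOOGOL"):
    print(f"T{position} = {suffix}")
```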
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes of the text.
  – Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
• To build the suffix trie we use these indices instead of the actual objects.
• The structure has several advantages:
  – It requires less storage space.
  – We do not have to worry about how the text is represented (binary, ASCII, etc.).
  – We do not have to store the same object twice (no duplicates).
Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, from left to right, following the order of the characters in the string.
      TEXT:      G O O G O L $
      POSITION:  1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves (one per suffix) and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
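A suffix trie for GOOGOL can be sketched with nested dictionaries. This is one simple representation chosen for illustration, not the only one: each node is a dict keyed by a single character, and a leaf stores the starting position of its suffix under the key "pos". It assumes "$" does not occur in the text itself.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    text += "$"                      # terminator makes every suffix end in a leaf
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1      # 1-based starting position of this suffix
    return root

def count_leaves(node):
    """A leaf is a node that carries a 'pos' entry."""
    if "pos" in node:
        return 1
    return sum(count_leaves(child) for child in node.values())

trie = build_suffix_trie("GOOGOL")
print(sorted(trie))          # first characters of all suffixes
print(count_leaves(trie))    # one leaf per suffix of GOOGOL$
```

Note that this naive construction inserts each suffix character by character, so it takes time quadratic in the length of the text.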
Suffix tree
• A suffix tree is a compact form of the suffix trie built over all the suffixes of S.
• The suffix tree is created by compacting the unary nodes of the suffix trie.
• We store pointers rather than words in the leaves.
• It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end index of the string in the text, i.e., for GOOGOL$:
      (3,7) for OGOL$
      (1,2) for GO
      (7,7) for $
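The (a, b) edge encoding can be checked with a one-line helper. This sketch assumes 1-based, inclusive indices into the text GOOGOL$, as in the positions listed above; `edge_label` is an illustrative name.

```python
text = "GOOGOL$"

def edge_label(text, a, b):
    """Recover the edge string from its 1-based, inclusive index pair (a, b)."""
    return text[a - 1:b]

for pair in [(3, 7), (1, 2), (7, 7)]:
    print(pair, edge_label(text, *pair))
```

Storing two integers per edge instead of the substring itself keeps the size of each edge constant, no matter how long the label is.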
Example: Suffix tree
• Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
      { $, b$, ab$, bab$, abab$ }
• We label each leaf with the starting point of the corresponding suffix.

root
   ab
      ab$  → 1
      $    → 3
   b
      ab$  → 2
      $    → 4
   $       → 5
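Compacting unary nodes can be sketched by first building the suffix trie and then merging every chain of single-child nodes into one edge. The nested-dict representation, the "pos" key for leaves, and the function names are assumptions made for this example; it is a quadratic-time illustration, not an efficient construction.

```python
def build_suffix_trie(text):
    """Suffix trie of text + '$' as nested dicts; leaves hold {'pos': start}."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

def compact(node):
    """Merge chains of single-child nodes into edges labelled with strings."""
    if "pos" in node:
        return node["pos"]                    # leaf: keep only the pointer
    tree = {}
    for ch, child in node.items():
        label = ch
        # Follow the chain while the child is unary and not a leaf
        while "pos" not in child and len(child) == 1:
            next_ch, child = next(iter(child.items()))
            label += next_ch
        tree[label] = compact(child)
    return tree

print(compact(build_suffix_trie("abab")))
```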
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
• To make the suffixes prefix-free we add a special character, $, at the end of s. To associate each suffix with a unique string in S, add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
• Ex.: Let s1 = abab and s2 = aab. The suffixes are
      { $, b$, ab$, bab$, abab$ } and { #, b#, ab#, aab# }
  and a generalized suffix tree for s1 and s2 is (leaves give the string and the starting position of the suffix):

root
   a
      ab#     → s2: 1
      b
         ab$  → s1: 1
         $    → s1: 3
         #    → s2: 2
   b
      ab$     → s1: 2
      $       → s1: 4
      #       → s2: 3
   $          → s1: 5
   #          → s2: 4
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is a prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
  – Start at the root.
  – Go down the tree, taking at each step the path that corresponds to the next characters of S.
  – If S leads to a node x, return all the leaves in the subtree rooted at x: the places where S can be found are given by the pointers in those leaves.
  – If a NIL pointer is encountered before reaching the end of S, then S is not in the tree.
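The pseudo-code can be turned into a small runnable sketch. For simplicity it searches an uncompressed suffix trie of GOOGOL stored as nested dicts (an assumption for this example); the walk is the same in a compressed suffix tree, except that edges carry strings instead of single characters.

```python
def build_suffix_trie(text):
    """Suffix trie of text + '$' as nested dicts; leaves hold {'pos': start}."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

def collect_positions(node):
    """Gather the pointers stored in all leaves below node."""
    found = []
    for key, child in node.items():
        if key == "pos":
            found.append(child)
        else:
            found.extend(collect_positions(child))
    return sorted(found)

def search(root, pattern):
    """Return the 1-based start positions of pattern, or [] if it is absent."""
    node = root
    for ch in pattern:
        if ch not in node:           # the NIL pointer case
            return []
        node = node[ch]
    return collect_positions(node)

trie = build_suffix_trie("GOOGOL")
print(search(trie, "GO"))   # start positions of the suffixes beginning with GO
print(search(trie, "OR"))   # empty: the path breaks after O
```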
Example (text = GOOGOL$):
• If S = "GO", we take the GO path and return the suffixes GOOGOL$ and GOL$.
• If S = "OR", we take the O path and then hit a NIL pointer, so "OR" is not in the tree.
Exercise
Given the following index terms:
      worker, word, world, run & information
construct an index file using a suffix tree.
Suffix Tree Applications
• Suffix trees can be used to solve a large number of string problems that occur in:
  – text editing,
  – free-text search,
  – etc.
• Some examples of string problems are given below:
  – String matching
  – Longest common substring
  – Longest repeated substring
  – Palindromes
  – etc.
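As an illustration of one of these problems, the longest repeated substring can be found by comparing neighbouring suffixes in sorted order. Note that this sketch uses a sorted list of suffixes rather than an actual suffix tree; it is a simpler stand-in that relies on the same idea (a repeated substring is a common prefix of two suffixes) and is not tuned for efficiency.

```python
def longest_repeated_substring(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    # In sorted order, the suffixes sharing the longest prefix are adjacent
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeated_substring("mississippi"))
print(longest_repeated_substring("GOOGOL"))
```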
Complexity Analysis
• With naive insertion of one suffix at a time, the suffix tree for a string of length n can be built in O(n^2) time.
• Searching is very fast: the search time is linear in the length of the query string S. Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.
• The number of leaves is n+1, where n is the length of the input string (one leaf per suffix, including the terminator).
• Furthermore, in the leaves we may store either the strings themselves or pointers to the strings (that is, integers).
• Expensive memory-wise: suffix trees consume a lot of space.
• How many bytes are required to store MISSISSIPPI?
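To get a feeling for the space cost, one can count the nodes of an uncompressed suffix trie and compare the count with the length of the text. The sketch below only counts nodes; the actual number of bytes depends on how nodes and pointers are represented, which is left open here. The function name is illustrative.

```python
def count_trie_nodes(text):
    """Number of nodes (root included) in the suffix trie of text + '$'."""
    text += "$"
    root = {}
    nodes = 1                        # the root
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}
                nodes += 1           # a new node for a new distinct substring
            node = node[ch]
    return nodes

word = "mississippi"
print(len(word) + 1, "characters including the terminator")
print(count_trie_nodes(word), "trie nodes")
```

Compacting unary nodes into a suffix tree reduces the node count considerably, but every remaining node still needs pointers, so the structure stays much larger than the text itself.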
