Introduction To Information Storage and Retrieval: Chapter Four: Indexing Structure
Indexing: Basic Concepts
Indexing is an arrangement of index terms that permits fast
searching and reduces the memory space requirement.
It is used to speed up access to the desired information in a
document collection in response to a user's query.
It improves retrieval efficiency in terms of time: relevant
documents are searched and retrieved quickly.
Index file usually has index terms in a sorted order. Which list is
easier to search?
fox pig zebra hen ant cat dog lion ox
ant cat dog fox hen lion ox pig zebra
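The difference between the two lists can be sketched in Python. The unsorted list has to be scanned entry by entry, while the sorted list can be searched by binary search, here through the standard `bisect` module. The function names `linear_search` and `binary_search` are illustrative, not part of the original material.

```python
from bisect import bisect_left

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)

def linear_search(terms, key):
    """Scan every entry in turn; works on any list, O(n) comparisons."""
    for position, term in enumerate(terms):
        if term == key:
            return position
    return -1

def binary_search(terms, key):
    """Halve the search range at each step; needs sorted input, O(log n)."""
    position = bisect_left(terms, key)
    if position < len(terms) and terms[position] == key:
        return position
    return -1

print(linear_search(unsorted_terms, "lion"))  # position in the unsorted list
print(binary_search(sorted_terms, "lion"))    # position in the sorted list
```

With only nine terms the gain is invisible, but for a vocabulary of a million terms binary search needs about 20 comparisons instead of up to a million.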
An index file consists of records, called index entries.
Index files are much smaller than the original file.
Remember Heaps' Law: in a 1 GB text collection the vocabulary
has a size of only about 5 MB. This size may be further reduced by
linguistic pre-processing (or text operations).
The usual unit for indexing is the word
Index terms - are used to look up records in a file.
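For reference, Heaps' Law mentioned above is commonly written as follows, with V the vocabulary size and n the number of words in the collection. The constants K and β are not given in these slides; they are fitted empirically for each collection, and values of β of roughly 0.4 to 0.6 are commonly reported for English text.

```latex
V = K \, n^{\beta}, \qquad 0 < \beta < 1
```

Because β is below 1, the vocabulary grows much more slowly than the text, which is why the list of index terms stays small compared with the collection.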
Major Steps in Index Construction
Source file: a collection of text documents
A document can be described by a set of representative keywords called index
terms.
Index terms selection: apply text operations (preprocessing)
Tokenize: identify the words in a document, so that each document is represented
by a list of keywords or attributes
Stop word removal: words with very high frequency carry little content and
need to be removed from the text collection
Word stemming: reduce the variant forms of a word to their common stem/root
Term relevance weighting: different index terms have varying relevance when
used to describe document contents. This effect is captured by assigning
numerical weights to each index term of a document. There are different index term
weighting methods, including TF, IDF, TF*IDF, …
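The text operations above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop word list is a tiny hypothetical one, and `crude_stem` only strips a few English suffixes; it is not the Porter stemmer or any other published algorithm.

```python
import re

# Assumed, tiny stop word list; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

print(index_terms("Friends, Romans and countrymen"))
```

A suffix stripper of this kind maps "friends" to "friend" but leaves "countrymen" unchanged; mapping it to "countryman" would need dictionary-based normalization.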
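[Figure: the token stream "Friends Romans countrymen" passes through the tokenizer and linguistic processing and is stored in the inverted file as index terms with their document numbers: friend (2, 4), roman (1, 2), countryman (13, 16).]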
Index File Evaluation Metrics
Running time of the main operations
Access/search time
How long does it take to find the required search key
in the list?
Update time (insertion time, deletion time)
How long does it take to update existing records in order
to add new terms or delete unnecessary ones?
Does the indexing structure allow incremental updates, or does it require
reindexing?
Space overhead
Computer storage space consumed for keeping the list.
Building Index File
An index file of a document collection is a file consisting of a list of
index terms, each with a link to the one or more documents that contain
the index term
An index file is a list of search terms organized for
associative look-up, i.e., to answer user queries such as:
In which documents does a specified search term appear?
Where within each document does each term appear? (There may
be several occurrences.)
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
After all documents have been tokenized, stop words are removed, and
normalization and stemming are applied to generate index terms.
These index terms are stored in the sequential file, sorted in alphabetical order.

Tokens in order of occurrence:
Doc 1: I, did, enact, julius, caesar, I, was, killed, I, the, capitol, brutus, killed, me
Doc 2: so, let, it, be, with, caesar, the, noble, brutus, hath, told, you, caesar, was, ambitious

Sequential file:
No.  Term      Doc #
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
Sequential File
To access records, search serially:
starting at the first record, read and examine all the succeeding
records until the required record is found or the end of the file is
reached.
Its main advantages:
easy to implement;
provides fast access to the next record using lexicographic order;
because the terms are sorted, it can also be searched quickly using binary search, in O(log n) time.
Update options: does the index need to be rebuilt, or are incremental
updates supported?
Its disadvantages:
No weights are attached to terms.
Random access is slow: since every occurrence of a term is indexed as a separate record,
we need to find all the records that match the query term
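Both access methods described for the sequential file can be sketched as follows. The records are the (term, document number) pairs of the running example, kept in strict alphabetical order so that binary search is valid; `serial_lookup` and `binary_lookup` are illustrative names.

```python
from bisect import bisect_left, bisect_right

# Sorted (term, document number) records of the sequential file.
records = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]

def serial_lookup(records, term):
    """Read records from the start and collect every match."""
    docs = []
    for rec_term, doc in records:
        if rec_term == term:
            docs.append(doc)
        elif rec_term > term:
            break  # sorted order lets us stop once we have passed the term
    return docs

def binary_lookup(records, term):
    """Find the block of matching records with two binary searches."""
    lo = bisect_left(records, (term,))                 # first record of the term
    hi = bisect_right(records, (term, float("inf")))   # one past its last record
    return [doc for _, doc in records[lo:hi]]

print(serial_lookup(records, "caesar"))
print(binary_lookup(records, "caesar"))
```

Note that a term occurring several times yields several records, which is the redundancy that the inverted file removes.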
Inverted file
A word-oriented indexing mechanism based on a sorted list
of keywords, with each keyword having links to the
documents containing it
Building and maintaining an inverted index is a relatively low-cost
task. On a text of n words, an inverted index can be built in O(n)
time
The list is inverted from a list of terms in location order to a list of
terms in alphabetical order.
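A single pass over the text is enough to build the term-to-documents mapping, which is where the O(n) bound comes from. The sketch below assumes already tokenized, lowercased text for the two example documents and uses a Python dictionary as the vocabulary; sorting the vocabulary for display is a separate step.

```python
from collections import defaultdict

# Token sequences of the two example documents (lowercased, punctuation removed).
docs = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

def build_inverted_index(docs):
    """Visit each word once and append its document number to the term's list."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in docs[doc_id].split():
            postings = index[word]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)  # record each document once per term
    return dict(index)

index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```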
[Figure: the original documents are inverted into inverted files, in which each word W1 … Wn lists the IDs of the documents that contain it:]
W1: d1, d2, d3
W2: d2, d4, d7, d9
…
Wn: di, … dn
Inverted file
Data to be held in the inverted file includes
The vocabulary (list of terms): the set of all distinct
words (index terms) in the text collection.
Having information about the vocabulary (list of terms) speeds up searching
for relevant documents
For each term, it contains information related to
Location: all the text locations/positions where the word occurs
Frequency of occurrence of terms in the document collection:
TF_ij, number of occurrences of term t_j in document d_i
DF_j, number of documents containing t_j
CF_j, total frequency of t_j in the corpus
m_i, maximum frequency of any term in d_i
n, total number of documents in the collection
…
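These statistics can be computed directly from the index terms of each document. The sketch below uses the stemmed terms of the two example documents; the variable names mirror the symbols above and are otherwise arbitrary.

```python
from collections import Counter

# Index terms per document after stop word removal and stemming.
docs = {
    1: ["brutus", "capitol", "caesar", "enact", "julius", "kill", "kill"],
    2: ["ambition", "brutus", "caesar", "caesar", "noble"],
}

n = len(docs)                                               # number of documents
tf = {d: Counter(terms) for d, terms in docs.items()}       # tf[i][t] is TF_ij
df = Counter(t for counts in tf.values() for t in counts)   # df[t] is DF_j
cf = Counter(t for terms in docs.values() for t in terms)   # cf[t] is CF_j
m = {d: max(counts.values()) for d, counts in tf.items()}   # m[i] is m_i

print(tf[2]["caesar"], df["caesar"], cf["caesar"], m[1], n)
```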
Inverted file
Having information about the location of each term within the
document helps for:
user interface design: highlight location of search term
proximity based ranking: adjacency and near operators (in Boolean
searching)
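Storing positions is what makes a "near" operator possible. The sketch below builds a positional index for the two example documents and checks whether two terms occur within k words of each other. The function names and the word-distance definition of "near" are assumptions made for illustration.

```python
from collections import defaultdict

docs = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

def build_positional_index(docs):
    """Map term -> {document number: [word positions, counted from 1]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split(), start=1):
            index[word][doc_id].append(position)
    return index

def near(index, term_a, term_b, k):
    """Return the documents in which the two terms occur within k words."""
    hits = []
    for doc_id in sorted(set(index[term_a]) & set(index[term_b])):
        if any(abs(pa - pb) <= k
               for pa in index[term_a][doc_id]
               for pb in index[term_b][doc_id]):
            hits.append(doc_id)
    return hits

index = build_positional_index(docs)
print(dict(index["caesar"]))
print(near(index, "noble", "brutus", 1))
```

The same position lists can drive a user interface that highlights each occurrence of the search term.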
The record kept for each term j in the word list contains the
following:
term j
number of documents in which term j occurs (DF_j)
collection frequency of term j
pointer to the inverted (postings) list for term j
Postings File (Inverted List)
For each distinct term in the vocabulary, it stores a list of
pointers to the documents that contain that term.
Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document
A separate inverted list is stored for each entry, i.e.,
one list for each term in the index file.
Each list consists of one or many individual postings
[Figure: four vocabulary rows (term 1: 3 3; term 2: 3 4; term 3: 1 1; term 4: 2 3), each pointing to its own inverted list.]
Example:
Given a collection of documents, they are parsed to
extract words and these are saved with the Document
ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
After all documents have been tokenized, the inverted file is sorted by terms.

In order of occurrence      Sorted by term
Term       Doc #            Term       Doc #
I          1                ambitious  2
did        1                be         2
enact      1                brutus     1
julius     1                brutus     2
caesar     1                caesar     1
I          1                caesar     2
was        1                caesar     2
killed     1                capitol    1
I          1                did        1
the        1                enact      1
capitol    1                hath       2
brutus     1                I          1
killed     1                I          1
me         1                I          1
so         2                it         2
let        2                julius     1
it         2                killed     1
be         2                killed     1
with       2                let        2
caesar     2                me         1
the        2                noble      2
noble      2                so         2
brutus     2                the        1
hath       2                the        2
told       2                told       2
you        2                was        1
caesar     2                was        2
was        2                with       2
ambitious  2                you        2
Remove stop words, apply stemming & compute
frequency
Multiple entries of a term in a single document are merged and
frequency information is added.
Counting the number of occurrences of terms in the collection helps to
compute TF.

Before merging:
Term      Doc #
ambition  2
brutus    1
brutus    2
caesar    1
caesar    2
caesar    2
capitol   1
enact     1
julius    1
kill      1
kill      1
noble     2

After merging:
Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
caesar    1      1
caesar    2      2
capitol   1      1
enact     1      1
julius    1      1
kill      1      2
noble     2      1
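The merge step amounts to counting identical (term, document) pairs. A minimal sketch using `collections.Counter` from the standard library, with the twelve pairs of the example as input:

```python
from collections import Counter

# (term, document number) pairs after stop word removal and stemming.
pairs = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]

# Counting identical pairs merges the duplicates; the count is the TF.
tf = Counter(pairs)
merged = sorted((term, doc, count) for (term, doc), count in tf.items())

for term, doc, count in merged:
    print(term, doc, count)
```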
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) file and a postings file; each vocabulary entry holds a pointer to its postings.

Index file:
Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
caesar    1      1
caesar    2      2
capitol   1      1
enact     1      1
julius    1      1
kill      1      2
noble     2      1

Vocabulary, with a pointer from each term to its postings:
Term      DF  CF    Postings (Doc #, TF)
ambition  1   1     (2, 1)
brutus    2   2     (1, 1), (2, 1)
caesar    2   3     (1, 1), (2, 2)
capitol   1   1     (1, 1)
enact     1   1     (1, 1)
julius    1   1     (1, 1)
kill      1   2     (1, 2)
noble     1   1     (2, 1)
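The split can be sketched as one pass over the merged index file. The layout chosen here, a vocabulary row of (term, DF, CF, pointer) with the pointer being an offset into one flat postings list, is an assumption for illustration; real systems differ in how they store pointers.

```python
# (term, document number, TF) entries of the merged index file, sorted by term.
entries = [
    ("ambition", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1),
    ("caesar", 1, 1), ("caesar", 2, 2), ("capitol", 1, 1),
    ("enact", 1, 1), ("julius", 1, 1), ("kill", 1, 2), ("noble", 2, 1),
]

vocabulary = []   # rows of (term, DF, CF, pointer into postings)
postings = []     # flat list of (document number, TF)

for term, doc, tf in entries:
    if vocabulary and vocabulary[-1][0] == term:
        # Same term again: update DF and CF, keep the original pointer.
        _, df, cf, pointer = vocabulary[-1]
        vocabulary[-1] = (term, df + 1, cf + tf, pointer)
    else:
        vocabulary.append((term, 1, tf, len(postings)))
    postings.append((doc, tf))

for row in vocabulary:
    print(row)
```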
Searching on Inverted File
Since the index file is divided into two parts, searching
can be done faster by loading only the vocabulary list, which takes
little memory even for a large document collection
Using binary search, searching takes logarithmic time
The search is carried out on the vocabulary list
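A lookup then has two steps: binary search on the in-memory vocabulary, followed by one jump into the postings. The sketch below assumes a vocabulary of (term, DF, pointer) rows and a flat postings list, both filled with the values of the running example.

```python
from bisect import bisect_left

# Sorted vocabulary kept in memory: (term, DF, pointer into postings).
vocabulary = [
    ("ambition", 1, 0), ("brutus", 2, 1), ("caesar", 2, 3), ("capitol", 1, 5),
    ("enact", 1, 6), ("julius", 1, 7), ("kill", 1, 8), ("noble", 1, 9),
]
# Postings as (document number, TF); typically kept on disk.
postings = [(2, 1), (1, 1), (2, 1), (1, 1), (2, 2),
            (1, 1), (1, 1), (1, 1), (1, 2), (2, 1)]
terms = [term for term, _, _ in vocabulary]

def search(term):
    """Binary-search the vocabulary, then follow the pointer to the postings."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []  # term is not in the vocabulary
    _, df, pointer = vocabulary[i]
    return postings[pointer:pointer + df]

print(search("caesar"))
print(search("cicero"))
```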
If txt = t_1 t_2 … t_i … t_n is a string, then T_i = t_i t_{i+1} … t_n is the suffix of txt that starts at
position i, where 1 ≤ i ≤ n
Example: txt = mississippi
T1 = mississippi;
T2 = ississippi;
T3 = ssissippi;
T4 = sissippi;
T5 = issippi;
T6 = ssippi;
T7 = sippi;
T8 = ippi;
T9 = ppi;
T10 = pi;
T11 = i;
Example: txt = GOOGOL
T1 = GOOGOL;
T2 = OOGOL;
T3 = OGOL;
T4 = GOL;
T5 = OL;
T6 = L;
Exercise: generate the suffixes of "technology".
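The definition of T_i translates into one line of string slicing. The function name `suffixes` is illustrative; positions are counted from 1 to match the notation above.

```python
def suffixes(txt):
    """Return the pairs (i, T_i) for every position i, counting from 1."""
    return [(i + 1, txt[i:]) for i in range(len(txt))]

for i, suffix in suffixes("GOOGOL"):
    print(f"T{i} = {suffix}")
```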
Suffix trie
•A suffix trie is an ordinary trie in which the input
strings are all possible suffixes of the text.
–Principle: the idea behind the suffix trie is to assign to each
symbol in a text an index corresponding to its position in the
text (i.e., the first symbol has index 1 and the last symbol has index n,
the number of symbols in the text).
•To build the suffix trie we use these indices instead of the
actual object.
•The structure has several advantages:
–It requires less storage space.
–We do not have to worry about how the text is represented (binary,
ASCII, etc.).
–We do not have to store the same object twice (no
duplicates).
Suffix Trie
•Construct the suffix trie for the following string: GOOGOL
•We begin by numbering every suffix by its starting position in the text,
from left to right, following the order of the characters in the string.
• TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
•Build a suffix trie for all n suffixes of the text.
•Note: the resulting trie has n leaves and height n.
•This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
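The construction can be sketched with nested dictionaries: each node maps a character to a child node, and each leaf stores the starting position of its suffix. The reserved key "start" and the helper names are assumptions of this sketch, and the text is assumed to end in a unique terminator such as $.

```python
def build_suffix_trie(text):
    """Insert every suffix of text, one character per edge."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start + 1   # 1-based position, as in the slides
    return root

def count_leaves(node):
    """A leaf is a node that carries a start position."""
    if "start" in node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def height(node):
    """Number of edges on the longest path from this node down to a leaf."""
    children = [c for key, c in node.items() if key != "start"]
    return 1 + max(map(height, children)) if children else 0

trie = build_suffix_trie("GOOGOL$")
print(count_leaves(trie), height(trie))
```

For GOOGOL$ this gives 7 leaves and height 7, matching the note above with n = 7. The simple insertion loop takes time quadratic in the text length, which is acceptable only for short texts.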
Suffix tree
A suffix tree is an extension of the
suffix trie: a compact trie of
all the suffixes of S
The suffix tree is created by
compacting unary nodes of the
suffix trie.
We store pointers rather than
words in the leaves.
It is also possible to replace
the string on every edge by a pair
(a,b), where a & b are the
beginning and end index of the
string in the text, e.g. for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
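The index pairs can be checked with a short helper that recovers an edge string from its (a,b) pair. `edge_label` is an illustrative name; positions are 1-based and inclusive, as in the slides.

```python
text = "GOOGOL$"

def edge_label(text, a, b):
    """Recover the edge string from a 1-based, inclusive (a, b) pair."""
    return text[a - 1:b]

for a, b in [(3, 7), (1, 2), (7, 7)]:
    print((a, b), edge_label(text, a, b))
```

Storing two integers per edge instead of the substring keeps the space per edge constant, however long the label is.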
Example: Suffix tree
•Let s=abab; a suffix tree of s is a
compressed trie of all suffixes of s=abab$
•We label each leaf with the
starting point of the
corresponding suffix.
•Suffixes: { $, b$, ab$, bab$, abab$ }
•Tree (edge labels, with the leaf number after the arrow):
root
  ab
    ab$  → 1
    $    → 3
  b
    ab$  → 2
    $    → 4
  $      → 5
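The compaction of unary nodes can be sketched by first building the suffix trie and then merging every chain of single-child nodes into one edge. This is a simple illustration under the assumption that the text ends in a unique terminator; it is not one of the linear-time suffix tree construction algorithms.

```python
def build_suffix_trie(text):
    """Suffix trie as nested dicts; leaves store the 1-based start position."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root

def compact(node):
    """Merge chains of single-child nodes into one edge with a string label."""
    if "leaf" in node:
        return node["leaf"]
    tree = {}
    for ch, child in node.items():
        label = ch
        while "leaf" not in child and len(child) == 1:
            (next_ch, child), = child.items()   # step through the unary node
            label += next_ch
        tree[label] = compact(child)
    return tree

print(compact(build_suffix_trie("abab$")))
```

For abab$ the result has the edges ab, b and $ at the root, with leaves 1 and 3 under ab, 2 and 4 under b, and 5 under $.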
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a
compressed trie of all suffixes of every s ∈ S
• To make suffixes prefix-free we add a special char, $, at the end of
s. To associate each suffix with a unique string in S, add a different
special symbol to each s
• Build a suffix tree for the string s1$s2#, where `$' and `#'
are special terminators for s1 and s2.
• Ex.: Let s1=abab & s2=aab; a generalized suffix tree for s1 & s2 is built from the suffixes
{ $, b$, ab$, bab$, abab$ } of s1$ and { #, b#, ab#, aab# } of s2#:
root
  a
    b
      ab$  → 1 (s1)
      $    → 3 (s1)
      #    → 2 (s2)
    ab#    → 1 (s2)
  b
    ab$    → 2 (s1)
    $      → 4 (s1)
    #      → 3 (s2)
  $        → 5 (s1)
  #        → 4 (s2)
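The set of suffixes that goes into a generalized suffix tree can be listed as follows. The function name, the default terminator string "$#", and the (suffix, string number, start position) tuple layout are assumptions of this sketch.

```python
def generalized_suffixes(strings, terminators="$#"):
    """List (suffix, string number, start position) for every string,
    each string being closed by its own terminator symbol."""
    result = []
    for number, (s, end) in enumerate(zip(strings, terminators), start=1):
        marked = s + end
        for i in range(len(marked)):
            result.append((marked[i:], number, i + 1))
    return result

for suffix, number, start in generalized_suffixes(["abab", "aab"]):
    print(f"s{number} position {start}: {suffix}")
```

The distinct terminators are what keep "ab$" from s1 and "ab#" from s2 apart, so every leaf can be traced back to exactly one string.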
Search in suffix tree
Searching for all instances of a substring S in a suffix tree is easy,
since any substring of the text is a prefix of some suffix.
Pseudo-code for searching in a suffix tree:
Start at the root
Go down the tree, each time taking the path that matches the next symbols of S
If S ends at a node x (or on the edge leading to x), return all leaves in the sub-tree rooted at x:
the places where S can be found are given by the pointers in all the leaves in
that subtree.
If a NIL pointer is encountered before reaching the end of S, then S is
not in the tree
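The pseudo-code above can be sketched in Python. For brevity the sketch searches an uncompressed suffix trie built from nested dictionaries rather than a compacted suffix tree; the traversal logic is the same, and a missing child plays the role of the NIL pointer. All function names are illustrative.

```python
def build_suffix_trie(text):
    """Suffix trie as nested dicts; leaves store the 1-based start position."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root

def collect_leaves(node):
    """Gather the start positions stored in all leaves below node."""
    if "leaf" in node:
        return [node["leaf"]]
    found = []
    for child in node.values():
        found.extend(collect_leaves(child))
    return found

def find(root, pattern):
    """Follow pattern from the root and return where it starts in the text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # NIL pointer: pattern does not occur
        node = node[ch]
    return sorted(collect_leaves(node))

trie = build_suffix_trie("GOOGOL$")
print(find(trie, "GO"))
print(find(trie, "OR"))
```

The returned positions identify the matching suffixes, so a result of positions 1 and 4 stands for the suffixes GOOGOL$ and GOL$.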
Example (text GOOGOL$):
If S = "GO" we take the GO path and return:
GOOGOL$, GOL$.
If S = "OR" we take the O path and then we hit a NIL pointer, so
"OR" is not in the tree.
Exercise
Given the following index terms:
worker, word, world, run & information
construct an index file using a suffix tree.
Suffix Tree Applications
Suffix trees can be used to solve a large number of string
problems that occur in:
text-editing,
free-text search,
etc.