IR Chap3
Abdo A.
2019/20
Outline
Major Steps in Index Construction
Index file Evaluation Metrics
Building Index file
Sequential File
Inverted file
Suffix tree
Suffix Trie
March 8, 2020 2
Indexing: Basic Concepts
Indexing is the arrangement of index terms to permit fast searching while reducing the memory space requirement.
It is used to speed up access to the desired information in the documents.
Indexing: Basic Concepts
An index file consists of records, called index entries.
Index files are much smaller than the original file.
Remember Heaps' Law: in a 1 GB text collection the …
Major Steps in Index Construction
Source file: a collection of text documents
A document can be described by a set of representative
Major Steps in Index Construction …
Word stemming: reduce the morphological variants of a word to their common stem/root word
Term relevance weight: Different index terms have
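Figure: the Tokenizer turns the text into a token stream (Friends, Romans, countrymen); the normalized terms are then stored in the inverted file with the documents in which they occur:
friend: 2, 4
roman: 1, 2
countryman: 13, 16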
Index file Evaluation Metrics
Running time of the main operations
Access/search time
How long does it take to find the required search key in the list?
Update time (insertion time, deletion time)
How long does it take to update existing records in order to add new terms or delete unnecessary ones?
Does the indexing structure allow incremental update, or does it require re-indexing?
Space overhead
Computer storage space consumed for keeping the list.
Building Index file
An index file of a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain that term.
An index file is a list of search terms organized for associative look-up, i.e., to answer the user's query:
In which documents does a specified search term appear?
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.
Doc 1: Negative affect can make it harder to do even easy tasks.
… so make it easy
Sequential File
Sequential File …
order.
It can be searched quickly using binary search, in O(log n) time.
Its disadvantages:
Inverted Files
Figure: each word of the original documents is listed with the IDs of the documents that contain it.
•W1: d1, d2, d3
•W2: d2, d4, d7, d9
•…
•Wn: di, …, dn
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a document, then (given positive term weights) q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all the documents dj that contain at least one term in the query.
2. To establish the set R, merge the inverted lists for each term ti in the query with a logical OR.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.
4. Return the elements of R in ranked order.
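The four steps can be sketched in Python as follows. This is only an illustrative sketch: the toy collection, the unit term weights and the choice of cosine similarity as Similarity(q, dj) are assumptions, not something the slides prescribe.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy collection: document id -> {term: weight}.
docs = {
    "d1": {"computer": 1.0, "science": 1.0},
    "d2": {"computer": 1.0, "studies": 1.0},
    "d3": {"staff": 1.0, "fields": 1.0},
}

# Inverted lists: term -> {doc_id: weight}.
inverted = defaultdict(dict)
for doc_id, vector in docs.items():
    for term, weight in vector.items():
        inverted[term][doc_id] = weight


def norm(vector):
    """Euclidean length of a {term: weight} vector."""
    return sqrt(sum(w * w for w in vector.values()))


def rank(query):
    """Return (doc_id, cosine similarity) pairs for documents sharing a query term."""
    # Steps 1-2: OR-merge the inverted lists of the query terms to obtain R,
    # accumulating the dot product q . dj on the way.
    dot = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in inverted.get(term, {}).items():
            dot[doc_id] += q_weight * d_weight
    # Step 3: turn each dot product into a cosine similarity (assumed measure).
    q_norm = norm(query)
    scores = {d: s / (q_norm * norm(docs[d])) for d, s in dot.items()}
    # Step 4: return R in ranked order (ties broken by document id).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank({"computer": 1.0, "science": 1.0})
```

Documents that share no term with the query (d3 here) never enter R, so their zero similarity is never computed.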
Inverted file
Data to be held in the inverted file includes
The vocabulary (List of terms):
collection.
having information about vocabulary (list of terms) speeds
Enhancements to Inverted Files: Concept
Location: Each posting holds information about the location of
each term within the document.
Uses
user interface design -- highlight location of search term
adjacency and near operators (in Boolean searching)
Frequency: Each inverted list includes the number of postings
for each term.
Uses
term weighting
query processing optimization
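A minimal sketch of both enhancements, assuming a hypothetical two-document collection that is already tokenised: each posting keeps the 1-based locations of the term, which is enough to answer an adjacency query, and the length of an inverted list gives the number of postings.

```python
from collections import defaultdict

# Hypothetical collection, already tokenised.
docs = {
    1: ["computer", "science", "department"],
    2: ["science", "of", "computer", "design"],
}

# Positional index: term -> {doc_id: [1-based locations]}.
index = defaultdict(lambda: defaultdict(list))
for doc_id, tokens in docs.items():
    for pos, term in enumerate(tokens, start=1):
        index[term][doc_id].append(pos)


def frequency(term):
    """Number of postings (documents) in the inverted list of `term`."""
    return len(index.get(term, {}))


def adjacent(first, second):
    """Ids of documents in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)
```

For example, `adjacent("computer", "science")` matches only document 1, although both documents contain both terms.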
Inverted file
Having information about the location of each term within
the document helps for:
Inverted File
Documents are organized by the terms/words they contain
Term    CF  Doc ID  TF  Location
term 1  3   2       1   66
            19      1   213
            29      1   45
term 2  4   3       1   94
            19      2   7, 212
            22      1   56
term 3  1   5       1   43
term 4  3   11      2   3, 70
            34      1   40
CF: total frequency of term tj in the corpus.
This is called an index file.
Text operations are performed before building the index.
Is it possible to keep all this information during searching?
Construction of Inverted file
An inverted index consists of two files: a vocabulary file and a posting file.
A vocabulary file (Word list):
Organization of Index File
Figure: the vocabulary (word list) points to the postings (inverted lists), which in turn point to the documents.
Term    DF  CF  Pointer to posting
term 1  3   3   → inverted list
term 2  3   4   → inverted list
term 3  1   1   → inverted list
term 4  2   3   → inverted list
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.
Doc 1: Negative affect can make it harder to do even easy tasks.
… so make it easy
Remove stop words and compute frequency
Multiple term entries in a single document are merged and frequency information is added.
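A minimal sort-and-merge sketch of this step. The stop list and the text of the second document are hypothetical; they are chosen only so that some terms repeat within one document.

```python
from itertools import groupby

# Assumed stop list (hypothetical, for illustration only).
STOP_WORDS = {"can", "it", "to", "even", "so"}

docs = {
    1: "negative affect can make it harder to do even easy tasks",
    2: "make it easy so make it easy",  # hypothetical text with repeated terms
}

# Parse: one (term, doc_id) entry per remaining word occurrence, then sort.
pairs = sorted(
    (term, doc_id)
    for doc_id, text in docs.items()
    for term in text.split()
    if term not in STOP_WORDS
)

# Merge: identical (term, doc_id) entries collapse into one entry with its TF.
merged = [
    (term, doc_id, len(list(group)))
    for (term, doc_id), group in groupby(pairs)
]
```

After the merge, `make` and `easy` each have a single entry for document 2 with a frequency of 2 instead of two separate entries.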
Stemming and compute frequency
Multiple term entries in a single document are merged and frequency information is added.
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file.
Vocabulary (each entry holds a pointer into the posting file):
Term       DF  CF
affect     2   2
difficult  1   1
do         2   2
easy       2   3
hard       1   1
make       2   3
negative   1   1
positive   1   1
task       2   2
Posting file: one (Doc #, TF) entry for every document in which the term occurs.
Searching on Inverted File
Since the whole index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes less memory even for a large document collection.
The search is done in the vocabulary list.
Using binary search, the search takes logarithmic time.
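A minimal sketch of the vocabulary look-up using Python's standard `bisect` module. The terms and DF values follow the two-document example; the layout in which posting lists are stored back to back, so that each pointer is a running total of the preceding DF values, is an assumption made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

# Sorted in-memory vocabulary with the DF of each term.
terms = ["affect", "difficult", "do", "easy", "hard",
         "make", "negative", "positive", "task"]
df = [2, 1, 2, 2, 1, 2, 1, 1, 2]

# Assumed layout: posting lists stored back to back in the posting file,
# so the pointer of a term is the sum of the DF values before it.
pointers = [0] + list(accumulate(df))[:-1]


def lookup(term):
    """Binary-search the vocabulary; return (pointer, DF) or None if absent."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return pointers[i], df[i]
    return None
```

Only the (pointer, DF) pair is needed to read the right slice of the posting file, so the posting file itself can stay on disk until a term is actually found.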
Example: Create Inverted file
Create an inverted file (both the vocabulary list and
the posting file) for the following document
collection
D1 The Department of Computer Science was established
in 1984.
D2 The Department launched its first BSc in Computer
Studies in 1987.
D3 Followed by the MSc in Computer Science which was
started in 1991.
D4 The Department also produced its first PhD graduate in
1994.
D5 Our staff have contributed intellectually and
professionally to the advancements in these fields.
Example: Create Inverted file …
After the text operations are performed, only the index terms of each document remain:
D1= department comput science establish
D2= department launch bsc comput study
D3= follow msc comput science start
D4= department produce phd graduat
D5= staff contribut intellect profession advance field
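The vocabulary list and the posting file for D1 to D5 can be derived mechanically from these term lists. The sketch below is one possible way to do it; the in-memory layout (a dict of posting lists and a sorted list of vocabulary tuples) is an assumption, not the slides' file format.

```python
from collections import defaultdict

# Documents after the text operations, as listed above.
docs = {
    1: "department comput science establish",
    2: "department launch bsc comput study",
    3: "follow msc comput science start",
    4: "department produce phd graduat",
    5: "staff contribut intellect profession advance field",
}

# Posting file: term -> list of (doc_id, TF, [1-based locations]).
postings = defaultdict(list)
for doc_id, text in docs.items():
    locations = defaultdict(list)
    for pos, term in enumerate(text.split(), start=1):
        locations[term].append(pos)
    for term, locs in locations.items():
        postings[term].append((doc_id, len(locs), locs))

# Vocabulary list: sorted (term, DF, CF) entries.
vocabulary = [
    (term, len(plist), sum(tf for _, tf, _ in plist))
    for term, plist in sorted(postings.items())
]
```

DF is the length of a term's posting list and CF is the sum of its TF values, so both columns of the vocabulary come for free once the postings are built.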
Vocabulary (each entry holds a pointer to its posting list):
Word        DF  CF  WID
advance     1   1   W1
bsc         1   1   W2
comput      3   3   W3
contribut   1   1   W4
department  3   3   W5
establish   1   1   W6
field       1   1   W7
follow      1   1
graduat     1   1
intellect   1   1
launch      1   1
msc         1   1
phd         1   1
produce     1   1
profession  1   1
science     2   2
staff       1   1
start       1   1
study       1   1
Posting file (all term-specific information, such as tf, max tf, tf-idf and location, is stored in the posting):
Word        Doc#  TF  mTF  Loc
advance     5     1   1    5
bsc         2     1   1    3
comput      1     1   1    2
            2     1   1    4
            3     1   1    3
contribut   5     1   1    2
department  1     1   1    1
            2     1   1    1
            4     1   1    1
… (continues for the remaining terms)
The postings point to the document file:
•W1: d5
•W2: d2
•W3: d1, d2, d3
•Wn: di, …, dn
Suffix trie
•A suffix trie is an ordinary trie in which the input strings are all possible suffixes of the text.
–Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
• To build the suffix trie we use these indices instead of the actual objects.
•The structure has several advantages:
• It requires less storage space.
• We do not have to worry about how the text is represented (binary, ASCII, etc.).
• We do not have to store the same object twice (no duplicates).
Suffix Trie
•Construct a suffix trie for the following string: GOOGOL
•We begin by assigning a position to every symbol of the text, from left to right in order of occurrence; each suffix is identified by its starting position.
• TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
•Build a SUFFIX TRIE for all n suffixes of the text.
•Note: The resulting tree has n leaves and height n.
•This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
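A naive construction can be sketched as follows: every suffix of the terminated text is inserted symbol by symbol, and the leaf at the end of each suffix stores its 1-based starting position. The `Node` class and the helper names are hypothetical, introduced only for this illustration.

```python
class Node:
    """One trie node; `position` is set only on the leaf that ends a suffix."""

    def __init__(self):
        self.children = {}    # symbol -> Node
        self.position = None  # 1-based start of the suffix ending here


def build_suffix_trie(text):
    """Insert every suffix of `text` (which must end with a unique terminator)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.children.setdefault(symbol, Node())
        node.position = start + 1
    return root


def leaves(node):
    """Sorted starting positions stored in the leaves below `node`."""
    if not node.children:
        return [node.position]
    return sorted(p for child in node.children.values() for p in leaves(child))


def height(node):
    """Number of edges on the longest root-to-leaf path."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children.values())


trie = build_suffix_trie("GOOGOL$")
```

For GOOGOL$ (n = 7 symbols including the terminator) this yields 7 leaves, one per suffix, and a height of 7, in line with the note above.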
Suffix tree
A suffix tree is an extension of the suffix trie: a compacted trie of all the suffixes of S.
The suffix tree is created by compacting the unary nodes of the suffix trie.
We store pointers rather than words in the leaves.
It is also possible to replace the string on every edge by a pair (a,b), where a and b are the beginning and end index of the string in the text, e.g., for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
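As a small illustration, an edge that stores only the pair (a,b) can recover its label from the text on demand. The helper name `label` is hypothetical; the indices are 1-based and inclusive, as in the pairs above.

```python
TEXT = "GOOGOL$"


def label(a, b):
    """Edge label for the 1-based, inclusive index pair (a, b) into TEXT."""
    return TEXT[a - 1:b]
```

Each edge then costs two integers regardless of how long its label is, which is where the space saving of the compacted tree comes from.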
Example: Suffix tree
•Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$.
•We label each leaf with the starting point of the corresponding suffix.
•Suffixes: { $, b$, ab$, bab$, abab$ }
•Tree (edge label → leaf label):
    root
      $ → 5
      ab
        $ → 3
        ab$ → 1
      b
        $ → 4
        ab$ → 2
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
•To make suffixes prefix-free we add a special char, $, at the end of s.
•To associate each suffix with a unique string in S, add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where `$' and `#' are special terminators for s1 and s2.
•Ex.: Let s1 = abab and s2 = aab. The suffixes are:
    s1: 1. abab$   2. bab$   3. ab$   4. b$   5. $
    s2: 1. aab#    2. ab#    3. b#    4. #
•A generalized suffix tree for s1 and s2 is (edge label → string: starting point):
    root
      $ → s1: 5
      # → s2: 4
      a
        ab# → s2: 1
        b
          $ → s1: 3
          # → s2: 2
          ab$ → s1: 1
      b
        $ → s1: 4
        # → s2: 3
        ab$ → s1: 2
Search in suffix tree
Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is a prefix of some suffix.
Pseudo-code for searching in suffix tree:
Start at root
Search in suffix tree
Example:
1. Find GO
2. Find OR
If S = "GO" we take the GO path and return: GOOGOL$, GOL$.
If S = "OR" we take the O path and then we hit a NIL pointer, so "OR" is not in the tree.
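The two look-ups can be reproduced with a small sketch. For brevity it searches an uncompacted suffix trie stored as nested dictionaries rather than a compacted suffix tree; the function names and the use of `None` as the leaf key are assumptions made for this illustration.

```python
TEXT = "GOOGOL$"


def build_trie(text):
    """Nested-dict suffix trie; the key None marks a leaf and holds the
    1-based starting position of the suffix that ends there."""
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.setdefault(symbol, {})
        node[None] = start + 1
    return root


TRIE = build_trie(TEXT)


def find(pattern):
    """Return every suffix of TEXT that starts with `pattern`."""
    node = TRIE
    for symbol in pattern:
        if symbol not in node:  # NIL pointer: the pattern does not occur
            return []
        node = node[symbol]
    # Collect the starting positions stored in all leaves below this node.
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                starts.append(child)
            else:
                stack.append(child)
    return [TEXT[s - 1:] for s in sorted(starts)]
```

Following the path costs one step per pattern symbol; collecting the leaves afterwards is what reports all occurrences.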
Exercise
Suffix Tree Applications
Suffix Tree can be used to solve a large number of string
problems that occur in:
text-editing,
free-text search,
etc.
String matching
Palindromes
etc.
Complexity Analysis
The suffix tree for a string of length n can be built naively in O(n²) time.
Searching is very fast: the search time is linear in the length of the pattern string S.
The number of leaves is n+1, where n is the length of the input string (one leaf per suffix, including the terminator $).
Furthermore, in the leaves, we may store either the strings
Thank you