
IR Chap3

The document discusses different indexing structures for organizing terms extracted from documents. It describes sequential files, which list terms alphabetically with their associated documents but lack weights or linking. Inverted files list terms with pointers to their corresponding documents, allowing faster retrieval of relevant documents for a query. The document also mentions suffix trees and tries, which support additional applications like string matching. Overall, the document provides an overview of common indexing structures and their basic functionality.

Uploaded by

biniam teshome
Copyright
© © All Rights Reserved

Indexing structure

Abdo A.
2019/20

Outline
• Major Steps in Index Construction
• Index file Evaluation Metrics
• Building Index file
• Sequential File
• Inverted file
• Suffix tree
• Suffix Trie
• Suffix Tree Applications

March 8, 2020 2
Indexing: Basic Concepts
• Indexing is an arrangement of index terms that permits fast searching and reduces memory space requirements.
• It is used to speed up access to the desired information in a document collection in response to a user's query.
• It improves retrieval efficiency: relevant documents are searched for and retrieved quickly.
• An index file usually holds its index terms in sorted order.
• Which list is easier to search?
   fox pig zebra hen ant cat dog lion ox
   ant cat dog fox hen lion ox pig zebra

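The difference between the two lists can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the comparison count for the linear scan is an assumed measure of cost.

```python
from bisect import bisect_left

unsorted_terms = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
sorted_terms = sorted(unsorted_terms)  # ant cat dog fox hen lion ox pig zebra

def linear_search(terms, key):
    """Scan entries one by one; return (position, number of comparisons)."""
    for comparisons, term in enumerate(terms, start=1):
        if term == key:
            return comparisons - 1, comparisons
    return -1, len(terms)

def binary_search(terms, key):
    """Locate key in a sorted list in O(log n) steps; return its position or -1."""
    i = bisect_left(terms, key)
    return i if i < len(terms) and terms[i] == key else -1

print(linear_search(unsorted_terms, "ox"))  # found only after scanning the whole list
print(binary_search(sorted_terms, "ox"))    # found by repeated halving
```

The unsorted list forces a full scan in the worst case, while the sorted list allows the search range to be halved at every step.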
Indexing: Basic Concepts
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
   – Recall Heaps' Law: for a 1 GB text collection the vocabulary has a size of only about 5 MB. This size may be reduced further by linguistic pre-processing (text operations).
• The usual unit for indexing is the word.
• Index terms are used to look up records in a file.

Major Steps in Index Construction
• Source file: a collection of text documents
   – A document can be described by a set of representative keywords called index terms.
• Index term selection: apply text operations (preprocessing)
   – Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes.
   – Stop word removal: very frequent words carry little content and need to be removed from the text collection.

Major Steps in Index Construction …
   – Word stemming: reduce the variant forms of a word to their common stem/root.
   – Term relevance weight: different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document. Term weighting methods include TF, IDF, TF*IDF, …
• Indexing structure: the set of index terms (vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.

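The term selection steps above can be sketched as a small pipeline. Everything here is an assumption made for illustration: the stop word list is a toy list, and `naive_stem` is a crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter

# Assumed toy stop word list; a real system would use a much longer one.
STOP_WORDS = {"the", "of", "in", "was", "its", "by", "which", "to", "and", "a", "an"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(word):
    """Toy suffix stripper (assumption), standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def term_frequencies(text):
    """Raw TF weight: how often each index term occurs in the text."""
    return Counter(index_terms(text))

print(index_terms("Friends, Romans, countrymen."))
print(term_frequencies("The cats and the cat"))
```

A plain suffix stripper leaves "countrymen" unchanged; mapping it to "countryman" requires dictionary-based normalization, which this sketch does not attempt.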
Basic Indexing Process
Documents to be indexed:   Friends, Romans, countrymen.
   (Tokenizer)
Token stream:              Friends  Romans  countrymen
   (Linguistic preprocessor)
Modified tokens:           friend  roman  countryman
   (Indexer)
Index file (inverted file):
   friend        2   4
   roman         1   2
   countryman    13  16

Index file Evaluation Metrics
• Running time of the main operations
   – Access/search time
      How long does it take to find the required search key in the list?
   – Update time (insertion time, deletion time)
      How long does it take to update existing records when adding new terms or deleting unnecessary ones?
      Does the indexing structure allow incremental update, or does it require re-indexing?
• Space overhead
   – Computer storage space consumed for keeping the list.

Building Index file
• An index file of a document collection is a file consisting of a list of index terms, each with a link to the one or more documents that contain it.
• An index file is a list of search terms organized for associative look-up, i.e., to answer user queries such as:
   – In which documents does a specified search term appear?
   – Where within each document does each term appear? (There may be several occurrences.)
• For organizing the index file of a document collection, various options are available:
   – Decide which data structure and/or file structure to use: sequential file, inverted file, suffix tree, etc.

Sequential File
• A sequential file is the most primitive file structure.
• It has no vocabulary and no linking pointers.
• The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field, i.e.:
   – a particular attribute is chosen as the primary key, whose value determines the order of the records;
   – when the primary key fails to discriminate among records, a second key is chosen to order them.

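Ordering by a primary and a secondary key can be sketched as follows. The records are assumed to be (term, document ID) pairs, with the term as primary key and the document ID as secondary key; the sample values are illustrative.

```python
# Assumed record layout: (term, doc_id)
records = [("make", 2), ("affect", 2), ("easy", 1),
           ("affect", 1), ("make", 1), ("easy", 2)]

# Primary key: the term (lexicographic order).
# Secondary key: the document ID, used when two records share the same term.
sequential_file = sorted(records, key=lambda rec: (rec[0], rec[1]))

for term, doc_id in sequential_file:
    print(term, doc_id)
```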
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
   Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
   Doc 2: positive affect can make it easier to do difficult tasks

Sorting the Vocabulary
Sequential file
• After all documents have been tokenized, stop words are removed and normalization and stemming are applied to generate the index terms.
• The index terms in the sequential file are sorted in alphabetical order.

Sequential File
• To access records, search serially:
   – starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
• Update options: does the index need to be rebuilt, or is incremental update supported?

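A serial search over such a file might look like the sketch below. The (term, doc_id) record layout is an assumption, and stopping early once the scan has passed the term relies on the records being sorted, which goes slightly beyond the plain "read until found or end of file" rule.

```python
def serial_search(records, term):
    """Read records from the first one onward.

    Returns (matching doc ids, number of records read).
    """
    matches, reads = [], 0
    for rec_term, doc_id in records:
        reads += 1
        if rec_term == term:
            matches.append(doc_id)
        elif rec_term > term:
            break  # sorted file: no later record can match
    return matches, reads

# Assumed sample file, already sorted on (term, doc_id)
records = [("affect", 1), ("affect", 2), ("easy", 1),
           ("easy", 2), ("make", 1), ("make", 2)]

print(serial_search(records, "easy"))
print(serial_search(records, "task"))  # reaches the end of the file
```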
Sequential File …
• Its main advantages:
   – easy to implement;
   – provides fast access to the next record in lexicographic order;
   – can be searched quickly using binary search, O(log n), when the sorted records can be addressed directly.
• Its disadvantages:
   – No weights are attached to terms.
   – Random access is slow: since similar terms are indexed individually, we need to find all terms that match the query.

Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
• Building and maintaining an inverted index is a relatively low-cost task. On a text of n words, an inverted index can be built in O(n) time.
• The list is inverted from a list of terms in location order to a list of terms in alphabetical order.
[Figure: Inverted Files. Word extraction turns the original documents into a mapping from word IDs to document IDs]
   W1: d1, d2, d3
   W2: d2, d4, d7, d9
   …
   Wn: di, …, dn

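A minimal sketch of the inversion, assuming the documents have already been reduced to lists of index terms (the sample terms are illustrative). The main loop does one step per word occurrence, which is where the O(n) build time comes from; sorting the vocabulary at the end is an extra step on the much smaller set of distinct terms.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: list of index terms} -> {term: sorted list of doc ids}."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:              # one step per word occurrence
            postings[term].add(doc_id)
    # Alphabetical order of terms, as in the inverted list described above
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Assumed sample input: two documents after text operations
docs = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```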
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a document, then q and dj have no terms in common iff q·dj = 0.
1. To calculate all the non-zero similarities, find R, the set of all documents dj that contain at least one term of the query.
2. To establish R, merge the inverted lists for each term ti in the query with a logical OR.
3. For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.
4. Return the elements of R in ranked order.

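The four steps can be sketched as below. The weights are made-up illustrative values, and the similarity used is the plain inner product q·dj; a cosine measure would additionally divide by the vector lengths.

```python
from collections import defaultdict

# term -> {doc_id: weight}; the weights are invented for illustration only
inverted = {
    "comput":     {1: 0.5, 2: 0.4, 3: 0.5},
    "science":    {1: 0.8, 3: 0.7},
    "department": {1: 0.3, 2: 0.3, 4: 0.6},
}

def rank(query_weights, inverted):
    """Return (doc_id, score) pairs for every document in R, best first."""
    scores = defaultdict(float)
    for term, wq in query_weights.items():
        # Walking each query term's inverted list is the OR-merge that builds R
        for doc_id, wd in inverted.get(term, {}).items():
            scores[doc_id] += wq * wd   # accumulate the inner product q . dj
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = rank({"comput": 1.0, "science": 1.0}, inverted)
print(ranking)
```

Documents that share no term with the query (document 4 here) never enter R, so no similarity is computed for them.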
Inverted file
Data held in the inverted file includes:
• The vocabulary (list of terms):
   – the set of all distinct words (index terms) in the text collection;
   – keeping information about the vocabulary speeds up searching for relevant documents.
• For each term, the inverted file contains information on:
   – Location: all the text locations/positions where the word occurs;
   – Frequency: how often the term occurs in the document collection.

Enhancements to Inverted Files -- Concept
• Location: each posting holds information about the location of each term within the document.
   Uses:
   – user interface design -- highlight location of search term
   – adjacency and near operators (in Boolean searching)
• Frequency: each inverted list includes the number of postings for each term.
   Uses:
   – term weighting
   – query processing optimization

Inverted file
• Information about the location of each term within the document helps with:
   – user interface design: highlighting the location of the search term;
   – proximity-based ranking: adjacency and near operators (in Boolean searching).
• Information about frequency is used for:
   – calculating term weights (like TF, TF*IDF, …);
   – optimizing query processing.

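A sketch of how stored locations support a near operator. The index layout ({term: {doc_id: [positions]}}), the function names, and the sample documents are assumptions made for this illustration; the term frequency falls out as the length of each position list.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: list of terms} -> {term: {doc_id: [1-based positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms, start=1):
            index[term][doc_id].append(pos)
    return index

def near(index, t1, t2, k=1):
    """Doc ids in which t1 and t2 occur within k positions of each other."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = []
    for doc_id in sorted(p1.keys() & p2.keys()):
        if any(abs(a - b) <= k for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return hits

# Assumed sample documents, already reduced to index terms
docs = {
    1: ["department", "comput", "science", "establish"],
    2: ["department", "launch", "bsc", "comput", "study"],
    3: ["follow", "msc", "comput", "science", "start"],
}
index = build_positional_index(docs)
print(near(index, "comput", "science"))     # adjacent terms
print(near(index, "department", "comput"))  # adjacent only in some documents
```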
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

CF: total frequency of tj in the corpus.
Is it possible to keep all this information during searching?

Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
• A vocabulary file (word list):
   – stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order (i.e., like a dictionary), and
   – for each word, a pointer to the posting file.
• The record kept for each term j in the vocabulary (word list) contains the following:
   – term j
   – number of documents in which term j occurs (DFj)
   – collection frequency of term j (CFj)
   – pointer to the inverted (postings) list for term j

Postings File (Inverted List)
• For each distinct term in the vocabulary, the posting file stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
• Each list consists of one or more individual postings.
Advantage of dividing the inverted file into vocabulary and posting files:
• Keeping a pointer in the vocabulary to the list in the posting file allows:
   – the vocabulary to be kept in memory at search time even for a large text collection, while the posting file is kept on disk for accessing the pointers to documents.

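The split into two files can be sketched with two Python lists. This is an in-memory illustration under stated assumptions: the "pointer" is simply an index into a flat posting list, whereas a real system would store a byte offset into a file on disk. The names `build_files` and `lookup` are invented for this example.

```python
from bisect import bisect_left

def build_files(docs):
    """docs: {doc_id: list of terms} -> (vocabulary, posting_file).

    vocabulary:   sorted list of (term, DF, CF, pointer)
    posting_file: flat list of (doc_id, TF); pointer is the index of the
                  term's first posting in this list.
    """
    counts = {}
    for doc_id, terms in docs.items():
        for term in terms:
            per_doc = counts.setdefault(term, {})
            per_doc[doc_id] = per_doc.get(doc_id, 0) + 1
    vocabulary, posting_file = [], []
    for term in sorted(counts):
        per_doc = counts[term]
        vocabulary.append((term, len(per_doc), sum(per_doc.values()), len(posting_file)))
        posting_file.extend(sorted(per_doc.items()))
    return vocabulary, posting_file

def lookup(vocabulary, posting_file, term):
    """Binary search on the vocabulary, then follow the pointer to the postings."""
    i = bisect_left(vocabulary, (term,))
    if i == len(vocabulary) or vocabulary[i][0] != term:
        return []
    _, df, _, pointer = vocabulary[i]
    return posting_file[pointer:pointer + df]

# Assumed sample input: two documents after text operations
docs = {
    1: ["negative", "affect", "make", "hard", "do", "easy", "task", "make", "easy"],
    2: ["positive", "affect", "make", "easy", "do", "difficult", "task"],
}
vocabulary, posting_file = build_files(docs)
print(lookup(vocabulary, posting_file, "easy"))
```

Only the vocabulary has to be searched; the posting list is read once, starting at the pointer and covering DF entries.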
General structure of Inverted File
• The following figure shows the general structure of an inverted index file.

Organization of Index File
[Figure: the vocabulary (word list) holds a pointer to the postings (inverted lists), which in turn point to the documents]

Vocabulary (word list):
Term     DF   CF   Pointer to posting
term 1   3    3    (to inverted list)
term 2   3    4    (to inverted list)
term 3   1    1    (to inverted list)
term 4   2    3    (to inverted list)

Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
   Doc 1: Negative affect can make it harder to do even easy tasks. so make it easy
   Doc 2: positive affect can make it easier to do difficult tasks

Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.
• Steps:
   – Extract the terms in each doc
   – Sort the terms
   – Compile the terms, i.e., collect the frequencies for each term

Remove stop words and compute frequency
• Multiple entries of a term in a single document are merged and frequency information is added.

Stemming & compute frequency
• Multiple entries of a term in a single document are merged and frequency information is added.

Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file. Each vocabulary entry holds a pointer to its postings.

Vocabulary                 Posting
Term        DF   CF        (Doc #, TF)
affect      2    2         (1, 1) (2, 1)
difficult   1    1         (2, 1)
do          2    2         (1, 1) (2, 1)
easy        2    3         (1, 2) (2, 1)
hard        1    1         (1, 1)
make        2    3         (1, 2) (2, 1)
negative    1    1         (1, 1)
positive    1    1         (2, 1)
task        2    2         (1, 1) (2, 1)

Searching on Inverted File
• Since the index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes little memory even for a large document collection.
• Using binary search, searching takes logarithmic time.
   – The search is done in the vocabulary list.
• Updating the inverted file is complex.
   – We need to update both the vocabulary and the posting files.

Example: Create Inverted file
• Create an inverted file (both the vocabulary list and the posting file) for the following document collection:
   D1  The Department of Computer Science was established in 1984.
   D2  The Department launched its first BSc in Computer Studies in 1987.
   D3  Followed by the MSc in Computer Science which was started in 1991.
   D4  The Department also produced its first PhD graduate in 1994.
   D5  Our staff have contributed intellectually and professionally to the advancements in these fields.

…. Example: Create Inverted file
• After the text operations are performed, the following index terms remain (shown in red on the original slide):
   D1 = department comput science establish
   D2 = department launch bsc comput study
   D3 = follow msc comput science start
   D4 = department produce phd graduat
   D5 = staff contribut intellect profession advance field

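The vocabulary and posting entries for this collection can be derived mechanically from the five term lists above. The sketch below assumes locations are counted over the terms that remain after the text operations, starting at 1; the function names are invented for this example.

```python
from collections import Counter, defaultdict

docs = {
    1: "department comput science establish",
    2: "department launch bsc comput study",
    3: "follow msc comput science start",
    4: "department produce phd graduat",
    5: "staff contribut intellect profession advance field",
}

def vocabulary_table(docs):
    """Return sorted (term, DF, CF) rows."""
    df, cf = Counter(), Counter()
    for text in docs.values():
        terms = text.split()
        cf.update(terms)        # every occurrence counts toward CF
        df.update(set(terms))   # each document counts once toward DF
    return [(term, df[term], cf[term]) for term in sorted(cf)]

def posting_lists(docs):
    """Return {term: [(doc_id, TF, [locations]), ...]} in doc id order."""
    lists = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for loc, term in enumerate(docs[doc_id].split(), start=1):
            positions[term].append(loc)
        for term, locs in positions.items():
            lists[term].append((doc_id, len(locs), locs))
    return dict(lists)

table = vocabulary_table(docs)
postings = posting_lists(docs)
for row in table:
    print(row, postings[row[0]])
```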
Vocabulary and posting file for the example collection
All term-specific information (max TF, TF, TF-IDF, location, etc.) is stored in the posting file. Each word ID (WID) in the vocabulary is a pointer into the posting file, which in turn points to the document file:
   W1: d5
   W2: d2
   W3: d1, d2, d3
   Wn: di, …, dn

Vocabulary                     Posting
word         DF   CF   WID     Doc#   TF   mTF   loc
advance      1    1    W1      5      1    1     5
bsc          1    1    W2      2      1    1     3
comput       3    3    W3      1      1    1     2
                               2      1    1     4
                               3      1    1     2
contribut    1    1    W4      5      1    1     4
department   3    3    W5      1      1    1     1
                               2      1    1     1
                               4      1    1     1
establish    1    1    W6      (postings continue in the same way)
field        1    1    W7
follow       1    1
graduat      1    1
intellect    1    1
launch       1    1
msc          1    1
phd          1    1
produce      1    1
profession   1    1
science      2    2
staff        1    1
start        1    1
study        1    1

Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes.
   – Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
   – To build the suffix trie we use these indices instead of the actual objects.
• The structure has several advantages:
   – It requires less storage space.
   – We do not have to worry about how the text is represented (binary, ASCII, etc.).
   – We do not have to store the same object twice (no duplicates).

Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, from left to right, following the order of the characters in the string.
   TEXT:      G O O G O L $
   POSITION:  1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.

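A minimal sketch of the construction, using nested dictionaries as trie nodes. The representation is an assumption made for illustration: each leaf stores the 1-based start position of its suffix under the key "pos", which cannot clash with the single-character edge keys.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; text is assumed to end with a unique '$'."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1   # leaf: 1-based start position of this suffix
    return root

def count_leaves(node):
    """Number of leaves below node (one per suffix)."""
    if "pos" in node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def height(node):
    """Length of the longest root-to-leaf path, counted in edges."""
    children = [child for key, child in node.items() if key != "pos"]
    return 1 + max(height(child) for child in children) if children else 0

trie = build_suffix_trie("GOOGOL$")
print(count_leaves(trie), height(trie))
```

For GOOGOL$ (n = 7) this gives 7 leaves and height 7, in line with the note above.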
Suffix tree
• A suffix tree is an extension of the suffix trie: a trie of all the proper suffixes of S.
• The suffix tree is created by compacting the unary nodes of the suffix trie.
• We store pointers rather than words in the leaves.
• It is also possible to replace the string on every edge by a pair (a,b), where a and b are the beginning and end index of the string in the text, e.g.:
   (3,7) for OGOL$
   (1,2) for GO
   (7,7) for $

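The index pairs can be checked directly against the text. A small sketch, assuming 1-based inclusive indices as in the position table for GOOGOL$ (the helper name `edge_label` is made up):

```python
text = "GOOGOL$"

def edge_label(text, a, b):
    """Substring denoted by the 1-based, inclusive index pair (a, b)."""
    return text[a - 1:b]

for a, b in [(3, 7), (1, 2), (7, 7)]:
    print((a, b), edge_label(text, a, b))
```

Storing two integers per edge instead of the substring itself keeps each edge at constant size regardless of the label length.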
Example: Suffix tree
• Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$.
• We label each leaf with the starting point of the corresponding suffix.
   Suffixes: { $, b$, ab$, bab$, abab$ }
   root
      $           leaf 5
      ab
         $        leaf 3
         ab$      leaf 1
      b
         $        leaf 4
         ab$      leaf 2

Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of s ∈ S.
• To make suffixes prefix-free we add a special character, $, at the end of s.
• To associate each suffix with a unique string in S, add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where `$' and `#' are special terminators for s1 and s2.
• Ex.: Let s1 = abab and s2 = aab. A generalized suffix tree for s1 and s2 is:
   Suffixes of s1$: { 5. $, 4. b$, 3. ab$, 2. bab$, 1. abab$ }
   Suffixes of s2#: { 4. #, 3. b#, 2. ab#, 1. aab# }
   root
      #              leaf 4 (s2)
      $              leaf 5 (s1)
      a
         ab#         leaf 1 (s2)
         b
            #        leaf 2 (s2)
            $        leaf 3 (s1)
            ab$      leaf 1 (s1)
      b
         #           leaf 3 (s2)
         $           leaf 4 (s1)
         ab$         leaf 2 (s1)

Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is a prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
   – Start at the root.
   – Go down the tree, each time taking the path that matches the next characters of S.
   – If S corresponds to a node x, return all the leaves in the subtree rooted at x:
      the places where S can be found are given by the pointers in all the leaves in the subtree rooted at x.
   – If a NIL pointer is encountered before reaching the end of S, then S is not in the tree.

Search in suffix tree
Example:
1. Find GO
2. Find OR
• If S = "GO", we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR", we take the O path and then hit a NIL pointer, so "OR" is not in the tree.

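The two searches can be reproduced with a short sketch. For simplicity it walks an uncompressed suffix trie rather than a compacted suffix tree, which is a swapped-in simplification; the search logic (follow the path, then collect the leaves below) is the same. The nested-dict representation and the "pos" leaf key are assumptions of this example.

```python
def build_suffix_trie(text):
    """Nested-dict trie of all suffixes; leaves store 1-based start positions."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

def find(trie, pattern):
    """Return the sorted start positions of pattern, or [] if the path breaks."""
    node = trie
    for ch in pattern:
        if ch not in node:        # the NIL pointer case
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:                  # collect every leaf in the subtree
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

text = "GOOGOL$"
trie = build_suffix_trie(text)
print([text[p - 1:] for p in find(trie, "GO")])  # suffixes that start with GO
print(find(trie, "OR"))                           # empty: OR does not occur
```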
Exercise
• Given the following index terms: worker, word and world, construct an index file using a suffix tree.

Suffix Tree Applications
• Suffix trees can be used to solve a large number of string problems that occur in:
   – text editing,
   – free-text search,
   – etc.
• Some examples of string problems are given below:
   – String matching
   – Longest Common Substring
   – Longest Repeated Substring
   – Palindromes
   – etc.

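As one illustration of these problems, the longest repeated substring can be found by comparing neighbouring suffixes in sorted order. Note that this sketch uses a sorted suffix list instead of an actual suffix tree, a simpler stand-in chosen to keep the example short; it is quadratic in the worst case and meant only to show the idea.

```python
def longest_repeated_substring(text):
    """Longest substring that occurs at least twice in text ('' if none)."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    # A repeated substring is a common prefix of two suffixes, and the longest
    # common prefixes always occur between neighbours in sorted order.
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeated_substring("GOOGOL"))
print(longest_repeated_substring("abab"))
```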
Complexity Analysis
• The suffix tree for a string of length n can be built in O(n²) time with the straightforward construction.
• Searching is very fast: the search time is linear in the length of the search string S.
• The number of leaves is n+1, where n is the length of the input string: one leaf for each suffix, including the terminator on its own.
• Furthermore, in the leaves we may store either the strings themselves or pointers to the strings (that is, integers).
• Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.

Thank you
