
Document Indexing in Information Retrieval:

Document indexing is a critical step in information retrieval, where large amounts of data are pre-processed to facilitate fast and accurate retrieval. It involves breaking down documents into terms, creating an inverted index, and assigning metadata to optimize search efficiency. Indexing strategies play a significant role in enhancing retrieval speed and the precision of search results.

Uploaded by

Jawad Abid

Introduction to Information Retrieval

Faster postings merges: skip pointers/skip lists

Sec. 2.3

Recall basic merge


• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries

Brutus:        2  4  8  41  48  64  128
Caesar:        1  2  3  8  11  17  21  31
Intersection:  2  8

If the list lengths are m and n, the merge takes O(m+n) operations.
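
The linear merge can be sketched in a few lines of Python. This is a minimal illustration, assuming each postings list is a sorted Python list of document IDs; the function name `intersect` is chosen here for the example and is not from the slides.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with one pass over each."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID on both lists: it is part of the result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
result = intersect(brutus, caesar)  # [2, 8]
```

Each loop iteration advances at least one pointer, so the number of iterations is at most m + n.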

Can we do better?
Yes (if the index isn’t changing too fast).

Augment postings with skip pointers (at indexing time)

Brutus:  2  4  8  41  48  64  128
         skip pointers: 2 → 41 → 128

Caesar:  1  2  3  8  11  17  21  31
         skip pointers: 1 → 11 → 31

• Why?
  • To skip postings that will not figure in the search results.
• How?
• Where do we place skip pointers?

Query processing with skip pointers

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.
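
The walk-through above can be reproduced with a small sketch. It assumes (for illustration only) that skip pointers are stored as a dictionary mapping a position in the list to the position it skips to; the names `advance` and `intersect_with_skips` and this representation are hypothetical, not taken from the slides.

```python
def advance(postings, skips, idx, target):
    """Move idx forward: follow skip pointers while the skip target
    does not overshoot `target`; otherwise step to the next posting."""
    if idx in skips and postings[skips[idx]] <= target:
        while idx in skips and postings[skips[idx]] <= target:
            idx = skips[idx]
        return idx
    return idx + 1


def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect two sorted postings lists, using skip pointers when safe."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, skips1, i, p2[j])
        else:
            j = advance(p2, skips2, j, p1[i])
    return answer


# Lists and skip pointers from the figure (positions are 0-based).
brutus = [2, 4, 8, 41, 48, 64, 128]
brutus_skips = {0: 3, 3: 6}   # 2 -> 41 -> 128
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
caesar_skips = {0: 4, 4: 7}   # 1 -> 11 -> 31
result = intersect_with_skips(brutus, brutus_skips, caesar, caesar_skips)
```

After matching 8, the lower list sits at 11 with 41 on the upper list; the skip target 31 does not exceed 41, so the sketch jumps from 11 straight to 31 without visiting 17 and 21.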

Where do we place skips?


• Tradeoff:
  • More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
  • Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.

Placing skips
• Simple heuristic: for a postings list of length L, use √L evenly spaced skip pointers.
• This ignores the distribution of query terms.
• Easy if the index is relatively static; harder if L keeps changing because of updates.

• This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based
  • The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
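
A minimal sketch of the square-root placement heuristic, assuming skip pointers are represented as a dictionary from a 0-based position to the position it skips to (an illustrative choice, not something the slides prescribe). The function name `build_skips` is hypothetical.

```python
import math


def build_skips(postings):
    """Place evenly spaced skip pointers with a stride of about sqrt(L),
    which yields roughly sqrt(L) pointers for a list of length L."""
    length = len(postings)
    stride = math.isqrt(length)
    if stride < 2:
        return {}  # list too short for skips to be useful
    # Each pointer jumps `stride` positions ahead and stays inside the list.
    return {i: i + stride for i in range(0, length - stride, stride)}


skips = build_skips(list(range(100, 116)))  # a list of 16 postings
```

Because the stride depends on L, every update that changes the list length can invalidate the placement, which is why this is easier for a relatively static index.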

Phrase queries and positional indexes

Sec. 2.4

Phrase queries
• We want to be able to answer queries such as “stanford university” as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
  • The concept of phrase queries has proven easily understood by users; it is one of the few “advanced search” ideas that works
  • Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term : docs> entries
Sec. 2.4.1

A first attempt: Biword indexes

• Index every consecutive pair of terms in the text as a phrase
• For example, the text “Friends, Romans, Countrymen” would generate the biwords
  • friends romans
  • romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query-processing is now immediate.
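
A small sketch of building a biword index. It assumes a very simple tokenizer (lowercase, split on non-word characters) and an in-memory dictionary from biword to a set of document IDs; the second document and the helper names are made up for the example.

```python
import re
from collections import defaultdict


def biwords(text):
    """Return the consecutive term pairs of a text as 'term1 term2' strings."""
    tokens = re.findall(r"\w+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def build_biword_index(docs):
    """Map every biword to the set of docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for biword in biwords(text):
            index[biword].add(doc_id)
    return index


docs = {
    1: "Friends, Romans, Countrymen",
    2: "Romans and countrymen",  # hypothetical second document
}
index = build_biword_index(docs)
hits = index["romans countrymen"]  # a two-word phrase query is one lookup
```

Document 2 contains both words but not as a consecutive pair, so only document 1 is returned.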

Longer phrase queries

• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on biwords:
  “stanford university” AND “university palo” AND “palo alto”

Without examining the documents themselves, we cannot verify that the documents matching the above Boolean query actually contain the phrase.

Can have false positives!
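
The false-positive problem can be shown with a short sketch. The document text below is invented for the example, tokenization is plain whitespace splitting, and the function names are hypothetical.

```python
def biword_set(tokens):
    """Set of consecutive token pairs."""
    return set(zip(tokens, tokens[1:]))


def matches_biword_query(doc, phrase):
    """True if the doc contains every biword of the phrase (the AND query)."""
    return biword_set(phrase.lower().split()) <= biword_set(doc.lower().split())


def contains_phrase(doc, phrase):
    """True only if the phrase tokens occur consecutively in the doc."""
    d, p = doc.lower().split(), phrase.lower().split()
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))


phrase = "stanford university palo alto"
# Invented document: all three biwords occur, but never as one phrase.
doc = "palo alto hosts stanford university and the university palo campus"

biword_hit = matches_biword_query(doc, phrase)   # True
real_hit = contains_phrase(doc, phrase)          # False
```

The biword query accepts the document although the four-word phrase never occurs in it, which is exactly the false positive the slide warns about.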


Issues for biword indexes

• False positives, as noted before
• Index blowup due to bigger dictionary
  • Infeasible for more than biwords, big even for them

• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Sec. 2.4.2

Solution 2: Positional indexes

• In the postings, store, for each term, the position(s) in which tokens of it appear:

<term, number of docs containing term;
 doc1: position1, position2 … ;
 doc2: position1, position2 … ;
 etc.>
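
One possible in-memory layout for this structure, sketched under the assumption that documents are tokenized by whitespace and positions are counted from 1. The two sample documents and the function name are made up for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; positions are 1-based token offsets."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


docs = {1: "to be or not to be", 2: "to do is to be"}  # hypothetical documents
index = build_positional_index(docs)

to_postings = index["to"]        # {1: [1, 5], 2: [1, 4]}
doc_frequency = len(index["be"])  # number of docs containing "be"
```

The document frequency from the schematic entry is simply the number of keys in the per-term dictionary.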

Positional index example

<be: 993427;
 1: 7, 18, 33, 72, 86, 231;
 2: 3, 149;
 4: 17, 191, 291, 430, 434;
 5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”?

• For phrase queries, we use a merge algorithm recursively at the document level
• But we now need to deal with more than just equality
Processing a phrase query

• Extract inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions with “to be or not to be”.
  • to:
    2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  • be:
    1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
• Same general method for proximity searches
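
As a sketch of the position-level merge, the following finds where “to” is immediately followed by “be”, using only the postings listed above (the entries elided by “...” are left out). The dictionary layout and function names are assumptions made for the example, and the document-level step is simplified to a key intersection.

```python
def adjacent_positions(pos1, pos2):
    """Positions p in pos1 such that p + 1 occurs in pos2 (both lists sorted)."""
    hits = []
    i, j = 0, 0
    while i < len(pos1) and j < len(pos2):
        if pos2[j] == pos1[i] + 1:
            hits.append(pos1[i])
            i += 1
            j += 1
        elif pos2[j] <= pos1[i]:
            j += 1  # second-term position is too early to follow pos1[i]
        else:
            i += 1  # no occurrence of the second term right after pos1[i]
    return hits


def two_term_phrase(first, second):
    """Return {doc_id: start positions} for the phrase 'first second'."""
    result = {}
    for doc_id in sorted(first.keys() & second.keys()):
        hits = adjacent_positions(first[doc_id], second[doc_id])
        if hits:
            result[doc_id] = hits
    return result


to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
matches = two_term_phrase(to, be)  # {4: [16, 190, 429, 433]}
```

Only document 4 appears in both lists, and “to” at positions 16, 190, 429 and 433 is directly followed by “be”. A full “to be or not to be” match would chain the same check across all six query positions.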

Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  • Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for such queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  • This is a little tricky to do correctly and efficiently
  • See Figure 2.12 of IIR
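
One way to approach the exercise for a single document is sketched below. It is not a transcription of Figure 2.12 of IIR, only a sliding-window illustration in the same spirit; the function name and the sample position lists are assumptions.

```python
def within_k(pos1, pos2, k):
    """Return all pairs (p1, p2) with |p1 - p2| <= k.
    Both inputs are sorted position lists from the same document."""
    pairs = []
    start = 0  # first index in pos2 that can still be within k of p1
    for p1 in pos1:
        # Positions more than k before p1 can never match later p1 values.
        while start < len(pos2) and pos2[start] < p1 - k:
            start += 1
        j = start
        while j < len(pos2) and pos2[j] <= p1 + k:
            pairs.append((p1, pos2[j]))
            j += 1
    return pairs


pairs = within_k([8, 16, 190], [17, 191, 291], 3)
```

The `start` index only moves forward, so the cost is linear in the two list lengths plus the number of pairs reported, for any value of k.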

Positional index size

• A positional index expands postings storage substantially
  • Even though indices can be compressed
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system.

Positional index size

• Need an entry for each occurrence, not just once per document
• Index size depends on average document size (Why?)
  • Average web page has <1000 terms
  • SEC filings, books, even some epic poems … easily 100,000 terms
• Consider a term with frequency 0.1%

Document size   Postings   Positional postings
1000            1          1
100,000         1          100
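
The table follows from simple arithmetic: a term with frequency 0.1% is expected to occur once per 1000 tokens. A short sketch, with a hypothetical function name and exact fractions to avoid rounding noise:

```python
from fractions import Fraction


def postings_per_document(doc_length, term_frequency=Fraction(1, 1000)):
    """Return (non-positional postings, positional postings) for one document,
    assuming the term occurs at exactly the given relative frequency."""
    occurrences = int(doc_length * term_frequency)
    nonpositional = 1 if occurrences > 0 else 0  # one entry per document
    positional = occurrences                      # one entry per occurrence
    return nonpositional, positional


small = postings_per_document(1000)      # (1, 1)
large = postings_per_document(100_000)   # (1, 100)
```

The non-positional cost per document stays constant, while the positional cost grows linearly with document length.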

Rules of thumb
• A positional index is 2–4 times as large as a non-positional index

• Positional index size is 35–50% of the volume of the original text

• Caveat: all of this holds for “English-like” languages

Sec. 2.4.3

Combination schemes
• These two approaches can be profitably combined
  • For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
  • Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  • A typical web query mixture was executed in ¼ of the time of using just a positional index
  • It required 26% more space than having a positional index alone
