Lecture 3-Skip Pointers and Phrase Queries
Introduction to Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
[Figure: postings lists for Brutus (. . . , 41, 48, 64, 128) and Caesar (. . . , 11, 17, 21, 31)]
[Figure: two postings lists augmented with skip pointers; surviving values: 41, 48, 64, 128 and 11, 17, 21, 31]
Why?
To skip postings that will not figure in the search
results.
How?
Where do we place skip pointers?
Placing skips
Simple heuristic: for postings of length L, use √L evenly spaced skip pointers
[Moffat and Zobel 1996]
This ignores the distribution of query terms.
Easy if the index is relatively static; harder if L keeps
changing because of updates.
This definitely used to help; with modern hardware it may
not, unless you're memory-based [Bahle et al. 2002]
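
A minimal sketch of a postings merge that uses skip pointers, assuming in-memory sorted lists of docIDs and the √L placement heuristic above. The names `build_skips` and `intersect_with_skips` and the full example lists are illustrative assumptions, not taken from the slides.

```python
import math


def build_skips(postings):
    """Map a position to its skip target, using about sqrt(L) evenly spaced skips."""
    n = len(postings)
    step = math.isqrt(n)
    if step < 2:
        return {}  # list too short for skips to be worthwhile
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    """Intersect two sorted docID lists, following skips where it is safe."""
    skips1, skips2 = build_skips(p1), build_skips(p2)
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow a skip only if its target does not overshoot p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer


# Illustrative lists (assumed values)
brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(brutus, caesar))
```

A skip is taken only when the target docID is no larger than the current docID on the other list, so no matching posting can be jumped over. More skips mean shorter jumps that succeed more often but cost more comparisons; fewer skips mean longer jumps that rarely succeed.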
1. Biword indexes
One approach to handling phrases is to consider
every pair of consecutive terms in a document as
a phrase.
For example, the text "Friends, Romans,
Countrymen" would generate the biwords:
friends romans
romans countrymen
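
A small sketch of biword generation, assuming a simple lowercase alphanumeric tokenizer (the tokenization rule and the function name `biwords` are assumptions for illustration):

```python
import re


def biwords(text):
    """Return every pair of consecutive tokens as a single 'biword' string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```

Each biword would then be treated as a vocabulary term in its own right, with its own postings list.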
2. Positional indexes
A biword index is not the standard solution.
Rather, a positional index is most commonly
employed.
Here, for each term in the vocabulary, we store
postings of the form docID: (position1,
position2, . . . ), e.g.
to, 993427:
(1, 6: (7, 18, 33, 72, 86, 231);
2, 5: (1, 17, 74, 222, 255);
4, 5: (8, 16, 190, 429, 433);
5, 2: (363, 367);
7, 3: (13, 23, 191); . . . )
be, 178239:
(1, 2: (17, 25);
4, 5: (17, 191, 291, 430, 434);
2. Positional indexes
To process a phrase query, we still need to access
the inverted index entries for each distinct term.
As before, we would start with the least frequent
term and then work to further restrict the list of
possible candidates.
In the merge operation, the same general
technique is used as before, but rather than simply
checking that both terms are in a document, we
also need to check that their positions of
appearance in the document are compatible with
the phrase query being evaluated.
to: (. . . ; 4: (. . . , 429, 433); . . . )
be: (. . . ; 4: (. . . , 430, 434); . . . )