Document Indexing in Information Retrieval:

Document indexing is a critical step in information retrieval, where large amounts of data are pre-processed to facilitate fast and accurate retrieval. It involves breaking down documents into terms, creating an inverted index, and assigning metadata to optimize search efficiency. Indexing strategies play a significant role in enhancing retrieval speed and the precision of search results.

Uploaded by

Jawad Abid


Introduction to Information Retrieval
Faster postings merges:
Skip pointers/Skip lists
Sec. 2.3

Recall basic merge

• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries

Brutus: 2 4 8 41 48 64 128
Caesar: 1 2 3 8 11 17 21 31
Intersection: 2 8

If the list lengths are m and n, the merge takes O(m+n) operations.
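
The merge described above can be sketched in a few lines of Python. This is only an illustration: the function name `intersect` and the use of plain Python lists for postings are assumptions, not something the slides prescribe.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists with one pass over each list."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID on both lists: it belongs to the result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one of the two cursors, which is where the O(m+n) bound comes from.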

Can we do better?
Yes (if index isn’t changing too fast).

Augment postings with skip pointers (at indexing time)

Upper list: 2 4 8 41 48 64 128 (skip pointers to 41 and 128)
Lower list: 1 2 3 8 11 17 21 31 (skip pointers to 11 and 31)

• Why?
• To skip postings that will not figure in the
search results.
• How?
• Where do we place skip pointers?

Query processing with skip pointers

Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.

We then have 41 on the upper list and 11 on the lower. 11 is smaller.

But the skip successor of 11 on the lower list is 31, and 31 is still no larger than 41, so we can skip ahead past the intervening postings (17 and 21).
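
The walk-through above can be sketched as follows. The representation is an assumption made for illustration: each list carries a dict mapping the index of a posting to the index its skip pointer lands on, and the starting points of the skips (the first posting and each skip target) are guessed from the figure, which only shows where the skips land.

```python
def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect sorted postings lists, following skip pointers when safe.

    skips1 / skips2 map an index in the list to the index its skip lands on.
    """
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                # The skip target does not overshoot: follow skips while safe.
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer


upper = [2, 4, 8, 41, 48, 64, 128]
upper_skips = {0: 3, 3: 6}   # 2 -> 41, 41 -> 128 (start points assumed)
lower = [1, 2, 3, 8, 11, 17, 21, 31]
lower_skips = {0: 4, 4: 7}   # 1 -> 11, 11 -> 31 (start points assumed)
print(intersect_with_skips(upper, upper_skips, lower, lower_skips))  # [2, 8]
```

After matching 8, the cursor on the lower list sits on 11 and jumps directly to 31, never comparing 17 or 21 against 41.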

Where do we place skips?

• Tradeoff:
• More skips → shorter skip spans → more likely to skip. But lots of comparisons to skip pointers.
• Fewer skips → fewer pointer comparisons, but then long skip spans → few successful skips.

Placing skips
• Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
• This ignores the distribution of query terms.
• Easy if the index is relatively static; harder if L keeps changing because of updates.

• This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based
• The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
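
As a rough sketch of the √L heuristic, the helper below places a skip pointer every ⌊√L⌋ postings. The name `add_skips` and the index-to-index dict it returns are illustrative assumptions; a real index would store the pointers inside the postings structure itself.

```python
import math


def add_skips(postings):
    """Return {from_index: to_index} with roughly sqrt(L) evenly spaced skips."""
    length = len(postings)
    step = math.isqrt(length)  # skip span of about sqrt(L) postings
    if step < 2:
        return {}  # list too short for skips to be useful
    # Only create skips whose target index is still inside the list.
    return {i: i + step for i in range(0, length - step, step)}


print(add_skips(list(range(16))))  # {0: 4, 4: 8, 8: 12}
```

Because the spacing depends on L, every update that changes the list length can force the skips to be rebuilt, which is the maintenance cost the bullets above refer to.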
Introduction to Information Retrieval
Phrase queries and positional indexes
Introduction to Information Retrieval Sec. 2.4

Phrase queries
• We want to be able to answer queries such as “stanford university” as a phrase
• Thus the sentence “I went to university at Stanford” is not a match.
  • The concept of phrase queries has proven easily understood by users; it is one of the few “advanced search” ideas that works
  • Many more queries are implicit phrase queries
• For this, it no longer suffices to store only <term : docs> entries
Introduction to Information Retrieval Sec. 2.4.1

A first attempt: Biword indexes

• Index every consecutive pair of terms in the text as a phrase
• For example, the text “Friends, Romans, Countrymen” would generate the biwords
  • friends romans
  • romans countrymen
• Each of these biwords is now a dictionary term
• Two-word phrase query processing is now immediate.

Longer phrase queries

• Longer phrases can be processed by breaking them down
• stanford university palo alto can be broken into the Boolean query on biwords:
  “stanford university” AND “university palo” AND “palo alto”

Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

Can have false positives!
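
A minimal sketch of a biword index and the Boolean AND over biwords, assuming a toy tokenizer and two made-up documents (none of these names or texts come from the slides). The second document contains all three biwords but never the full phrase, so it shows up as a false positive.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Toy tokenizer (an assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_biword_index(docs):
    """Map each consecutive pair of terms to the set of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for first, second in zip(tokens, tokens[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of every biword in the phrase."""
    tokens = tokenize(phrase)
    biwords = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result


docs = {
    1: "Stanford University Palo Alto campus map",
    2: "Palo Alto hosts Stanford University and the university Palo campus",
}
index = build_biword_index(docs)
print(phrase_candidates(index, "stanford university palo alto"))  # {1, 2}
```

Only document 1 contains the phrase; ruling out document 2 requires going back to the text or to a positional index.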


Issues for biword indexes

• False positives, as noted before
• Index blowup due to bigger dictionary
  • Infeasible for more than biwords, big even for them

• Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Introduction to Information Retrieval Sec. 2.4.2

Solution 2: Positional indexes

• In the postings, store, for each term, the position(s) in which tokens of it appear:

<term, number of docs containing term;
doc1: position1, position2 … ;
doc2: position1, position2 … ;
etc.>
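
The entry layout above can be sketched as a nested mapping, term → {doc: [positions]}. The whitespace tokenizer, the 1-based positions, and the two sample documents are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build term -> {doc_id: [positions]} with 1-based token positions."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index


docs = {1: "to be or not to be", 2: "be quick"}
index = build_positional_index(docs)
# Number of docs containing the term, then the per-document position lists.
print(len(index["be"]), index["be"])  # 2 {1: [2, 6], 2: [1]}
```

The document frequency is simply the number of keys in a term's inner dict, so it does not need to be stored separately in this sketch.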

Positional index example

<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>

Which of docs 1, 2, 4, 5 could contain “to be or not to be”? The phrase needs two occurrences of be exactly four positions apart, so only docs 4 (430, 434) and 5 (363, 367) qualify.

• For phrase queries, we use a merge algorithm recursively at the document level
• But we now need to deal with more than just equality

Processing a phrase query

• Extract inverted index entries for each distinct term: to, be, or, not.
• Merge their doc:position lists to enumerate all positions with “to be or not to be”.
  • to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  • be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
• Same general method for proximity searches
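
To make the position check concrete, here is a small sketch for the two-term phrase “to be”, using only the postings shown above (the entries hidden behind the trailing “...” are left out). For brevity it uses set lookups rather than the linear merge the slides describe; the function name and the dict layout are assumptions.

```python
def phrase_match(first, second):
    """Find docs where a position of `second` directly follows one of `first`.

    Both arguments map doc_id -> sorted list of positions.
    Returns doc_id -> list of start positions of the two-word phrase.
    """
    matches = {}
    for doc_id in sorted(first.keys() & second.keys()):
        following = set(second[doc_id])
        starts = [pos for pos in first[doc_id] if pos + 1 in following]
        if starts:
            matches[doc_id] = starts
    return matches


to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_match(to, be))  # {4: [16, 190, 429, 433]}
```

Doc 4 is the only document on both lists, and four of its positions for to are immediately followed by be. Extending this to the full six-word phrase means chaining the same check across or and not.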

Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  • Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for such queries; biword indexes cannot.
• Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  • This is a little tricky to do correctly and efficiently
  • See Figure 2.12 of IIR
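
One possible answer to the exercise is sketched below. It is written in the spirit of the positional intersection the slide points to, but it is not a transcription of that figure: the function name, the `(doc_id, positions)` tuple layout, and the sliding-window bookkeeping are all assumptions.

```python
from collections import deque


def positional_intersect(p1, p2, k):
    """Return (doc_id, pos1, pos2) triples with |pos1 - pos2| <= k.

    p1 and p2 are lists of (doc_id, sorted_positions), sorted by doc_id.
    """
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            window = deque()  # positions of term 2 near the current pos1
            nxt = 0
            for pos1 in positions1:
                # Pull in term-2 positions up to pos1 + k.
                while nxt < len(positions2) and positions2[nxt] <= pos1 + k:
                    window.append(positions2[nxt])
                    nxt += 1
                # Drop positions that are now more than k behind pos1.
                while window and window[0] < pos1 - k:
                    window.popleft()
                for pos2 in window:
                    answer.append((doc1, pos1, pos2))
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer


print(positional_intersect([(1, [1, 10])], [(1, [3, 12, 13])], k=2))
# [(1, 1, 3), (1, 10, 12)]
```

Because both position lists are sorted, each position enters and leaves the window at most once, so the work per document stays linear in the list lengths plus the number of matches reported, for any value of k.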

Positional index size

• A positional index expands postings storage substantially
  • Even though indices can be compressed
• Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries, whether used explicitly or implicitly in a ranking retrieval system.

Positional index size

• Need an entry for each occurrence, not just once per document
• Index size depends on average document size (Why?)
  • Average web page has <1000 terms
  • SEC filings, books, even some epic poems … easily 100,000 terms
• Consider a term with frequency 0.1%

Document size    Postings    Positional postings
1000             1           1
100,000          1           100
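
The table follows from simple arithmetic, sketched here under the assumption that the term actually occurs in the document: a non-positional index stores one posting per document, while a positional index stores one entry per occurrence. The function name is illustrative.

```python
def expected_postings(doc_length, term_frequency):
    """Return (non-positional postings, positional postings) for one document."""
    occurrences = max(1, round(doc_length * term_frequency))
    return 1, occurrences


for doc_length in (1000, 100_000):
    print(doc_length, expected_postings(doc_length, 0.001))
# 1000 (1, 1)
# 100000 (1, 100)
```

Longer documents therefore inflate the positional index while leaving the non-positional one unchanged.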

Rules of thumb
• A positional index is 2–4 times as large as a non-positional index

• Positional index size is 35–50% of the volume of the original text

• Caveat: all of this holds for “English-like” languages

Introduction to Information Retrieval Sec. 2.4.3

Combination schemes
• These two approaches can be profitably combined
  • For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
  • Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  • A typical web query mixture was executed in ¼ of the time of using just a positional index
  • It required 26% more space than having a positional index alone
