The document is an introduction to index construction for information retrieval. It gives an overview of the blocked sort-based indexing (BSBI) and single-pass in-memory indexing (SPIMI) algorithms for constructing indexes, and discusses distributed and dynamic indexing. It uses the Reuters RCV1 collection as an example dataset, provides statistics about its size and structure, and discusses why simple sort-based indexing does not scale to very large collections due to disk I/O performance.

Introduction to Information Retrieval

http://informationretrieval.org

IIR 4: Index Construction

Hinrich Schütze

Center for Information and Language Processing, University of Munich

2014-04-16

1 / 54
Overview

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

2 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

3 / 54
Dictionary as array of fixed-width entries

term       document    pointer to
           frequency   postings list
a          656,265     −→
aachen     65          −→
...        ...         ...
zulu       221         −→

space needed: 20 bytes | 4 bytes | 4 bytes
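The fixed-width layout above can be sketched with Python's struct module (the field sizes are from the slide; using a file offset as the "pointer to postings list" is my assumption for illustration):

```python
import struct

# 20-byte term + 4-byte document frequency + 4-byte pointer = 28 bytes/entry.
# "<" forces standard sizes with no alignment padding.
ENTRY = struct.Struct("<20sII")

def pack_entry(term, df, postings_offset):
    # struct pads the 20s field with null bytes (and truncates longer terms)
    return ENTRY.pack(term.encode("ascii"), df, postings_offset)

def unpack_entry(buf):
    term, df, offset = ENTRY.unpack(buf)
    return term.rstrip(b"\0").decode("ascii"), df, offset

assert ENTRY.size == 28
assert unpack_entry(pack_entry("aachen", 65, 4096)) == ("aachen", 65, 4096)
```

A sorted array of such entries supports binary search by term, which is what the B-tree on the next slide generalizes.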

4 / 54
B-tree for looking up entries in array

5 / 54
Wildcard queries using a permuterm index

Queries:
For X, look up X$
For X*, look up $X*
For *X, look up X$*
For *X*, look up X*
For X*Y, look up
Y$X*
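The rotation rules above can be sketched in Python (helper names are mine; a real permuterm index would use a B-tree over the rotations rather than a scan):

```python
def rotations(term):
    # Index every rotation of term + "$".
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    """Map a wildcard query to its permuterm lookup key (slide rules)."""
    if "*" not in query:
        return query + "$"                 # X    -> X$
    if query.startswith("*") and query.endswith("*"):
        return query[1:-1] + "*"           # *X*  -> X*
    head, tail = query.split("*")
    if tail == "":
        return "$" + head + "*"            # X*   -> $X*
    if head == "":
        return tail + "$*"                 # *X   -> X$*
    return tail + "$" + head + "*"         # X*Y  -> Y$X*

def matches(query, vocabulary):
    key = rotate_query(query)
    if key.endswith("*"):                  # trailing * = prefix lookup
        prefix = key[:-1]
        return {t for t in vocabulary
                if any(r.startswith(prefix) for r in rotations(t))}
    return {t for t in vocabulary if key in rotations(t)}

assert matches("hel*", {"hello", "help", "ahelp"}) == {"hello", "help"}
assert matches("*lo", {"hello", "help"}) == {"hello"}
assert matches("h*o", {"hello", "help"}) == {"hello"}
```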

6 / 54
k-gram indexes for spelling correction: bordroom

bo ✲ aboard ✲ about ✲ boardroom ✲ border

or ✲ border ✲ lord ✲ morbid ✲ sordid

rd ✲ aboard ✲ ardent ✲ boardroom ✲ border

7 / 54
Levenshtein distance for spelling correction
LevenshteinDistance(s1 , s2 )
1 for i ← 0 to |s1 |
2 do m[i , 0] = i
3 for j ← 0 to |s2 |
4 do m[0, j] = j
5 for i ← 1 to |s1 |
6 do for j ← 1 to |s2 |
7 do if s1 [i ] = s2 [j]
8 then m[i , j] = min{m[i − 1, j] + 1, m[i , j − 1] + 1, m[i − 1, j − 1]}
9 else m[i , j] = min{m[i − 1, j] + 1, m[i , j − 1] + 1, m[i − 1, j − 1] + 1}
10 return m[|s1 |, |s2 |]
Operations: insert, delete, replace, copy
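The pseudocode above translates directly into runnable Python (0-based indexing shifts the string subscripts by one; copy is free, the other operations cost 1):

```python
def levenshtein(s1, s2):
    # m[i][j] = edit distance between s1[:i] and s2[:j]
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        m[i][0] = i                                   # i deletions
    for j in range(len(s2) + 1):
        m[0][j] = j                                   # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            copy = 0 if s1[i - 1] == s2[j - 1] else 1  # copy free, replace costs 1
            m[i][j] = min(m[i - 1][j] + 1,             # delete
                          m[i][j - 1] + 1,             # insert
                          m[i - 1][j - 1] + copy)      # copy / replace
    return m[len(s1)][len(s2)]

assert levenshtein("bordroom", "boardroom") == 1   # one insertion fixes the typo
```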

8 / 54
Exercise: Understand Peter Norvig’s spelling corrector
import re, collections

def words(text): return re.findall(r'[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word)
               for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

9 / 54
Take-away

Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)
Distributed index construction: MapReduce
Dynamic index construction: how to keep the index up-to-date as the collection changes

10 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

11 / 54
Hardware basics

Many design decisions in information retrieval are based on hardware constraints.
We begin by reviewing hardware basics that we’ll need in this course.

12 / 54
Hardware basics

Access to data is much faster in memory than on disk. (roughly a factor of 10)
Disk seeks are “idle” time: No data is transferred from disk while the disk head is being positioned.
To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.
Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB
Servers used in IR systems typically have many GBs of main memory and TBs of disk space.
Fault tolerance is expensive: It’s cheaper to use many regular machines than one fault-tolerant machine.

13 / 54
Some stats (ca. 2008)
symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10^-3 s
b        transfer time per byte                              0.02 µs = 2 × 10^-8 s
         processor’s clock rate                              10^9 s^-1
p        lowlevel operation (e.g., compare & swap a word)    0.01 µs = 10^-8 s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more

14 / 54
RCV1 collection

Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.
As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
English newswire articles sent over the wire in 1995 and 1996 (one year).

15 / 54
A Reuters RCV1 document

16 / 54
Reuters RCV1 statistics

N documents 800,000
L tokens per document 200
M terms (= word types) 400,000
bytes per token (incl. spaces/punct.) 6
bytes per token (without spaces/punct.) 4.5
bytes per term (= word type) 7.5
T non-positional postings 100,000,000
Exercise: Average frequency of a term (how many tokens)?
4.5 bytes per word token vs. 7.5 bytes per word type: why the difference?
How many positional postings?
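The exercise numbers follow directly from the table; a quick check of the arithmetic (variable names mirror the table's symbols):

```python
N, L, M, T = 800_000, 200, 400_000, 100_000_000  # docs, tokens/doc, terms, postings

tokens = N * L                 # word tokens in the collection
assert tokens == 160_000_000   # = number of positional postings (one per token)
assert tokens / M == 400.0     # average frequency of a term, in tokens
assert T / M == 250.0          # average number of docs a term occurs in
```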

17 / 54
Exercise

Why does this algorithm not scale to very large collections?

18 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

19 / 54
Goal: construct the inverted index

Brutus −→ 1 2 4 11 31 45 173 174

Caesar −→ 1 2 4 5 6 16 57 132 ...

Calpurnia −→ 2 31 54 101

...

(dictionary)   (postings)

20 / 54
Index construction in IIR 1: Sort postings in memory
term docID term docID
I 1 ambitious 2
did 1 be 2
enact 1 brutus 1
julius 1 brutus 2
caesar 1 capitol 1
I 1 caesar 1
was 1 caesar 2
killed 1 caesar 2
i’ 1 did 1
the 1 enact 1
capitol 1 hath 1
brutus 1 I 1
killed 1 I 1
me 1 i’ 1
so 2 =⇒ it 2
let 2 julius 1
it 2 killed 1
be 2 killed 1
with 2 let 2
caesar 2 me 1
the 2 noble 2
noble 2 so 2
brutus 2 the 1
hath 2 the 2
told 2 told 2
you 2 you 2
caesar 2 was 1
was 2 was 2
ambitious 2 with 2

21 / 54
Sort-based index construction

As we build the index, we parse docs one at a time.
The final postings for any term are incomplete until the end.
Can we keep all postings in memory and then do the sort in-memory at the end?
No, not for large collections.
Thus: We need to store intermediate results on disk.

22 / 54
Same algorithm for disk?

Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
No: Sorting very large sets of records on disk is too slow – too many disk seeks.
We need an external sorting algorithm.

23 / 54
“External” sorting algorithm (using few disk seeks)

We must sort T = 100,000,000 non-positional postings.
Each posting has size 12 bytes (4+4+4: termID, docID, term frequency).
Define a block to consist of 10,000,000 such postings.
We can easily fit that many postings into memory.
We will have 10 such blocks for RCV1.
Basic idea of algorithm:
For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk
Then merge the blocks into one long sorted order.

24 / 54
Merging two blocks

Postings to be merged are read from disk:

Block 1      Block 2      merged postings
brutus d3    brutus d2    brutus d2
caesar d4    caesar d1    brutus d3
noble d3     julius d1    caesar d1
with d4      killed d2    caesar d4
                          julius d1
                          killed d2
                          noble d3
                          with d4

25 / 54
Blocked Sort-Based Indexing

BSBIndexConstruction()
1 n←0
2 while (all documents have not been processed)
3 do n ← n + 1
4 block ← ParseNextBlock()
5 BSBI-Invert(block)
6 WriteBlockToDisk(block, fn )
7 MergeBlocks(f1 , . . . , fn ; f merged )
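The pseudocode above can be sketched as a toy in Python (function and variable names are mine; the real algorithm writes each sorted run to disk, whereas here the runs simply stay in memory):

```python
import heapq
from itertools import groupby, islice

def bsbi(postings_stream, block_size):
    """Toy BSBI: sort fixed-size blocks of (termID, docID) postings,
    then k-way merge the sorted runs."""
    stream = iter(postings_stream)
    runs = []
    while True:
        block = list(islice(stream, block_size))   # ParseNextBlock
        if not block:
            break
        runs.append(sorted(block))                 # BSBI-Invert (+ WriteBlockToDisk)
    merged = heapq.merge(*runs)                    # MergeBlocks
    # Collapse the merged run into termID -> sorted docID postings lists.
    return {term: sorted({d for _, d in group})
            for term, group in groupby(merged, key=lambda p: p[0])}

assert bsbi([(2, 3), (3, 4), (1, 3), (2, 2), (3, 1)], 2) == \
    {1: [3], 2: [2, 3], 3: [1, 4]}
```

heapq.merge does the k-way merge with only one "head" posting per run in memory at a time, which is exactly why the disk-based version needs so few seeks.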

26 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

27 / 54
Problem with sort-based algorithm

Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
Actually, we could work with term,docID postings instead of termID,docID postings . . .
. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

28 / 54
Single-pass in-memory indexing

Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block – no
need to maintain term-termID mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in postings lists
as they occur.
With these two ideas we can generate a complete inverted
index for each block.
These separate indexes can then be merged into one big index.

29 / 54
SPIMI-Invert
SPIMI-Invert(token stream)
1 output file ← NewFile()
2 dictionary ← NewHash()
3 while (free memory available)
4 do token ← next(token stream)
5 if term(token) ∉ dictionary
6 then postings list ← AddToDictionary(dictionary ,term(token))
7 else postings list ← GetPostingsList(dictionary ,term(token))
8 if full (postings list)
9 then postings list ← DoublePostingsList(dictionary ,term(token))
10 AddToPostingsList(postings list,docID(token))
11 sorted terms ← SortTerms(dictionary )
12 WriteBlockToDisk(sorted terms,dictionary ,output file)
13 return output file
Merging of blocks is analogous to BSBI.
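A minimal Python sketch of one SPIMI block and the block merge (names are mine; memory management and the on-disk write are elided, and repeated occurrences of a term within one document are collapsed):

```python
from itertools import islice

def spimi_invert(token_stream, block_size):
    """One SPIMI block: accumulate postings lists in a hash as tokens occur -
    no sorting of postings, no global term-termID mapping."""
    dictionary = {}
    for term, doc_id in islice(token_stream, block_size):  # "while free memory"
        postings = dictionary.setdefault(term, [])         # AddToDictionary / GetPostingsList
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                        # AddToPostingsList
    return dict(sorted(dictionary.items()))                # SortTerms (+ write block)

def merge_blocks(blocks):
    # Merging of the per-block indexes, analogous to BSBI.
    index = {}
    for block in blocks:
        for term, postings in block.items():
            index.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in index.items()}

block = spimi_invert(iter([("caesar", 1), ("brutus", 1), ("caesar", 2)]), 100)
assert block == {"brutus": [1], "caesar": [1, 2]}
```

Note that Python lists already grow by doubling, which stands in for DoublePostingsList in the pseudocode.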

30 / 54
SPIMI: Compression

Compression makes SPIMI even more efficient.


Compression of terms
Compression of postings
See next lecture

31 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

32 / 54
Distributed indexing

For web-scale indexing (don’t try this at home!): must use a distributed computer cluster.
Individual machines are fault-prone.
Can unpredictably slow down or fail.
How do we exploit such a pool of machines?

33 / 54
Google data centers (2007 estimates; Gartner)

Google data centers mainly contain commodity machines.
Data centers are distributed all over the world.
1 million servers, 3 million processors/cores
Google installs 100,000 servers each quarter.
Based on expenditures of 200–250 million dollars per year
This would be 10% of the computing capacity of the world!
If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system (assuming it does not tolerate failures)?
Answer: 37%
Suppose a server will fail after 3 years. For an installation of 1 million servers, what is the interval between machine failures?
Answer: less than two minutes
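Both answers can be checked with two lines of arithmetic:

```python
import math

# 1000 nodes, each up 99.9% of the time, system fails if any node fails:
system_uptime = 0.999 ** 1000
assert abs(system_uptime - math.exp(-1)) < 0.001   # ~0.368, i.e. ~37%

# 1,000,000 servers, each failing once every 3 years:
minutes_between_failures = 3 * 365 * 24 * 60 / 1_000_000
assert minutes_between_failures < 2                # ~1.6 minutes
```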

34 / 54
Distributed indexing

Maintain a master machine directing the indexing job – considered “safe”.
Break up indexing into sets of parallel tasks.
Master machine assigns each task to an idle machine from a pool.

35 / 54
Parallel tasks

We will define two sets of parallel tasks and deploy two types
of machines to solve them:
Parsers
Inverters
Break the input document collection into splits (corresponding
to blocks in BSBI/SPIMI)
Each split is a subset of documents.

36 / 54
Parsers

Master assigns a split to an idle parser machine.
Parser reads a document at a time and emits (term,docID) pairs.
Parser writes pairs into j term-partitions.
Each for a range of terms’ first letters
E.g., a-f, g-p, q-z (here: j = 3)

37 / 54
Inverters

An inverter collects all (term,docID) pairs (= postings) for one term-partition (e.g., for a-f).
Sorts and writes to postings lists

38 / 54
Data flow

[Figure: MapReduce data flow. The master assigns splits to parsers and term-partitions to inverters. Map phase: each parser reads its split and writes (term,docID) pairs into segment files, partitioned by term range (a-f, g-p, q-z). Reduce phase: each inverter reads one term range from all segment files and writes the postings.]

39 / 54
MapReduce

The index construction algorithm we just described is an instance of MapReduce.
MapReduce is a robust and conceptually simple framework for distributed computing . . .
. . . without having to write code for the distribution part.
The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.
Index construction was just one phase.
Another phase: transform term-partitioned into document-partitioned index.
Why might a document-partitioned index be preferable?

40 / 54
Index construction in MapReduce
Schema of map and reduce functions
map: input → list(k, v)
reduce: (k, list(v)) → output

Instantiation of the schema for index construction

map: web collection → list(termID, docID)
reduce: (⟨termID1, list(docID)⟩, ⟨termID2, list(docID)⟩, . . . ) → (postings_list1, postings_list2, . . . )

Example for index construction

map: d2 : C died. d1 : C came, C c’ed. → (⟨C, d2⟩, ⟨died, d2⟩, ⟨C, d1⟩, ⟨came, d1⟩, ⟨C, d1⟩, ⟨c’ed, d1⟩)
reduce: (⟨C, (d2, d1, d1)⟩, ⟨died, (d2)⟩, ⟨came, (d1)⟩, ⟨c’ed, (d1)⟩) → (⟨C, (d1:2, d2:1)⟩, ⟨died, (d2:1)⟩, ⟨came, (d1:1)⟩, ⟨c’ed, (d1:1)⟩)
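The schema can be run as a toy in a single process (the tokenizer and the in-memory "shuffle" dict are stand-ins for the distributed machinery):

```python
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    # map: document -> list of (term, docID) pairs
    return [(token.strip(".,").lower(), doc_id) for token in text.split()]

def reduce_phase(doc_ids):
    # reduce: list of docIDs for one term -> postings with term frequencies
    return sorted(Counter(doc_ids).items())

docs = {"d1": "C came, C c'ed.", "d2": "C died."}
grouped = defaultdict(list)
for doc_id, text in docs.items():          # shuffle: group pairs by term
    for term, d in map_phase(doc_id, text):
        grouped[term].append(d)
index = {term: reduce_phase(ds) for term, ds in grouped.items()}
assert index["c"] == [("d1", 2), ("d2", 1)]   # matches the slide's C postings
```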

41 / 54
Exercise

What information does the task description contain that the master gives to a parser?
What information does the parser report back to the master upon completion of the task?
What information does the task description contain that the master gives to an inverter?
What information does the inverter report back to the master upon completion of the task?

42 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

43 / 54
Dynamic indexing

Up to now, we have assumed that collections are static.
They rarely are: Documents are inserted, deleted and modified.
This means that the dictionary and postings lists have to be dynamically modified.

44 / 54
Dynamic indexing: Simplest approach

Maintain big main index on disk.
New docs go into small auxiliary index in memory.
Search across both, merge results
Periodically, merge auxiliary index into big index
Deletions:
Invalidation bit-vector for deleted docs
Filter docs returned by index using this bit-vector
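The simplest approach above fits in a short sketch (class and method names are mine; a set stands in for the invalidation bit-vector):

```python
class SimpleDynamicIndex:
    """Big main index ("on disk"), small auxiliary index (in memory),
    invalidation set for deletions."""

    def __init__(self, main_index):
        self.main = main_index      # term -> sorted docID list
        self.aux = {}               # new documents go here
        self.deleted = set()        # "invalidation bit-vector"

    def add(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search across both indexes, filter deleted docs, merge results.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

    def merge_aux_into_main(self):
        # Periodic merge of the auxiliary index into the big index.
        for term, docs in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, []) + docs))
        self.aux = {}

idx = SimpleDynamicIndex({"caesar": [1, 2]})
idx.add(3, ["caesar", "brutus"])
idx.delete(2)
assert idx.search("caesar") == [1, 3]
```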

45 / 54
Issue with auxiliary and main index

Frequent merges
Poor search performance during index merge

46 / 54
Logarithmic merge

Logarithmic merging amortizes the cost of merging indexes over time.
→ Users see smaller effect on response times.
Maintain a series of indexes, each twice as large as the previous one.
Keep smallest (Z0) in memory
Larger ones (I0, I1, . . . ) on disk
If Z0 gets too big (> n), write to disk as I0 . . .
. . . or merge with I0 (if I0 already exists) and write merger to I1, etc.

47 / 54
LMergeAddToken(indexes, Z0, token)
1 Z0 ← Merge(Z0, {token})
2 if |Z0| = n
3 then for i ← 0 to ∞
4 do if Ii ∈ indexes
5 then Zi+1 ← Merge(Ii, Zi)
6 (Zi+1 is a temporary index on disk.)
7 indexes ← indexes − {Ii}
8 else Ii ← Zi (Zi becomes the permanent index Ii.)
9 indexes ← indexes ∪ {Ii}
10 Break
11 Z0 ← ∅

LogarithmicMerge()
1 Z0 ← ∅ (Z0 is the in-memory index.)
2 indexes ← ∅
3 while true
4 do LMergeAddToken(indexes, Z0 , getNextToken())
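The cascade in LMergeAddToken can be sketched in Python (a dict from level to sorted posting list stands in for the on-disk indexes I0, I1, . . .):

```python
def lmerge_add_token(indexes, z0, token, n):
    """Sketch of LMergeAddToken: indexes maps level i -> contents of I_i."""
    z0.append(token)                        # Z0 <- Merge(Z0, {token})
    if len(z0) == n:
        z = sorted(z0)
        i = 0
        while i in indexes:                 # I_i exists: keep cascading up
            z = sorted(indexes.pop(i) + z)  # Z_{i+1} <- Merge(I_i, Z_i)
            i += 1
        indexes[i] = z                      # Z_i becomes permanent index I_i
        z0.clear()                          # Z0 <- empty

indexes, z0 = {}, []
for t in range(8):
    lmerge_add_token(indexes, z0, t, n=2)
# After 8 tokens with n=2 everything ends up in one index at level 2,
# mirroring the binary-counter analogy on the next slide (1000 in binary).
assert sorted(indexes) == [2] and len(indexes[2]) == 8 and z0 == []
```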

48 / 54
Binary numbers: I3 I2 I1 I0 = 2^3 2^2 2^1 2^0

0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100

49 / 54
Logarithmic merge

Number of indexes bounded by O(log T) (T is total number of postings read so far).
So query processing requires the merging of O(log T) indexes.
Time complexity of index construction is O(T log T) . . .
. . . because each of T postings is merged O(log T) times.
Auxiliary index: index construction time is O(T^2) as each posting is touched in each merge.
Suppose auxiliary index has size a:
a + 2a + 3a + 4a + . . . + na = a · n(n+1)/2 = O(n^2)
So logarithmic merging is an order of magnitude more efficient.

50 / 54
Dynamic indexing at large search engines

Often a combination:
Frequent incremental changes
Rotation of large parts of the index that can then be swapped in
Occasional complete rebuild (becomes harder with increasing size – not clear if Google can do a complete rebuild)

51 / 54
Building positional indexes

Basically the same problem except that the intermediate data structures are large.

52 / 54
Take-away

Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)
Distributed index construction: MapReduce
Dynamic index construction: how to keep the index up-to-date as the collection changes
53 / 54
Resources

Chapter 4 of IIR
Resources at http://cislmu.org
Original publication on MapReduce by Dean and Ghemawat
(2004)
Original publication on SPIMI by Heinz and Zobel (2003)
YouTube video: Google data centers

54 / 54
