The document is an introduction to index construction for information retrieval. It gives an overview of the blocked sort-based indexing (BSBI) and single-pass in-memory indexing (SPIMI) algorithms for constructing indexes, and discusses distributed and dynamic indexing. It uses the Reuters RCV1 collection as an example dataset, provides statistics about its size and structure, and discusses why simple sort-based indexing does not scale to very large collections due to disk I/O performance.

Introduction to Information Retrieval

http://informationretrieval.org

IIR 4: Index Construction

Hinrich Schütze

Center for Information and Language Processing, University of Munich

2014-04-16

1 / 54
Overview

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

2 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

3 / 54
Dictionary as array of fixed-width entries

term       document    pointer to
           frequency   postings list
a          656,265     −→
aachen     65          −→
...        ...         ...
zulu       221         −→

space needed: 20 bytes | 4 bytes | 4 bytes
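The fixed-width layout above can be sketched with Python's struct module (the field sizes are from the slide; using a file offset as the "pointer to postings list" is my assumption for illustration):

```python
import struct

# 20-byte term + 4-byte document frequency + 4-byte pointer = 28 bytes/entry.
# "<" forces standard sizes with no alignment padding.
ENTRY = struct.Struct("<20sII")

def pack_entry(term, df, postings_offset):
    # struct pads the 20s field with null bytes (and truncates longer terms)
    return ENTRY.pack(term.encode("ascii"), df, postings_offset)

def unpack_entry(buf):
    term, df, offset = ENTRY.unpack(buf)
    return term.rstrip(b"\0").decode("ascii"), df, offset

assert ENTRY.size == 28
assert unpack_entry(pack_entry("aachen", 65, 4096)) == ("aachen", 65, 4096)
```

A sorted array of such entries supports binary search by term, which is what the B-tree on the next slide generalizes.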

4 / 54
B-tree for looking up entries in array

5 / 54
Wildcard queries using a permuterm index

Queries:
For X, look up X$
For X*, look up $X*
For *X, look up X$*
For *X*, look up X*
For X*Y, look up
Y$X*
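The rotation rules above can be sketched in Python (helper names are mine; a real permuterm index would use a B-tree over the rotations rather than a scan):

```python
def rotations(term):
    # Index every rotation of term + "$".
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    """Map a wildcard query to its permuterm lookup key (slide rules)."""
    if "*" not in query:
        return query + "$"                 # X    -> X$
    if query.startswith("*") and query.endswith("*"):
        return query[1:-1] + "*"           # *X*  -> X*
    head, tail = query.split("*")
    if tail == "":
        return "$" + head + "*"            # X*   -> $X*
    if head == "":
        return tail + "$*"                 # *X   -> X$*
    return tail + "$" + head + "*"         # X*Y  -> Y$X*

def matches(query, vocabulary):
    key = rotate_query(query)
    if key.endswith("*"):                  # trailing * = prefix lookup
        prefix = key[:-1]
        return {t for t in vocabulary
                if any(r.startswith(prefix) for r in rotations(t))}
    return {t for t in vocabulary if key in rotations(t)}

assert matches("hel*", {"hello", "help", "ahelp"}) == {"hello", "help"}
assert matches("*lo", {"hello", "help"}) == {"hello"}
assert matches("h*o", {"hello", "help"}) == {"hello"}
```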

6 / 54
k-gram indexes for spelling correction: bordroom

bo ✲ aboard ✲ about ✲ boardroom ✲ border

or ✲ border ✲ lord ✲ morbid ✲ sordid

rd ✲ aboard ✲ ardent ✲ boardroom ✲ border

7 / 54
Levenshtein distance for spelling correction
LevenshteinDistance(s1 , s2 )
1 for i ← 0 to |s1 |
2 do m[i , 0] = i
3 for j ← 0 to |s2 |
4 do m[0, j] = j
5 for i ← 1 to |s1 |
6 do for j ← 1 to |s2 |
7 do if s1 [i ] = s2 [j]
8 then m[i , j] = min{m[i − 1, j] + 1, m[i , j − 1] + 1, m[i − 1, j − 1]}
9 else m[i , j] = min{m[i − 1, j] + 1, m[i , j − 1] + 1, m[i − 1, j − 1] + 1}
10 return m[|s1 |, |s2 |]
Operations: insert, delete, replace, copy
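The pseudocode above translates directly into runnable Python (0-based indexing shifts the string subscripts by one; copy is free, the other operations cost 1):

```python
def levenshtein(s1, s2):
    # m[i][j] = edit distance between s1[:i] and s2[:j]
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        m[i][0] = i                                   # i deletions
    for j in range(len(s2) + 1):
        m[0][j] = j                                   # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            copy = 0 if s1[i - 1] == s2[j - 1] else 1  # copy free, replace costs 1
            m[i][j] = min(m[i - 1][j] + 1,             # delete
                          m[i][j - 1] + 1,             # insert
                          m[i - 1][j - 1] + copy)      # copy / replace
    return m[len(s1)][len(s2)]

assert levenshtein("bordroom", "boardroom") == 1   # one insertion fixes the typo
```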

8 / 54
Exercise: Understand Peter Norvig’s spelling corrector
import re, collections

def words(text): return re.findall(r'[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word)
               for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word))
                  or known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)

9 / 54
Take-away

Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)
Distributed index construction: MapReduce
Dynamic index construction: how to keep the index up-to-date as the collection changes

10 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

11 / 54
Hardware basics

Many design decisions in information retrieval are based on hardware constraints.
We begin by reviewing hardware basics that we’ll need in this course.

12 / 54
Hardware basics

Access to data is much faster in memory than on disk. (roughly a factor of 10)
Disk seeks are “idle” time: No data is transferred from disk while the disk head is being positioned.
To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.
Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8 KB to 256 KB
Servers used in IR systems typically have many GBs of main memory and TBs of disk space.
Fault tolerance is expensive: It’s cheaper to use many regular machines than one fault-tolerant machine.

13 / 54
Some stats (ca. 2008)
symbol   statistic                                           value
s        average seek time                                   5 ms = 5 × 10^-3 s
b        transfer time per byte                              0.02 µs = 2 × 10^-8 s
         processor’s clock rate                              10^9 s^-1
p        lowlevel operation (e.g., compare & swap a word)    0.01 µs = 10^-8 s
         size of main memory                                 several GB
         size of disk space                                  1 TB or more

14 / 54
RCV1 collection

Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.
As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
English newswire articles sent over the wire in 1995 and 1996 (one year).

15 / 54
A Reuters RCV1 document

16 / 54
Reuters RCV1 statistics

N documents 800,000
L tokens per document 200
M terms (= word types) 400,000
bytes per token (incl. spaces/punct.) 6
bytes per token (without spaces/punct.) 4.5
bytes per term (= word type) 7.5
T non-positional postings 100,000,000
Exercise: Average frequency of a term (how many tokens)?
4.5 bytes per word token vs. 7.5 bytes per word type: why the difference?
How many positional postings?
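The exercise numbers follow directly from the table; a quick check of the arithmetic (variable names mirror the table's symbols):

```python
N, L, M, T = 800_000, 200, 400_000, 100_000_000  # docs, tokens/doc, terms, postings

tokens = N * L                 # word tokens in the collection
assert tokens == 160_000_000   # = number of positional postings (one per token)
assert tokens / M == 400.0     # average frequency of a term, in tokens
assert T / M == 250.0          # average number of docs a term occurs in
```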

17 / 54
Exercise

Why does this algorithm not scale to very large collections?

18 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

19 / 54
Goal: construct the inverted index

Brutus −→ 1 2 4 11 31 45 173 174

Caesar −→ 1 2 4 5 6 16 57 132 ...

Calpurnia −→ 2 31 54 101

...

(dictionary)   (postings)

20 / 54
Index construction in IIR 1: Sort postings in memory
term docID term docID
I 1 ambitious 2
did 1 be 2
enact 1 brutus 1
julius 1 brutus 2
caesar 1 capitol 1
I 1 caesar 1
was 1 caesar 2
killed 1 caesar 2
i’ 1 did 1
the 1 enact 1
capitol 1 hath 1
brutus 1 I 1
killed 1 I 1
me 1 i’ 1
so 2 =⇒ it 2
let 2 julius 1
it 2 killed 1
be 2 killed 1
with 2 let 2
caesar 2 me 1
the 2 noble 2
noble 2 so 2
brutus 2 the 1
hath 2 the 2
told 2 told 2
you 2 you 2
caesar 2 was 1
was 2 was 2
ambitious 2 with 2

21 / 54
Sort-based index construction

As we build the index, we parse docs one at a time.
The final postings for any term are incomplete until the end.
Can we keep all postings in memory and then do the sort in-memory at the end?
No, not for large collections.
Thus: We need to store intermediate results on disk.

22 / 54
Same algorithm for disk?

Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
No: Sorting very large sets of records on disk is too slow – too many disk seeks.
We need an external sorting algorithm.

23 / 54
“External” sorting algorithm (using few disk seeks)

We must sort T = 100,000,000 non-positional postings.
Each posting has size 12 bytes (4+4+4: termID, docID, term frequency).
Define a block to consist of 10,000,000 such postings.
We can easily fit that many postings into memory.
We will have 10 such blocks for RCV1.
Basic idea of algorithm:
For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk
Then merge the blocks into one long sorted order.

24 / 54
Merging two blocks

Postings to be merged are read from disk:

Block 1      Block 2      merged postings
brutus d3    brutus d2    brutus d2
caesar d4    caesar d1    brutus d3
noble d3     julius d1    caesar d1
with d4      killed d2    caesar d4
                          julius d1
                          killed d2
                          noble d3
                          with d4

25 / 54
Blocked Sort-Based Indexing

BSBIndexConstruction()
1 n←0
2 while (all documents have not been processed)
3 do n ← n + 1
4 block ← ParseNextBlock()
5 BSBI-Invert(block)
6 WriteBlockToDisk(block, fn )
7 MergeBlocks(f1 , . . . , fn ; f merged )
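The pseudocode above can be sketched as a toy in Python (function and variable names are mine; the real algorithm writes each sorted run to disk, whereas here the runs simply stay in memory):

```python
import heapq
from itertools import groupby, islice

def bsbi(postings_stream, block_size):
    """Toy BSBI: sort fixed-size blocks of (termID, docID) postings,
    then k-way merge the sorted runs."""
    stream = iter(postings_stream)
    runs = []
    while True:
        block = list(islice(stream, block_size))   # ParseNextBlock
        if not block:
            break
        runs.append(sorted(block))                 # BSBI-Invert (+ WriteBlockToDisk)
    merged = heapq.merge(*runs)                    # MergeBlocks
    # Collapse the merged run into termID -> sorted docID postings lists.
    return {term: sorted({d for _, d in group})
            for term, group in groupby(merged, key=lambda p: p[0])}

assert bsbi([(2, 3), (3, 4), (1, 3), (2, 2), (3, 1)], 2) == \
    {1: [3], 2: [2, 3], 3: [1, 4]}
```

heapq.merge does the k-way merge with only one "head" posting per run in memory at a time, which is exactly why the disk-based version needs so few seeks.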

26 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

27 / 54
Problem with sort-based algorithm

Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
Actually, we could work with term,docID postings instead of termID,docID postings . . .
. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

28 / 54
Single-pass in-memory indexing

Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block – no
need to maintain term-termID mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in postings lists
as they occur.
With these two ideas we can generate a complete inverted
index for each block.
These separate indexes can then be merged into one big index.

29 / 54
SPIMI-Invert
SPIMI-Invert(token stream)
1 output file ← NewFile()
2 dictionary ← NewHash()
3 while (free memory available)
4 do token ← next(token stream)
5 if term(token) ∉ dictionary
6 then postings list ← AddToDictionary(dictionary ,term(token))
7 else postings list ← GetPostingsList(dictionary ,term(token))
8 if full (postings list)
9 then postings list ← DoublePostingsList(dictionary ,term(token))
10 AddToPostingsList(postings list,docID(token))
11 sorted terms ← SortTerms(dictionary )
12 WriteBlockToDisk(sorted terms,dictionary ,output file)
13 return output file
Merging of blocks is analogous to BSBI.
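A minimal Python sketch of one SPIMI block and the block merge (names are mine; memory management and the on-disk write are elided, and repeated occurrences of a term within one document are collapsed):

```python
from itertools import islice

def spimi_invert(token_stream, block_size):
    """One SPIMI block: accumulate postings lists in a hash as tokens occur -
    no sorting of postings, no global term-termID mapping."""
    dictionary = {}
    for term, doc_id in islice(token_stream, block_size):  # "while free memory"
        postings = dictionary.setdefault(term, [])         # AddToDictionary / GetPostingsList
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                        # AddToPostingsList
    return dict(sorted(dictionary.items()))                # SortTerms (+ write block)

def merge_blocks(blocks):
    # Merging of the per-block indexes, analogous to BSBI.
    index = {}
    for block in blocks:
        for term, postings in block.items():
            index.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in index.items()}

block = spimi_invert(iter([("caesar", 1), ("brutus", 1), ("caesar", 2)]), 100)
assert block == {"brutus": [1], "caesar": [1, 2]}
```

Note that Python lists already grow by doubling, which stands in for DoublePostingsList in the pseudocode.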

30 / 54
SPIMI: Compression

Compression makes SPIMI even more efficient.


Compression of terms
Compression of postings
See next lecture

31 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

32 / 54
Distributed indexing

For web-scale indexing (don’t try this at home!): must use a distributed computer cluster.
Individual machines are fault-prone.
Can unpredictably slow down or fail.
How do we exploit such a pool of machines?

33 / 54
Google data centers (2007 estimates; Gartner)

Google data centers mainly contain commodity machines.
Data centers are distributed all over the world.
1 million servers, 3 million processors/cores
Google installs 100,000 servers each quarter.
Based on expenditures of 200–250 million dollars per year
This would be 10% of the computing capacity of the world!
If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system (assuming it does not tolerate failures)?
Answer: 37%
Suppose a server will fail after 3 years. For an installation of 1 million servers, what is the interval between machine failures?
Answer: less than two minutes
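Both answers can be checked with two lines of arithmetic:

```python
import math

# 1000 nodes, each up 99.9% of the time, system fails if any node fails:
system_uptime = 0.999 ** 1000
assert abs(system_uptime - math.exp(-1)) < 0.001   # ~0.368, i.e. ~37%

# 1,000,000 servers, each failing once every 3 years:
minutes_between_failures = 3 * 365 * 24 * 60 / 1_000_000
assert minutes_between_failures < 2                # ~1.6 minutes
```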

34 / 54
Distributed indexing

Maintain a master machine directing the indexing job – considered “safe”.
Break up indexing into sets of parallel tasks.
Master machine assigns each task to an idle machine from a pool.

35 / 54
Parallel tasks

We will define two sets of parallel tasks and deploy two types
of machines to solve them:
Parsers
Inverters
Break the input document collection into splits (corresponding
to blocks in BSBI/SPIMI)
Each split is a subset of documents.

36 / 54
Parsers

Master assigns a split to an idle parser machine.
Parser reads a document at a time and emits (term,docID) pairs.
Parser writes pairs into j term-partitions.
Each for a range of terms’ first letters
E.g., a-f, g-p, q-z (here: j = 3)

37 / 54
Inverters

An inverter collects all (term,docID) pairs (= postings) for one term-partition (e.g., for a-f).
Sorts and writes to postings lists

38 / 54
Data flow

[Figure: MapReduce data flow. The master assigns splits to parsers and term-partitions to inverters. Map phase: each parser reads its split and writes (term,docID) pairs into segment files, partitioned by term range (a-f, g-p, q-z). Reduce phase: each inverter reads one term range from all segment files and writes the postings.]

39 / 54
MapReduce

The index construction algorithm we just described is an instance of MapReduce.
MapReduce is a robust and conceptually simple framework for distributed computing . . .
. . . without having to write code for the distribution part.
The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.
Index construction was just one phase.
Another phase: transform term-partitioned into document-partitioned index.
Why might a document-partitioned index be preferable?

40 / 54
Index construction in MapReduce
Schema of map and reduce functions
map: input → list(k, v)
reduce: (k, list(v)) → output

Instantiation of the schema for index construction

map: web collection → list(termID, docID)
reduce: (⟨termID1, list(docID)⟩, ⟨termID2, list(docID)⟩, . . . ) → (postings_list1, postings_list2, . . . )

Example for index construction

map: d2 : C died. d1 : C came, C c’ed. → (⟨C, d2⟩, ⟨died, d2⟩, ⟨C, d1⟩, ⟨came, d1⟩, ⟨C, d1⟩, ⟨c’ed, d1⟩)
reduce: (⟨C, (d2, d1, d1)⟩, ⟨died, (d2)⟩, ⟨came, (d1)⟩, ⟨c’ed, (d1)⟩) → (⟨C, (d1:2, d2:1)⟩, ⟨died, (d2:1)⟩, ⟨came, (d1:1)⟩, ⟨c’ed, (d1:1)⟩)
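The schema can be run as a toy in a single process (the tokenizer and the in-memory "shuffle" dict are stand-ins for the distributed machinery):

```python
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    # map: document -> list of (term, docID) pairs
    return [(token.strip(".,").lower(), doc_id) for token in text.split()]

def reduce_phase(doc_ids):
    # reduce: list of docIDs for one term -> postings with term frequencies
    return sorted(Counter(doc_ids).items())

docs = {"d1": "C came, C c'ed.", "d2": "C died."}
grouped = defaultdict(list)
for doc_id, text in docs.items():          # shuffle: group pairs by term
    for term, d in map_phase(doc_id, text):
        grouped[term].append(d)
index = {term: reduce_phase(ds) for term, ds in grouped.items()}
assert index["c"] == [("d1", 2), ("d2", 1)]   # matches the slide's C postings
```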

41 / 54
Exercise

What information does the task description contain that the master gives to a parser?
What information does the parser report back to the master upon completion of the task?
What information does the task description contain that the master gives to an inverter?
What information does the inverter report back to the master upon completion of the task?

42 / 54
Outline

1 Recap

2 Introduction

3 BSBI algorithm

4 SPIMI algorithm

5 Distributed indexing

6 Dynamic indexing

43 / 54
Dynamic indexing

Up to now, we have assumed that collections are static.
They rarely are: Documents are inserted, deleted and modified.
This means that the dictionary and postings lists have to be dynamically modified.

44 / 54
Dynamic indexing: Simplest approach

Maintain big main index on disk.
New docs go into small auxiliary index in memory.
Search across both, merge results
Periodically, merge auxiliary index into big index
Deletions:
Invalidation bit-vector for deleted docs
Filter docs returned by index using this bit-vector
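The simplest approach above fits in a short sketch (class and method names are mine; a set stands in for the invalidation bit-vector):

```python
class SimpleDynamicIndex:
    """Big main index ("on disk"), small auxiliary index (in memory),
    invalidation set for deletions."""

    def __init__(self, main_index):
        self.main = main_index      # term -> sorted docID list
        self.aux = {}               # new documents go here
        self.deleted = set()        # "invalidation bit-vector"

    def add(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)

    def search(self, term):
        # Search across both indexes, filter deleted docs, merge results.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)

    def merge_aux_into_main(self):
        # Periodic merge of the auxiliary index into the big index.
        for term, docs in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, []) + docs))
        self.aux = {}

idx = SimpleDynamicIndex({"caesar": [1, 2]})
idx.add(3, ["caesar", "brutus"])
idx.delete(2)
assert idx.search("caesar") == [1, 3]
```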

45 / 54
Issue with auxiliary and main index

Frequent merges
Poor search performance during index merge

46 / 54
Logarithmic merge

Logarithmic merging amortizes the cost of merging indexes over time.
→ Users see smaller effect on response times.
Maintain a series of indexes, each twice as large as the previous one.
Keep smallest (Z0) in memory
Larger ones (I0, I1, . . . ) on disk
If Z0 gets too big (> n), write to disk as I0 . . .
. . . or merge with I0 (if I0 already exists) and write merger to I1, etc.

47 / 54
LMergeAddToken(indexes, Z0, token)
1 Z0 ← Merge(Z0, {token})
2 if |Z0| = n
3 then for i ← 0 to ∞
4 do if Ii ∈ indexes
5 then Zi+1 ← Merge(Ii, Zi)
6 (Zi+1 is a temporary index on disk.)
7 indexes ← indexes − {Ii}
8 else Ii ← Zi (Zi becomes the permanent index Ii.)
9 indexes ← indexes ∪ {Ii}
10 Break
11 Z0 ← ∅

LogarithmicMerge()
1 Z0 ← ∅ (Z0 is the in-memory index.)
2 indexes ← ∅
3 while true
4 do LMergeAddToken(indexes, Z0 , getNextToken())
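The cascade in LMergeAddToken can be sketched in Python (a dict from level to sorted posting list stands in for the on-disk indexes I0, I1, . . .):

```python
def lmerge_add_token(indexes, z0, token, n):
    """Sketch of LMergeAddToken: indexes maps level i -> contents of I_i."""
    z0.append(token)                        # Z0 <- Merge(Z0, {token})
    if len(z0) == n:
        z = sorted(z0)
        i = 0
        while i in indexes:                 # I_i exists: keep cascading up
            z = sorted(indexes.pop(i) + z)  # Z_{i+1} <- Merge(I_i, Z_i)
            i += 1
        indexes[i] = z                      # Z_i becomes permanent index I_i
        z0.clear()                          # Z0 <- empty

indexes, z0 = {}, []
for t in range(8):
    lmerge_add_token(indexes, z0, t, n=2)
# After 8 tokens with n=2 everything ends up in one index at level 2,
# mirroring the binary-counter analogy on the next slide (1000 in binary).
assert sorted(indexes) == [2] and len(indexes[2]) == 8 and z0 == []
```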

48 / 54
Binary numbers: I3 I2 I1 I0 = 2^3 2^2 2^1 2^0

0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100

49 / 54
Logarithmic merge

Number of indexes bounded by O(log T) (T is total number of postings read so far).
So query processing requires the merging of O(log T) indexes.
Time complexity of index construction is O(T log T) . . .
. . . because each of T postings is merged O(log T) times.
Auxiliary index: index construction time is O(T^2) as each posting is touched in each merge.
Suppose auxiliary index has size a:
a + 2a + 3a + 4a + . . . + na = a · n(n+1)/2 = O(n^2)
So logarithmic merging is an order of magnitude more efficient.

50 / 54
Dynamic indexing at large search engines

Often a combination:
Frequent incremental changes
Rotation of large parts of the index that can then be swapped in
Occasional complete rebuild (becomes harder with increasing size – not clear if Google can do a complete rebuild)

51 / 54
Building positional indexes

Basically the same problem except that the intermediate data structures are large.

52 / 54
Take-away

Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)
Distributed index construction: MapReduce
Dynamic index construction: how to keep the index up-to-date as the collection changes
53 / 54
Resources

Chapter 4 of IIR
Resources at http://cislmu.org
Original publication on MapReduce by Dean and Ghemawat
(2004)
Original publication on SPIMI by Heinz and Zobel (2003)
YouTube video: Google data centers

54 / 54
