Index construction
http://informationretrieval.org
Hinrich Schütze
2014-04-16
1 / 54
Overview
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
2 / 54
Outline
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
3 / 54
Dictionary as array of fixed-width entries
4 / 54
B-tree for looking up entries in array
5 / 54
Wildcard queries using a permuterm index
Queries:
For X, look up X$
For X*, look up $X*
For *X, look up X$*
For *X*, look up X*
For X*Y, look up Y$X*
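The rotation rules above can be sketched in Python. This is a toy illustration (names like build_permuterm and wildcard_lookup are mine, not the book's); for brevity the prefix lookup scans the whole rotation dictionary, whereas a real implementation would use a B-tree over the rotated keys.

```python
def rotations(term):
    """All rotations of term + end marker '$'; each maps back to the term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocabulary):
    """Permuterm index: every rotation of every term points to the term."""
    index = {}
    for term in vocabulary:
        for rot in rotations(term):
            index[rot] = term
    return index

def wildcard_lookup(index, query):
    """Handle a single-* query X*Y by rotating it to Y$X* (prefix query)."""
    x, y = query.split("*")
    key = y + "$" + x                     # Y$X* : the * is now at the end
    return sorted({t for rot, t in index.items() if rot.startswith(key)})

index = build_permuterm(["hello", "help", "hall"])
print(wildcard_lookup(index, "h*l"))      # terms starting with h, ending in l
```

For "h*l" the rotated key is "l$h", which matches the rotation "l$hal" of "hall$".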
6 / 54
k-gram indexes for spelling correction: bordroom
7 / 54
Levenshtein distance for spelling correction
LevenshteinDistance(s1, s2)
1 for i ← 0 to |s1|
2 do m[i, 0] = i
3 for j ← 0 to |s2|
4 do m[0, j] = j
5 for i ← 1 to |s1|
6 do for j ← 1 to |s2|
7 do if s1[i] = s2[j]
8 then m[i, j] = min{m[i − 1, j] + 1, m[i, j − 1] + 1, m[i − 1, j − 1]}
9 else m[i, j] = min{m[i − 1, j] + 1, m[i, j − 1] + 1, m[i − 1, j − 1] + 1}
10 return m[|s1|, |s2|]
Operations: insert, delete, replace, copy
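A direct Python transcription of the pseudocode (a sketch; Python strings are 0-indexed, so s1[i-1] plays the role of s1[i]):

```python
def levenshtein(s1, s2):
    """Dynamic-programming edit distance, following the pseudocode above."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        m[i][0] = i                               # delete all of s1[:i]
    for j in range(len(s2) + 1):
        m[0][j] = j                               # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            copy = 0 if s1[i - 1] == s2[j - 1] else 1   # copy free, replace 1
            m[i][j] = min(m[i - 1][j] + 1,              # delete
                          m[i][j - 1] + 1,              # insert
                          m[i - 1][j - 1] + copy)       # copy / replace
    return m[len(s1)][len(s2)]

print(levenshtein("kitten", "sitting"))   # 3
```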
8 / 54
Exercise: Understand Peter Norvig’s spelling corrector
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1)
               if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word)) or
                  known_edits2(word) or [word])
    return max(candidates, key=NWORDS.get)
9 / 54
Take-away
10 / 54
Outline
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
11 / 54
Hardware basics
12 / 54
Hardware basics
13 / 54
Some stats (ca. 2008)
symbol  statistic                                           value
s       average seek time                                   5 ms = 5 × 10^-3 s
b       transfer time per byte                              0.02 μs = 2 × 10^-8 s
        processor's clock rate                              10^9 s^-1
p       low-level operation (e.g., compare & swap a word)   0.01 μs = 10^-8 s
        size of main memory                                 several GB
        size of disk space                                  1 TB or more
14 / 54
RCV1 collection
15 / 54
A Reuters RCV1 document
16 / 54
Reuters RCV1 statistics
N documents 800,000
L tokens per document 200
M terms (= word types) 400,000
bytes per token (incl. spaces/punct.) 6
bytes per token (without spaces/punct.) 4.5
bytes per term (= word type) 7.5
T non-positional postings 100,000,000
Exercise: Average frequency of a term (how many tokens)?
4.5 bytes per word token vs. 7.5 bytes per word type: why the difference?
How many positional postings?
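One way to check the first and last questions with quick arithmetic (a sketch; the derived figures follow directly from the table above):

```python
# RCV1 statistics from the table above
N = 800_000        # documents
L = 200            # tokens per document
M = 400_000        # terms (word types)
T = 100_000_000    # non-positional postings

tokens = N * L                 # total tokens in the collection: 160,000,000
print(tokens // M)             # average tokens per term: 400
print(T / M)                   # average documents per term: 250.0
print(tokens)                  # positional postings: roughly one per token
```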
17 / 54
Exercise
18 / 54
Outline
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
19 / 54
Goal: construct the inverted index
Calpurnia −→ 2 31 54 101
...
(dictionary)   (postings)
20 / 54
Index construction in IIR 1: Sort postings in memory
term docID term docID
I 1 ambitious 2
did 1 be 2
enact 1 brutus 1
julius 1 brutus 2
caesar 1 capitol 1
I 1 caesar 1
was 1 caesar 2
killed 1 caesar 2
i’ 1 did 1
the 1 enact 1
capitol 1 hath 1
brutus 1 I 1
killed 1 I 1
me 1 i’ 1
so 2 =⇒ it 2
let 2 julius 1
it 2 killed 1
be 2 killed 1
with 2 let 2
caesar 2 me 1
the 2 noble 2
noble 2 so 2
brutus 2 the 1
hath 2 the 2
told 2 told 2
you 2 you 2
caesar 2 was 1
was 2 was 2
ambitious 2 with 2
21 / 54
Sort-based index construction
22 / 54
Same algorithm for disk?
23 / 54
“External” sorting algorithm (using few disk seeks)
24 / 54
Merging two blocks
postings
to be merged brutus d2
brutus d3
Block 1 Block 2
caesar d1
brutus d3 brutus d2
caesar d4 merged
caesar d4 caesar d1
julius d1 postings
noble d3 julius d1
killed d2
with d4 killed d2
noble d3
with d4
disk
25 / 54
Blocked Sort-Based Indexing
BSBIndexConstruction()
1 n←0
2 while (all documents have not been processed)
3 do n ← n + 1
4 block ← ParseNextBlock()
5 BSBI-Invert(block)
6 WriteBlockToDisk(block, fn )
7 MergeBlocks(f1 , . . . , fn ; f merged )
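A minimal in-memory sketch of BSBI (function names are mine; Python lists stand in for the on-disk files f1, …, fn, and heapq.merge performs the n-way merge of sorted runs):

```python
import heapq

def bsbi_invert(block):
    """BSBI-Invert: sort a block of (termID, docID) pairs."""
    return sorted(block)

def merge_blocks(sorted_runs):
    """MergeBlocks: n-way merge of sorted runs into postings lists."""
    index = {}
    for term_id, doc_id in heapq.merge(*sorted_runs):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:   # skip duplicate docIDs
            postings.append(doc_id)
    return index

blocks = [[(2, 1), (1, 1), (2, 2)],      # ParseNextBlock output, block 1
          [(1, 3), (3, 2)]]              # block 2
runs = [bsbi_invert(b) for b in blocks]  # WriteBlockToDisk stands in
index = merge_blocks(runs)
print(index)    # {1: [1, 3], 2: [1, 2], 3: [2]}
```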
26 / 54
Outline
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
27 / 54
Problem with sort-based algorithm
28 / 54
Single-pass in-memory indexing
Abbreviation: SPIMI
Key idea 1: Generate separate dictionaries for each block – no
need to maintain term-termID mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in postings lists
as they occur.
With these two ideas we can generate a complete inverted
index for each block.
These separate indexes can then be merged into one big index.
29 / 54
SPIMI-Invert
SPIMI-Invert(token stream)
1 output file ← NewFile()
2 dictionary ← NewHash()
3 while (free memory available)
4 do token ← next(token stream)
5 if term(token) ∉ dictionary
6 then postings list ← AddToDictionary(dictionary ,term(token))
7 else postings list ← GetPostingsList(dictionary ,term(token))
8 if full (postings list)
9 then postings list ← DoublePostingsList(dictionary ,term(token))
10 AddToPostingsList(postings list,docID(token))
11 sorted terms ← SortTerms(dictionary )
12 WriteBlockToDisk(sorted terms,dictionary ,output file)
13 return output file
Merging of blocks is analogous to BSBI.
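A compact Python sketch of one SPIMI pass (names are mine; a token count stands in for the "free memory available" test, and Python lists grow dynamically, so DoublePostingsList is implicit):

```python
def spimi_invert(token_stream, max_tokens=1_000_000):
    """One SPIMI pass: accumulate postings per term until 'memory' runs out."""
    dictionary = {}                                   # term -> postings list
    for count, (term, doc_id) in enumerate(token_stream):
        if count >= max_tokens:                       # "free memory" exhausted
            break
        postings = dictionary.setdefault(term, [])    # add term or fetch list
        postings.append(doc_id)                       # no sorting of postings
    return dict(sorted(dictionary.items()))           # SortTerms before writing

tokens = [("caesar", 1), ("brutus", 1), ("caesar", 2)]
block = spimi_invert(tokens)
print(block)    # {'brutus': [1], 'caesar': [1, 2]}
```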
30 / 54
SPIMI: Compression
31 / 54
Outline
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
32 / 54
Distributed indexing
33 / 54
Google data centers (2007 estimates; Gartner)
34 / 54
Distributed indexing
35 / 54
Parallel tasks
We will define two sets of parallel tasks and deploy two types of machines to solve them:
Parsers
Inverters
Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI)
Each split is a subset of documents.
36 / 54
Parsers
37 / 54
Inverters
38 / 54
Data flow
map phase −→ segment files −→ reduce phase
39 / 54
MapReduce
40 / 54
Index construction in MapReduce
Schema of map and reduce functions
map: input → list(k, v)
reduce: (k, list(v)) → output
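For index construction, map emits (term, docID) pairs and reduce turns each term's pair list into a postings list. A toy single-machine sketch of this schema (names are mine; a real MapReduce partitions the segment files across many inverter machines):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map: document -> list of (term, docID) pairs."""
    return [(term, doc_id) for term in text.lower().split()]

def reduce_fn(term, doc_ids):
    """reduce: (term, list of docIDs) -> (term, postings list)."""
    return term, sorted(set(doc_ids))

def map_reduce(docs):
    segments = defaultdict(list)                 # stands in for segment files
    for doc_id, text in docs:
        for term, d in map_fn(doc_id, text):
            segments[term].append(d)             # partition pairs by key (term)
    return dict(reduce_fn(t, ds) for t, ds in segments.items())

docs = [(1, "caesar died"), (2, "caesar was killed")]
print(map_reduce(docs))
# {'caesar': [1, 2], 'died': [1], 'was': [2], 'killed': [2]}
```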
41 / 54
Exercise
42 / 54
Outline
1 Recap
2 Introduction
3 BSBI algorithm
4 SPIMI algorithm
5 Distributed indexing
6 Dynamic indexing
43 / 54
Dynamic indexing
44 / 54
Dynamic indexing: Simplest approach
45 / 54
Issue with auxiliary and main index
Frequent merges
Poor search performance during index merge
46 / 54
Logarithmic merge
47 / 54
LMergeAddToken(indexes, Z0, token)
1 Z0 ← Merge(Z0, {token})
2 if |Z0| = n
3 then for i ← 0 to ∞
4 do if Ii ∈ indexes
5 then Zi+1 ← Merge(Ii, Zi)
6 (Zi+1 is a temporary index on disk.)
7 indexes ← indexes − {Ii}
8 else Ii ← Zi (Zi becomes the permanent index Ii.)
9 indexes ← indexes ∪ {Ii}
10 Break
11 Z0 ← ∅
LogarithmicMerge()
1 Z0 ← ∅ (Z0 is the in-memory index.)
2 indexes ← ∅
3 while true
4 do LMergeAddToken(indexes, Z0, getNextToken())
LogarithmicMerge()
1 Z0 ← ∅ (Z0 is the in-memory index.)
2 indexes ← ∅
3 while true
4 do LMergeAddToken(indexes, Z0 , getNextToken())
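A compact sketch of the cascade in LMergeAddToken (names are mine; sorted Python lists stand in for the on-disk indexes Ii, and n = 4 is an arbitrary small threshold):

```python
def lmerge_add_token(indexes, Z0, token, n=4):
    """indexes maps level i -> index I_i (a sorted list); Z0 is in memory."""
    Z0 = sorted(Z0 + [token])            # Z0 <- Merge(Z0, {token})
    if len(Z0) < n:
        return indexes, Z0
    Z, i = Z0, 0
    while i in indexes:                  # I_i exists: merge, go one level up
        Z = sorted(indexes.pop(i) + Z)   # Z_{i+1} <- Merge(I_i, Z_i)
        i += 1
    indexes[i] = Z                       # Z_i becomes the permanent index I_i
    return indexes, []                   # reset the in-memory index

indexes, Z0 = {}, []
for t in range(12):                      # 12 tokens, n = 4: counting in binary
    indexes, Z0 = lmerge_add_token(indexes, Z0, t)
print(sorted((i, len(I)) for i, I in indexes.items()))   # [(0, 4), (1, 8)]
```

After 12 tokens the occupied levels spell 0110 in binary (a size-8 index at level 1 and a size-4 index at level 0), matching the table on the next slide.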
48 / 54
Binary numbers: I3 I2 I1 I0 = 2^3 2^2 2^1 2^0
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
49 / 54
Logarithmic merge
50 / 54
Dynamic indexing at large search engines
Often a combination
Frequent incremental changes
Rotation of large parts of the index that can then be swapped in
Occasional complete rebuild (becomes harder with increasing size – not clear if Google can do a complete rebuild)
51 / 54
Building positional indexes
52 / 54
Take-away
53 / 54
Resources
Chapter 4 of IIR
Resources at http://cislmu.org
Original publication on MapReduce by Dean and Ghemawat
(2004)
Original publication on SPIMI by Heinz and Zobel (2003)
YouTube video: Google data centers
54 / 54