Lecture 5, Part 1
Index Construction & Compression
Information Retrieval and Analysis
Vasily Sidorov
Let’s Recall

Dictionary (Term → TermID):
friend 1, roman 2, countryman 3, lend 4, i 5, you 6, ear 7

Inverted Index (TermID, Freq., Postings List of DocIDs):
1  4  → 1 → 5 → 6 → 12
5  2  → 1 → 8
7  6  → 1 → 2 → 6 → 8 → 12 → 13

(The index can also be positional.)
Sec. 4.1
Hardware basics
• Many design decisions in information retrieval are
based on the characteristics of hardware
Sec. 4.1
Hardware basics
• Access to data in memory (RAM) is much faster
than access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
• Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
• Block sizes: 8 KB to 256 KB.
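A rough worked example (with illustrative numbers, not from the slides): assume a 5 ms average seek time and a 100 MB/s transfer rate. Reading 10 MB as one contiguous chunk costs about 5 ms + 100 ms = 105 ms, while reading the same 10 MB as 100 scattered 100 KB chunks costs about 100 × (5 ms + 1 ms) = 600 ms. The seeks dominate, not the transfer.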
Hardware basics
• Solid State Drives (SSDs) mitigate some of these
problems:
—~100 times faster access times than HDDs
—1–2 orders of magnitude faster I/O than HDDs
—Random access: (almost) no “disk seek” delay
• Still much slower than RAM
• Still reads/writes in blocks
• Still much more expensive than HDDs
Sec. 4.1
Hardware basics
• Servers used in IR systems now typically have
hundreds of GB of main memory, sometimes
several TB
Sec. 4.2
4.5 bytes per token vs. 7.5 bytes per term: why?
(Frequent words tend to be short, and they dominate the token count; the dictionary counts each distinct term only once.)
Sec. 4.2
Bottleneck
• Parse and build postings entries one doc at a time
• Now sort postings entries by term (then by doc
within each term)
• Doing this with random disk seeks would be too
slow – must sort T=100M records
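A minimal sketch of the sort phase (BSBI-style), assuming postings arrive as (termID, docID) pairs and one block fits in memory; the function name and file layout are illustrative:

```python
import os
import tempfile

def write_sorted_run(postings):
    """Sort one in-memory block of (termID, docID) pairs and spill it
    to disk as a sorted run. Sorting happens in RAM, so the disk only
    sees sequential writes - no random seeks over all T = 100M records."""
    postings.sort()  # by termID, then by docID within each term
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        for term_id, doc_id in postings:
            f.write(f"{term_id}\t{doc_id}\n")
    return path
```

The sorted runs are then combined in the merge step shown next.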
Sec. 4.2
[Figure: external merge – two sorted runs on disk are read sequentially and combined into one merged run.]
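A minimal sketch of the merge step, assuming the runs were written as tab-separated (termID, docID) lines as above; heapq.merge keeps only one pending line per run in memory and reads every run sequentially:

```python
import heapq

def read_run(path):
    """Stream (termID, docID) pairs back from one sorted run on disk."""
    with open(path) as f:
        for line in f:
            term_id, doc_id = line.split("\t")
            yield int(term_id), int(doc_id)

def merge_runs(run_paths, out_path):
    """k-way merge of sorted runs into one fully sorted postings file."""
    with open(out_path, "w") as out:
        for term_id, doc_id in heapq.merge(*(read_run(p) for p in run_paths)):
            out.write(f"{term_id}\t{doc_id}\n")
```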
Sec. 4.3
SPIMI:
Single-Pass In-Memory Indexing
• Key idea 1: Generate separate dictionaries for each
block – no need to maintain term-termID mapping
across blocks.
• Key idea 2: Don’t sort. Accumulate postings in
postings lists as they occur.
• With these two ideas we can generate a complete
inverted index for each block.
• These separate indexes can then be merged into
one big index.
Sec. 4.3
SPIMI-Invert
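A minimal Python sketch of the SPIMI-Invert idea, assuming a stream of (term, docID) tokens; the function name and block-size cutoff are illustrative:

```python
def spimi_invert(token_stream, max_postings=10_000_000):
    """Invert one block: build a per-block dictionary and accumulate
    postings as they occur (key ideas 1 and 2 - no global term-termID
    mapping, no sorting of individual postings)."""
    index = {}       # per-block dictionary: term -> list of docIDs
    n_postings = 0
    for term, doc_id in token_stream:
        index.setdefault(term, []).append(doc_id)
        n_postings += 1
        if n_postings >= max_postings:   # block full: stop and write out
            break
    # Terms are sorted only once, when the block is written to disk,
    # so the per-block indexes can later be merged sequentially.
    return sorted(index.items())
```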
SPIMI: Compression
• Compression makes SPIMI even more efficient.
—Compression of terms
—Compression of postings
• We’ll discuss later today
Sec. 4.4
Distributed indexing
• For web-scale indexing (don’t try this at home!):
must use a distributed computing cluster
[Photo: Google Data Center in Jurong West]
Sec. 4.4
Distributed indexing
• Maintain a master machine directing the indexing
job – considered “safe”.
• Break up indexing into sets of (parallel) tasks.
• Master machine assigns each task to an idle
machine from a pool.
Sec. 4.4
Parallel tasks
• We will use two sets of parallel tasks
—Parsers
—Inverters
• Break the input document collection into splits
• Each split is a subset of documents (corresponding
to blocks in BSBI/SPIMI)
Sec. 4.4
Parsers
• Master assigns a split to an idle parser machine
• Parser reads one document at a time and emits
(term, doc) pairs
• Parser writes pairs into j partitions
• Each partition is for a range of terms’ first letters
—(e.g., a-f, g-p, q-z) – here j = 3.
• Now, to complete the index inversion:
Sec. 4.4
Inverters
• An inverter collects all (term,doc) pairs (= postings)
for one term-partition.
• Sorts and writes to postings lists
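A minimal sketch of both task types, with j = 3 first-letter partitions as above; the partition bounds and all names are illustrative:

```python
from collections import defaultdict

PARTITIONS = ("af", "gp", "qz")  # j = 3 term ranges: a-f, g-p, q-z

def partition_of(term):
    """Route a term to a partition by its first letter."""
    for i, (lo, hi) in enumerate(PARTITIONS):
        if lo <= term[0] <= hi:
            return i
    return len(PARTITIONS) - 1   # catch-all for digits, symbols, etc.

def parse(split):
    """Parser (map): read the docs in one split and emit (term, docID)
    pairs, grouped into one segment per term partition."""
    segments = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Inverter (reduce): collect all (term, docID) pairs for one
    partition, sort them, and build the postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        postings[term].append(doc_id)
    return postings
```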
Sec. 4.4
Data flow
[Figure: splits → parsers (map phase) → segment files → inverters (reduce phase) → postings.]
Sec. 4.4
MapReduce
• The index construction algorithm we just described
is an instance of MapReduce.
• MapReduce (Dean & Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing …
• … without having to write code for the distribution
part.
• They describe the Google indexing system (ca.
2002) as consisting of a number of phases, each
implemented in MapReduce.
Sec. 4.4
MapReduce
• Index construction was just one phase.
• Another phase: transforming a term-partitioned
index into a document-partitioned index.
—Term-partitioned: one machine handles a subrange
of terms
—Document-partitioned: one machine handles a
subrange of documents
• As we’ll discuss in the web part of the course, most
search engines use a document-partitioned index
for better load balancing, etc.
Sec. 4.5
Dynamic indexing
• Up to now, we have assumed that collections are
static.
• They rarely are:
—Documents come in over time and need to be
inserted.
—Documents are deleted and modified.
• This means that the dictionary and postings lists
have to be modified:
—Postings updates for terms already in dictionary
—New terms added to dictionary
◦ e.g., a newly trending hashtag: #MoonbyulxPunch
Sec. 4.5
Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
—Invalidation bit-vector for deleted docs
—Filter docs output on a search result by this
invalidation bit-vector
• Periodically, re-index into one main index
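A minimal sketch of this scheme (class and field names are illustrative; a Python set stands in for the invalidation bit-vector):

```python
class DynamicIndex:
    """A 'big' main index plus a 'small' in-memory auxiliary index,
    with deletions tracked in an invalidation structure rather than
    removed from the postings lists in place."""

    def __init__(self, main_index):
        self.main = main_index   # term -> sorted list of docIDs
        self.aux = {}            # new docs go here until re-indexing
        self.deleted = set()     # invalidation "bit-vector"

    def add_document(self, doc_id, terms):
        for term in terms:
            self.aux.setdefault(term, []).append(doc_id)

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)   # just mark it; filter at query time

    def search(self, term):
        """Search both indexes, merge, then filter out deleted docs."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return sorted(d for d in set(hits) if d not in self.deleted)
```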
Issues with multiple indexes
• Collection-wide statistics are hard to maintain
—E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to the
user?
—We said, pick the one with the most hits
• How do we maintain the top ones with multiple
indexes and invalidation bit vectors?
—One possibility: ignore everything but the main
index for such ordering
• Will see more such statistics used in results ranking
Dynamic indexing at search engines
• All the large search engines now do dynamic
indexing
• Their indices have frequent incremental changes
—News, blogs, Twitter, new topical web pages
• But (sometimes/typically) they also periodically
reconstruct the index from scratch
—Query processing is then switched to the new index,
and the old index is deleted