Information
Retrieval
Term-document incidence matrices
Sec. 1.1
            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        1
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query Brutus AND Caesar AND NOT Calpurnia: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➔ bitwise AND.
– 110100 AND
– 110111 AND
– 101111 =
– 100100
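The bitwise AND above can be sketched in a few lines of Python. This is only an illustration: the vectors are copied from the incidence matrix, while the names `PLAYS`, `VECTORS` and `matching_plays` are made up for the example.

```python
# Incidence vectors from the table, one bit per play, in column order.
# The leftmost bit corresponds to the first play.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

VECTORS = {
    "Brutus": 0b110100,
    "Caesar": 0b110111,
    "Calpurnia": 0b010000,
}

def matching_plays(bits, plays=PLAYS):
    """Return the plays whose bit is set in `bits`."""
    n = len(plays)
    return [play for i, play in enumerate(plays) if bits >> (n - 1 - i) & 1]

# Complement Calpurnia within 6 bits, then AND the three vectors.
mask = (1 << len(PLAYS)) - 1
not_calpurnia = ~VECTORS["Calpurnia"] & mask           # 0b101111
result = VECTORS["Brutus"] & VECTORS["Caesar"] & not_calpurnia

print(format(result, "06b"))     # 100100
print(matching_plays(result))    # ['Antony and Cleopatra', 'Hamlet']
```

The two set bits correspond to Antony and Cleopatra and Hamlet.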
Answers to query
• Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.
Bigger collections
• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.
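The 6GB figure follows directly from the numbers above. The short calculation below reproduces it and, as an added illustration not stated on the slide, also shows how many cells a full term-document incidence matrix would need for the same N and M. The variable names are invented for the example.

```python
N = 1_000_000           # documents
WORDS_PER_DOC = 1_000   # approximate words per document
BYTES_PER_WORD = 6      # average, including spaces/punctuation
M = 500_000             # distinct terms

corpus_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD
print(corpus_bytes / 10**9, "GB")    # 6.0 GB

# Assumption for illustration: one 0/1 cell per (term, document) pair.
matrix_cells = M * N
# At most one 1 per word occurrence, so at most N * WORDS_PER_DOC ones.
max_ones = N * WORDS_PER_DOC
print(matrix_cells)                  # 500000000000
print(max_ones / matrix_cells)       # upper bound on the fraction of 1s
```

Under these assumptions the matrix has half a trillion cells, of which at most one billion can be 1, so almost all of it is zeros.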
Introduction to
Information
Retrieval
The Inverted Index
The key data structure underlying
modern IR
Inverted index
Sec. 1.2
• It is an index data structure storing a mapping from content, such as terms, to the documents that contain it.
Caesar      1 2 4 5 6 16 57 132
Calpurnia   2 31 54 101
What happens if the word Caesar is added to document 14?
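One way to think about the question is to model each postings list as a sorted Python list. This is a minimal sketch, assuming the docIDs are as read off the figure; `add_posting` is a hypothetical helper, not something named in the slides.

```python
import bisect

# Postings lists keyed by term, each kept sorted by docID.
index = {
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

add_posting(index, "Caesar", 14)
print(index["Caesar"])   # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```

The new docID has to go into the middle of the list, which is cheap to locate but requires shifting entries in an array. That is why a fixed-size array per term does not work well.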
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
Dictionary   Postings
Brutus       1 2 4 11 31 45 173 174
Caesar       1 2 4 5 6 16 57 132
Calpurnia    2 31 54 101
Each docID entry in a postings list is a posting.
Sorted by docID (more later on why).
Linguistic modules
Doc 1: I did enact Julius Caesar I was killed i’ the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Core indexing step
Why frequency?
Will discuss later.
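The figures for these slides did not survive, so the following is only a sketch of one common way to build the index for Doc 1 and Doc 2: collect (term, docID) pairs, sort them, and group them into postings lists with a document frequency per term. The tokenizer is a hypothetical stand-in for the linguistic modules, and the names `tokenize` and `build_index` are invented for the example.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(docs):
    """Map each term to (document frequency, sorted postings list)."""
    # Collect unique (term, docID) pairs and sort by term, then docID.
    pairs = sorted({(term, doc_id)
                    for doc_id, text in docs.items()
                    for term in tokenize(text)})
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index

index = build_index(docs)
print(index["brutus"])   # (2, [1, 2])
print(index["killed"])   # (1, [1])
```

Storing the document frequency next to each term makes the length of its postings list available without reading the list.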
Query processing with an inverted index
Sec.
1.3
The merge
• Walk through the two postings lists simultaneously, in time linear in the total number of postings entries
Brutus   2 4 8 16 32 64 128
Caesar   1 2 3 5 8 13 21 34
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.
Intersecting two postings lists (a “merge”
algorithm)
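The slide's own listing is not preserved here, so what follows is a minimal Python sketch of the two-pointer merge described above. It assumes both lists are sorted by docID; the function name `intersect` is chosen for the example.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```

Each loop iteration advances at least one pointer, which is where the O(x+y) bound comes from. The comparison step is only valid because both lists are sorted.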
The Boolean Retrieval Model
& Extended Boolean
Models
Merging
Query optimization
• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together
Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 16 21 34
Calpurnia   13 16
Query: Brutus AND Calpurnia AND Caesar
Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 16 21 34
Calpurnia   13 16
Execute the query as (Calpurnia AND Brutus) AND Caesar. Calpurnia has the shortest postings list, so intersecting it first keeps the intermediate result small.
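A possible way to implement this ordering is to sort the postings lists by length and intersect from shortest to longest. The sketch below assumes list length stands in for term frequency; `intersect_many` is a hypothetical helper name.

```python
def intersect(p1, p2):
    """Linear-time intersection of two postings lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(postings_lists):
    """AND together several postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break   # nothing left to match, later lists cannot add docIDs
        result = intersect(result, postings)
    return result

index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
query = ["Brutus", "Calpurnia", "Caesar"]
print(intersect_many([index[term] for term in query]))   # [16]
```

With these lists the order is Calpurnia, Brutus, Caesar, so the intermediate result never holds more than two docIDs.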
Exercise
• Recommend a query
processing order for
Query processing exercises
• Exercise: If the query is friends AND romans
AND (NOT countrymen), how could we use the
freq of countrymen?
• Exercise: Extend the merge to an arbitrary
Boolean query. Can we always guarantee
execution in time linear in the total postings size?
• Hint: Begin with the case of a Boolean formula
query: in this, each query term appears only once
in the query.
Exercise
• Try the search feature at http://www.rhymezone.com/shakespeare/
• Write down five search features you think it could do better
Phrase queries and positional indexes
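Phrase queries
Sec. 2.4
• A phrase query is a type of search query where the query terms must occur in the document consecutively and in the given order, e.g. “to be or not to be”.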
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
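To make the positional check concrete, here is a small sketch of a phrase lookup over a positional index. The positions for "be" are taken from the listing above; the positions for "to" are invented for illustration, and `phrase_docs` is a hypothetical helper. For brevity it uses membership tests rather than a linear merge of the position lists.

```python
# Positional index: term -> {docID: sorted list of positions}.
positional_index = {
    "be": {1: [7, 18, 33, 72, 86, 231],
           2: [3, 149],
           4: [17, 191, 291, 430, 434]},
    # Hypothetical positions for "to" (not from the slide).
    "to": {1: [6, 50], 2: [10], 4: [16, 429], 5: [1]},
}

def phrase_docs(index, terms):
    """Return docIDs where the terms occur at consecutive positions, in order."""
    first, rest = terms[0], terms[1:]
    matches = []
    for doc_id, positions in sorted(index.get(first, {}).items()):
        for start in positions:
            # Term number k of the phrase must sit at position start + k.
            if all(start + k in index.get(term, {}).get(doc_id, ())
                   for k, term in enumerate(rest, 1)):
                matches.append(doc_id)
                break
    return matches

print(phrase_docs(positional_index, ["to", "be"]))   # [1, 4]
```

With this toy data, doc 2 contains both words but never adjacent, so only docs 1 and 4 match.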
Proximity queries
• LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
– Again, here, /k means “within k words of”.
• Clearly, positional indexes can be used for
such queries; biword indexes cannot.
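A single /k constraint between two terms can be checked with a two-pointer walk over their position lists. The sketch below handles only one pair of terms, not the full chained query, and the positional index data and function names are invented for the example.

```python
def within_k(positions_a, positions_b, k):
    """True if some position in a is within k words of some position in b.

    Both lists must be sorted in increasing order.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to close the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

def proximity_docs(index, term_a, term_b, k):
    """Return docIDs in which term_a occurs within k words of term_b."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    return [doc_id for doc_id in sorted(docs_a.keys() & docs_b.keys())
            if within_k(docs_a[doc_id], docs_b[doc_id], k)]

# Hypothetical positional index: term -> {docID: sorted positions}.
index = {
    "statute": {1: [4, 90], 2: [15], 3: [7]},
    "federal": {1: [6], 2: [40], 4: [8]},
}
print(proximity_docs(index, "statute", "federal", 3))   # [1]
```

A biword index stores only adjacent pairs, so it has no position information to support this kind of check.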
Sec.
2.4.2
• Term storage: for each term, the index stores its occurrences across all documents.
Rules of thumb
• A positional index is 2–4 times as large as a non-positional index
Combination schemes
• These two approaches (biword indexes and positional indexes) can be profitably combined
– For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
• Even more so for phrases like “The Who”
• Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
– A typical web query mixture was executed in ¼ of the time of using just a positional index
– It required 26% more space than having a positional index alone
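One simple reading of such a combination is to keep a direct phrase entry for selected frequent phrases and fall back to the positional index for everything else. The sketch below illustrates that idea only; it is not the scheme evaluated by Williams et al., and all data and names in it are hypothetical.

```python
# Hypothetical biword index: selected two-word phrases -> docID lists.
biword_index = {
    ("michael", "jackson"): [3, 7, 12],
    ("the", "who"): [7, 20],
}

# Hypothetical positional index: term -> {docID: sorted positions}.
positional_index = {
    "federal": {4: [10], 9: [2]},
    "tort": {4: [11], 9: [8]},
}

def positional_phrase(terms, index):
    """Return docIDs where the terms occur at consecutive positions, in order."""
    first, rest = terms[0], terms[1:]
    hits = []
    for doc_id, positions in sorted(index.get(first, {}).items()):
        if any(all(p + k in index.get(term, {}).get(doc_id, ())
                   for k, term in enumerate(rest, 1))
               for p in positions):
            hits.append(doc_id)
    return hits

def phrase_search(phrase):
    """Answer from the biword index when the phrase is stored there, else use positions."""
    terms = tuple(phrase.lower().split())
    if terms in biword_index:
        return biword_index[terms], "biword"
    return positional_phrase(terms, positional_index), "positional"

print(phrase_search("The Who"))        # ([7, 20], 'biword')
print(phrase_search("federal tort"))   # ([4], 'positional')
```

The direct lookup avoids merging the long position lists of very common words such as "the" and "who", at the cost of the extra space for the stored phrases.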
Key Idea of the Hybrid Approach
Instead of relying solely on either biword or positional indexes, a
combination scheme strategically uses: