03 -Lect3 search engines-part2
Retrieval (CS418)
Search Engine Architecture [2]
Lecture 3
[Figure: annotated indexing diagram. © Addison Wesley, 2008]
• What data do we want?
• A lookup table for quickly finding all docs containing a word
• Text transformation: format conversion. International? Which part contains “meaning”?
• This index can only determine whether a word exists within a particular
document, since it stores no information regarding the frequency and
position of the word; it is therefore considered to be a boolean index.
These examples/figures are from: Manning, Raghavan, Schütze, Intro to Information Retrieval, CUP, 2008
The Shakespeare collection as Term-Document Matrix
QUERY: Brutus AND Caesar AND NOT Calpurnia
Answer: “Antony and Cleopatra”(d=1), “Hamlet”(d=4)
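
The query above can be evaluated with bitwise operations on the rows of the term-document matrix. The following is a minimal sketch, not the lecture's own code. The row for Brutus matches the docIDs given in these notes (d=1, 2, 4); the rows for Caesar and Calpurnia are assumed from the textbook's Shakespeare example, and the helper name `matching_docs` is hypothetical.

```python
# One bit per document; the leftmost bit is d=1 and the rightmost is d=6.
# Brutus matches the notes (d=1, 2, 4); the other two rows are assumed
# from the textbook example and are not stated in these slides.
NUM_DOCS = 6
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def matching_docs(bits, num_docs=NUM_DOCS):
    """Return the 1-based document ids whose bit is set in `bits`."""
    return [d for d in range(1, num_docs + 1)
            if (bits >> (num_docs - d)) & 1]

# NOT is a bitwise complement, masked so only the NUM_DOCS low bits remain.
mask = (1 << NUM_DOCS) - 1
not_calpurnia = ~incidence["Calpurnia"] & mask

result = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia
answer = matching_docs(result)
print(answer)
```

Under these assumed rows the result is d=1 and d=4, which agrees with the answer stated above.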
Inverted Index Data Structure
Each term (t) maps to a list of document ids (d), e.g. “Brutus” occurs in d=1, 2, 4...
Importantly, the list is sorted.
Inverted Index
• Each index term is associated with an
inverted list
– Contains lists of documents, or lists of word
occurrences in documents, and other
information
– Each entry is called a posting
– The part of the posting that refers to a specific
document or location is called a pointer
– Each document in the collection is given a
unique number
– Lists are usually document-ordered (sorted by
document number)
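
The terminology above can be illustrated with a small positional index, where each posting pairs a document number with the word positions inside that document. This is a sketch under assumptions: the toy collection, the whitespace tokenizer, and the function name `build_positional_index` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to a document-ordered inverted list.

    Each posting is (doc_id, positions): doc_id points at a document,
    and positions are the word offsets of the term in that document.
    """
    index = defaultdict(list)
    # Visiting documents in increasing id order keeps every list document-ordered.
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, token in enumerate(docs[doc_id].lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)

# Hypothetical toy collection; each document has a unique number.
docs = {
    1: "brutus killed caesar",
    2: "caesar was ambitious",
    3: "brutus praised caesar and caesar listened",
}

index = build_positional_index(docs)
print(index["caesar"])
print(index["brutus"])
```

Dropping the position lists and keeping only the document numbers would give the simpler boolean form of the index.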
Sec. 1.2
Inverted index
Dictionary Postings
Sorted by docID (more later on why).
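
One commonly cited benefit of docID-sorted postings is that two lists can be intersected in a single linear pass, which is how an AND query can be answered. The sketch below assumes plain Python lists of integers as postings; the example lists are hypothetical and the function name `intersect` is chosen for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are both sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: the document matches both terms.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that sits on the smaller docID.
            i += 1
        else:
            j += 1
    return answer

# Hypothetical postings lists for two terms.
postings_a = [1, 2, 4, 11, 31]
postings_b = [2, 4, 8, 16, 31]
common = intersect(postings_a, postings_b)
print(common)
```

Because each step advances at least one pointer, the loop runs at most len(p1) + len(p2) times. Without the sort order, each docID in one list would have to be searched for in the other.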
[Figure: Doc 1 and Doc 2 example]
Why frequency? Will discuss later.
Final inverted index
[Figure labels: terms and counts; pointers; lists of docIDs]
IR system implementation
• How do we index efficiently?
• How much storage do we need?
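
The final structure (terms with counts, each pointing to a sorted list of docIDs) can be sketched with a sort-based construction: collect (term, docID) pairs, sort them, then group by term. This is an illustrative sketch only. The two documents are hypothetical stand-ins, the count stored here is assumed to be the number of documents containing the term, and `build_index` is a made-up name.

```python
from itertools import groupby

def build_index(docs):
    """Return {term: (document_count, sorted_doc_ids)}."""
    # Step 1: one (term, doc_id) pair per token occurrence, sorted by term then docID.
    pairs = sorted(
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    )
    # Step 2: group pairs by term and merge duplicate docIDs.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index

# Hypothetical two-document collection.
docs = {
    1: "caesar was killed brutus killed caesar",
    2: "noble brutus told caesar",
}

index = build_index(docs)
for term in sorted(index):
    count, postings = index[term]
    print(term, count, postings)
```

In a Python dict the "pointer" from a term to its list is just the stored reference. A disk-based system would typically keep an offset into a postings file instead, which is where the efficiency and storage questions above come in.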