Chapter 3 Part 1
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more relevant documents for a user's query
–Effectiveness of the system is measured in terms of precision, recall, …
–Main emphasis: stemming, stopword removal, weighting schemes, matching algorithms
• Searching
–An online process that scans the document corpus to find relevant documents that match the user's query
Indexing Subsystem
documents
 → assign document identifier (document IDs)
 → tokenization (tokens)
 → stopword removal (non-stoplist tokens)
 → stemming & normalization (stemmed terms)
 → term weighting (weighted index terms)
 → Index File
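The indexing pipeline can be sketched in a few lines of Python. This is a minimal illustration only: the stopword list and the plural-stripping "stemmer" are toy stand-ins, not the techniques a production system would use.

```python
# Minimal sketch of the indexing pipeline. The stopword list and the
# plural-stripping "stemmer" are simplified stand-ins for illustration.
STOPWORDS = {"the", "a", "an", "is", "in", "to", "for", "of"}

def stem(token):
    # Toy stemmer: strip a plural "s"; real systems use e.g. Porter's algorithm.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def index_document(doc_id, text, index):
    tokens = text.lower().split()                        # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    terms = [stem(t) for t in tokens]                    # stemming & normalization
    for term in terms:                                   # term weighting: raw TF
        index.setdefault(term, {})
        index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index
```

Each call adds one document's weighted terms to the index structure (a dict mapping term → {document ID: term frequency}).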
Searching Subsystem
query
 → parse query (query tokens)
 → stopword removal (non-stoplist tokens)
 → stemming & normalization (stemmed terms)
 → term weighting (query terms)
 → similarity measure against the index terms in the Index File
 → ranking (ranked document set)
 → relevant document set
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search what was not first indexed in some manner or other
– documents or objects are indexed in order to be searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
•Indexing text
–Organizing indexes
• What techniques to use? How to select them?
–Storage of indexes
• Is compression required? Do we store in memory or on disk?
•Accessing text
–Accessing indexes
• How to access the indexes? What data/file structure to use?
–Processing indexes
• How to search a given query in the index? How to update the index?
–Accessing documents
Text Compression
•Text compression is about finding ways to represent the text in fewer bits or bytes
•Advantages:
–saves storage space
–speeds up document transmission
–searching the compressed text can take less time
•Common compression methods
–Statistical methods: require statistical information about the frequency of occurrence of symbols in the document
E.g. Huffman coding
•Estimate probabilities of symbols, code one symbol at a time, assign shorter codes to high-probability symbols
–Adaptive methods: construct the dictionary in the course of compression
E.g. Ziv-Lempel compression:
•Replace words or symbols with a pointer to dictionary entries
Huffman coding
•Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission
•The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space & decoding time per symbol
•Huffman coding is constructed based on the frequency of occurrence of letters in text documents
•Example code tree (0 on left branches, 1 on right branches) yields the codes:
 D1 = 000
 D2 = 001
 D3 = 01
 D4 = 1
How to construct Huffman coding
Step 1: Create a forest of single-node trees, one for each symbol t1, t2, … tn
Step 2: Sort the forest of trees according to falling probabilities of symbol occurrence
Step 3: WHILE more than one tree exists DO
– Merge the two trees t1 and t2 with the least probabilities p1 and p2
– Label their root with the sum p1 + p2
– Associate binary codes: 1 with the right branch and 0 with the left branch
Step 4: Create a unique codeword for each symbol by traversing the tree from the root to the leaf
– Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 at its root and the symbols at its leaf nodes.
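The steps above can be sketched in Python, using a min-heap as the sorted forest. Symbol weights may be probabilities or raw counts; this is a sketch, and tie-breaking (the extra counter in each heap entry) may produce different but equally optimal trees than a hand-drawn construction.

```python
import heapq

def huffman_codes(weights):
    """Build a Huffman code table from a {symbol: weight} mapping."""
    # Steps 1-2: a forest of single-node trees kept sorted by weight (min-heap).
    # A tree is either a symbol (leaf) or a (left, right) pair (internal node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    count = len(heap)
    # Step 3: repeatedly merge the two least-weight trees; label root with the sum.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    # Step 4: traverse root-to-leaf, concatenating 0 (left) / 1 (right).
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # lone-symbol edge case
    walk(heap[0][2], "")
    return codes
```

For the count table given further below (a: 16, b: 5, c: 12, d: 17, e: 10, t: 25), the resulting codes encode the 85 characters in 212 bits, versus 85 × 8 = 680 bits uncompressed.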
Example
• Consider the 7-symbol alphabet given in the following table and construct the Huffman coding.

Symbol  Probability
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

• The Huffman encoding algorithm picks, at each step, the two symbols (or subtrees) with the smallest frequencies to combine.
Huffman code tree
[Figure: Huffman code tree for the example. The root (probability 1) splits into a 0.4 subtree = d (0.2) + f (0.2) and a 0.6 subtree = e (0.3) + a 0.3 subtree, where 0.3 = g (0.1) + 0.2, 0.2 = c (0.1) + 0.1, and 0.1 = a (0.05) + b (0.05). Left branches are labelled 0 and right branches 1.]
Character:  a   b   c   d   e   t
Frequency:  16  5   12  17  10  25
•Ziv-Lempel compression
–Does not rely on previous knowledge about the data
–Rather, builds this knowledge in the course of data transmission/data storage
–The Ziv-Lempel algorithm (called LZ) uses a table of code-words created during data transmission;
•each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form C0C1…Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1…Cn-1
• Example strings:
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA
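A minimal LZ78-style sketch of this idea follows. The dictionary of phrases is built during compression, and the output is a list of (dictionary index, next character) pairs; the exact output format of real LZ variants differs.

```python
def lz78_compress(text):
    """Compress text into (dictionary index, next char) pairs (LZ78-style sketch)."""
    dictionary = {}   # phrase -> index (index 0 means "empty prefix")
    result = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # extend the current match
        else:
            result.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                # leftover phrase at end of input
        result.append((dictionary[phrase], ""))
    return result

def lz78_decompress(codes):
    """Rebuild the text by replaying the dictionary construction."""
    dictionary = {0: ""}
    out = []
    for idx, ch in codes:
        entry = dictionary[idx] + ch
        out.append(entry)
        dictionary[len(dictionary)] = entry
    return "".join(out)
```

For instance, "Mississippi" compresses to the phrase sequence M, i, s, si, ss, ip, p, i, and decompression reproduces the original string exactly.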
Indexing structures: Exercise
• Discuss in detail the theoretical and algorithmic concepts (including construction, various operations, complexity, etc.) of the following commonly used data structures.
[Figure: a tokenizer converting the text "Friends Romans countrymen" into a token stream; index entries record term positions, e.g. countryman → 13, 16.]
Building Index file
•An index file of a document collection is a file consisting of a list of index terms and a link to one or more documents that contain each index term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
•An index file is a list of search terms organized for associative look-up, i.e., to answer a user's query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)
•For organizing an index file for a collection of documents, various options are available:
–Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.?
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (insertion time, deletion time, modification time, …)
• Space overhead
–Computer storage space consumed
Doc 2: So let it be with Caesar. The noble Brutus has told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied, to generate index terms
• These index terms in the sequential file are sorted in alphabetical order

Token stream (Term, Doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sequential file (Term No., Term, Doc #):
1   ambition  2
2   brutus    1
3   brutus    2
4   capitol   1
5   caesar    1
6   caesar    2
7   caesar    2
8   enact     1
9   julius    1
10  kill      1
11  kill      1
12  noble     2
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic order;
– instead of a linear-time search, one can search in logarithmic time using binary search.
• Its disadvantages:
– difficult to update: the index must be rebuilt if a new term is added, and inserting a new record may require moving a large proportion of the file;
– random access is extremely slow.
• The problem of update can be solved:
– by ordering records by date of acquisition rather than by key value; hence, the newest entries are added at the end of the file and therefore pose no difficulty to updating. But searching becomes very tough: it requires linear time.
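The logarithmic-time lookup on a sorted sequential file is ordinary binary search; a short sketch using Python's standard bisect module:

```python
from bisect import bisect_left

def search_term(sorted_terms, term):
    """Binary search in a sorted sequential file: O(log n) per lookup.

    Returns the record position of the term, or -1 if it is absent.
    """
    pos = bisect_left(sorted_terms, term)
    if pos < len(sorted_terms) and sorted_terms[pos] == term:
        return pos
    return -1
```

For example, searching "caesar" in the sorted list ["ambition", "brutus", "caesar", "capitol", "enact"] returns position 2, while a term not in the file returns -1.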
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it
–Building and maintaining an inverted index is a relatively low-cost task: on a text of n words, an inverted index can be built in O(n) time
•Why location?
– Having information about the location of each term within the document helps with:
•user interface design: highlighting the location of search terms
•proximity-based ranking: adjacency and NEAR operators (in Boolean searching)
•Why frequencies?
– Having information about frequency is used for:
–calculating term weights (like TF, TF*IDF, …)
–optimizing query processing
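As a concrete illustration of frequency-based weighting, one common TF*IDF variant (the exact formula varies by system) multiplies the raw term frequency by the log inverse document frequency:

```python
import math

def tf_idf(tf, df, n_docs):
    """One common TF*IDF variant: w = tf * log10(N / df).

    tf     -- frequency of the term in the document
    df     -- number of documents containing the term
    n_docs -- total number of documents in the collection (N)
    """
    return tf * math.log10(n_docs / df)
```

A term occurring in every document gets weight 0 (it carries no discriminating power), while rare terms are boosted.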
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Doc ID  TF  Location
auto   3   2       1   66
           19      1   213
           29      1   45
bus    4   3       1   94
           19      2   7, 212
           22      1   56
taxi   1   5       1   43
train  3   11      2   3, 70
           34      1   40
Construction of Inverted file
•An inverted index consists of two files:
–a vocabulary file
–a posting file
Advantage of dividing the inverted file:
•Keeping a pointer in the vocabulary to the list in the posting file allows:
– the vocabulary to be kept in memory at search time, even for large text collections, and
– the posting file to be kept on disk for accessing the documents
Inverted index storage
•Separation of the inverted file into a vocabulary file and a posting file is a good idea.
–Vocabulary: for searching we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: a collection of 1,000,000,000 documents may contain only 1,000,000 distinct words. Hence, the size of the vocabulary is about 100 MB, which can easily be held in the memory of a dedicated computer.
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.

Unsorted (Term, Doc #):
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, has 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sorted by term (Term, Doc #):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, has 2, I 1, I 1, I 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove stopwords, apply stemming & compute term frequency
•Multiple term entries in a single document are merged and frequency information is added
•Counting the number of occurrences of terms in the collection helps to compute TF

After stopword removal and stemming (Term, Doc #):
ambition 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, enact 1, julius 1, kill 1, kill 1, noble 2

After merging entries (Term, Doc #, TF):
ambition  2  1
brutus    1  1
brutus    2  1
capitol   1  1
caesar    1  1
caesar    2  2
enact     1  1
julius    1  1
kill      1  2
noble     2  1
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file.

Vocabulary (Term, DF, CF)    Postings (Doc #, TF)
ambition  1  1            →  (2, 1)
brutus    2  2            →  (1, 1), (2, 1)
capitol   1  1            →  (1, 1)
caesar    2  3            →  (1, 1), (2, 2)
enact     1  1            →  (1, 1)
julius    1  1            →  (1, 1)
kill      1  2            →  (1, 2)
noble     1  1            →  (2, 1)

Each vocabulary entry holds a pointer to its postings list.
Complexity Analysis
• The inverted index can be built in O(n) time, where n is the number of words in the collection.
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index, incremental indexing can be applied, which requires O(k) time, where k is the number of new index terms.
Exercises
• Construct the inverted index for the following document collection.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
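One possible starting point for this exercise is sketched below. It applies only lowercasing plus a naive plural-stripping normalization (so "homes" and "home" merge); no stopword removal is done here, and a real system would use a proper stemmer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}.

    Normalization is deliberately naive (lowercase + strip a plural "s").
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term = token[:-1] if token.endswith("s") and len(token) > 3 else token
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

docs = {
    1: "New home to home sales forecasts",
    2: "Rise in home sales in July",
    3: "Home sales rise in July for new homes",
    4: "July new home sales rise",
}
index = build_inverted_index(docs)
```

With this normalization, "home" has the postings {1: 2, 2: 1, 3: 2, 4: 1}, "rise" appears in documents 2, 3 and 4, and "sale" appears once in every document.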