ISR Chap...4
Indexing Basics
1
Text Collections and IR
• Large collections of documents from various sources: news
articles, research papers, books, digital libraries, Web pages,
etc.
Sample Statistics of Text Collections
• Dialog:
–claims to have more than 15 terabytes of data in >600 Databases, >
800 million unique records
• LEXIS/NEXIS:
–claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers,
11,400 databases; >200,000 searches per day
• Web Search Engines:
–Google claims to index over 1.5 billion pages.
–How many search engines are available these days?
2
Designing an IR System
Our focus during IR system design is on:
• Improving the effectiveness of the system
–Effectiveness of the system is measured in terms of precision, recall, …
–Stemming, stop-words, weighting schemes, matching algorithms
• Improving the efficiency of the system. The concern here is
–storage space usage, access time, …
–Compression, data/file structures, space-time tradeoffs
• The two subsystems of an IR system:
–Searching and
–Indexing
3
Indexing Subsystem
[Figure: indexing pipeline]
documents → Assign document identifier → document + ID
document text → Tokenize → tokens
tokens → Stop list → non-stoplist tokens
non-stoplist tokens → Stemming & Normalize → stemmed terms
stemmed terms → Term weighting → terms with weights
terms with weights → Index
4
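The pipeline above can be sketched in Python. This is a minimal sketch: the stop list is a hypothetical sample, and a toy suffix-stripping function stands in for a real stemmer such as Porter's; term weights here are just raw term frequencies.

```python
# A minimal sketch of the indexing pipeline (assumed stop list,
# toy suffix-stripping "stemmer"; real systems use e.g. Porter stemming).
import re
from collections import Counter

STOP_LIST = {"the", "a", "an", "of", "to", "and", "is", "in"}  # assumed stop words

def tokenize(text):
    # Split the document text into lowercase word tokens.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Toy normalization: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: len(token) - len(suffix)]
    return token

def index_document(doc_id, text):
    # tokens -> non-stoplist tokens -> stemmed terms -> term weights (raw TF here)
    tokens = tokenize(text)
    terms = [stem(t) for t in tokens if t not in STOP_LIST]
    return doc_id, Counter(terms)

doc_id, weights = index_document(1, "Indexing speeds up searching of indexed documents")
print(doc_id, dict(weights))
```

Each stage mirrors one box of the diagram; swapping in a real stop list and stemmer changes only the two helper functions.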
Searching Subsystem
[Figure: searching pipeline]
query → parse query → query tokens
query tokens → Stop list → non-stoplist tokens
non-stoplist tokens → Stemming & Normalize → stemmed terms
stemmed terms → Query term weighting → query terms
query terms + index terms (from the Index) → Similarity Measure → relevant document set
relevant document set → Ranking → ranked document set
5
Basic assertion
Indexing and searching:
– you cannot search what was not first indexed in some manner or other
– indexing of documents or objects is done in order to make them searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
Knowing searching is knowing indexing
6
Implementation Issues
•Storage of text:
–The need for text compression: to reduce storage space
•Indexing text
–Organizing indexes
–Storage of indexes
•Accessing text
7
Text Compression
• Text compression is about finding ways to represent the text in fewer bits or bytes
• Advantages:
–saves storage space
–speeds up document transmission
–takes less time to search the compressed text
• Disadvantage:
–the time required to encode and decode the text
• Common compression methods
–Static methods: require statistical information about the frequency of occurrence of symbols in the document
E.g. Huffman coding
•Estimate probabilities of symbols, code one symbol at a time, assigning shorter codes to high-probability symbols
–Adaptive methods: construct the dictionary in the course of compression
E.g. Lempel-Ziv compression (LZ):
•Replace words or symbols with a pointer to dictionary entries
8
Huffman coding
•Developed in the 1950s by David Huffman; widely used for text compression and message transmission
•The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space & decoding time per symbol
•Huffman coding is constructed based on the frequency of occurrence of letters in text documents
[Figure: Huffman code tree with leaves D1, D2, D3, D4]
Code of:
D1 = 000
D2 = 001
D3 = 01
D4 = 1
9
How to construct Huffman coding
Step 1: Create a forest of trees, one for each symbol t1, t2, … tn
Step 2: Sort the forest of trees in order of falling probabilities of symbol occurrence
Step 3: WHILE more than one tree exists DO
–Merge the two trees t1 and t2 with the least probabilities p1 and p2
–Label their root with the sum p1 + p2
–Associate binary codes: 1 with the right branch and 0 with the left branch
Step 4: Create a unique code word for each symbol by traversing the tree from the root to the leaf
–Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 in its root and the symbols in its leaf nodes.
10
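The merge procedure above can be sketched with a min-heap; the probabilities used are those of the 7-symbol example on the next slide. The integer counter in each heap entry is only a tie-breaker (an implementation detail, not part of the algorithm in the slides).

```python
# A sketch of Huffman code construction using a min-heap.
import heapq

def huffman_codes(freqs):
    # Steps 1-2: one single-node "tree" per symbol, ordered by probability.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    # Step 3: repeatedly merge the two least-probable trees.
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Prepend 0 for the left branch, 1 for the right branch.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]  # Step 4: one code word per symbol

freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
# Code lengths reflect probabilities: frequent symbols get shorter codes.
for sym in sorted(codes, key=lambda s: (len(codes[s]), s)):
    print(sym, codes[sym])
```

The exact codes depend on tie-breaking among equal probabilities, but any valid run yields a prefix-free code with the same expected length.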
Example
• Consider the 7-symbol alphabet given in the following table to construct the Huffman coding.

Symbol  Probability
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

• The Huffman encoding algorithm picks, at each step, the two symbols (or subtrees) with the smallest frequencies to combine.
11
Huffman code tree
[Figure: Huffman code tree for the 7-symbol example; the root has probability 1 and splits into a subtree of probability 0.4 (containing d and f) and a subtree of probability 0.6 (containing e, and a subtree of probability 0.3 over g, c, a, b)]
13
Ziv-Lempel compression
•The problem with Huffman coding is that it requires knowledge about the data before encoding takes place.
–Huffman coding requires the frequencies of symbol occurrence before code words are assigned to symbols
•Ziv-Lempel compression
–Does not rely on previous knowledge about the data
–Rather, it builds this knowledge in the course of data transmission/data storage
–The Ziv-Lempel algorithm (called LZ) uses a table of code words created during data transmission;
•each time, it replaces strings of characters with a reference to a previous occurrence of the string.
14
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form: C0C1 . . . Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1 . . . Cn-1
15
Example: LZ Compression
Encode (i.e., compress) the following strings using the LZ algorithm:
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
18
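One concrete variant of the dictionary-building idea above is LZ78, sketched below. The `(prefix index, next character)` output format is one common choice among the LZ family, not the only one; the second exercise string is used as the example.

```python
# A minimal LZ78-style sketch: the dictionary of phrases is built during
# encoding, and each phrase is emitted as a pair
# (index of its longest previously seen prefix, next character).
def lz78_encode(text):
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty prefix
    output, phrase = [], ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush a trailing phrase, if any
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    # Rebuild the same dictionary while decoding.
    phrases = [""]
    out = []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

encoded = lz78_encode("ABBCBCABABCAABCAAB")
assert lz78_decode(encoded) == "ABBCBCABABCAABCAAB"
print(encoded)
```

Note how the decoder needs no prior statistics: it reconstructs the dictionary as it reads the pairs, which is exactly the adaptive property contrasted with Huffman coding above.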
Indexing
Some concepts about indexing
• Until recently, indexing was accomplished by creating bibliographic citations in a structured file that references the original text.
• Bibliographic citation: a formal description of a source of information (such as a book, article, or document) that typically includes details like the author's name, title, publication date, and publisher. It serves as a reference to the original work.
• Automatic indexing is the capability of a system to automatically determine the index terms to be assigned to an item.
19
Cont…
•An index file of a document collection is a file consisting of a list of index terms and a link to one or more documents that contain each index term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
20
Sequential File
•The sequential file is the most primitive file structure.
–It has neither a vocabulary nor linking pointers.
• The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
–a particular attribute is chosen as the primary key, whose value determines the order of the records.
–when the first key fails to discriminate among records, a second key is chosen to give an order.
21
Sequential File
• To access records, search serially:
– starting at the first record, read and investigate all the succeeding records until the required record is found or the end of the file is reached.
22
Sequential File
• Its disadvantages:
– difficult to update. Index must be rebuilt if a new
term is added.
– Inserting a new record may require moving a large
proportion of the file;
– random access is extremely slow.
25
Inverted file
•Having information about the vocabulary (list of terms):
–When a system knows the vocabulary, it can quickly match user queries to the appropriate documents that contain relevant terms
–speeds searching for relevant documents
•Having information about the location of each term within the document helps for:
–user interface design: highlighting the location of search terms
•Having information about frequency is used for:
–calculating term weights (like TF, TF*IDF, …)
–optimizing query processing
26
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Word   Tot Freq  Document  Term Freq  Location
Act    3         2         1          66
                 19        1          213
                 29        1          45
bus    4         3         1          94
                 19        2          7, 212
                 22        1          56
Pen    1         5         1          43
total  3         11        2          3, 70
                 34        1          40

Location: the character or word position within the document (e.g., the 5th word, the 12th character).
27
Construction of Inverted file
An inverted index consists of two files: a vocabulary file and a postings file
•A vocabulary file (word list):
–stores all of the distinct terms (keywords) that appear in any of the documents (in lexicographical order), and
–for each word, a pointer to the postings file
•The record kept for each term j in the word list contains the following:
–term j
–frequency of the term in each document
–number of documents in which term j occurs (nj)
–total frequency of term j
–pointer to the inverted (postings) list for term j 28
Postings File (Inverted List)
• For each distinct term in the vocabulary, the postings file stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document
• Each list consists of one or many individual postings
Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in the postings file allows:
–the vocabulary to be kept in memory at search time, even for a large text collection, and
–the postings file to be kept on disk for access to documents
29
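The vocabulary/postings construction described above can be sketched as follows. The two tiny document texts are hypothetical, chosen to reuse terms from the slides' tables; each vocabulary record keeps the document count n_j and total frequency, while the postings map documents to term frequencies.

```python
# A sketch of building the vocabulary and postings files.
# Document texts are hypothetical; terms are lowercased words only.
import re
from collections import defaultdict

docs = {
    1: "the bus arrived and the bus left",
    2: "an act of the parliament",
}

def build_inverted_index(docs):
    # postings[term] maps doc_id -> term frequency in that document
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term][doc_id] += 1
    # vocabulary record per term: (document count n_j, total frequency)
    vocabulary = {
        term: (len(plist), sum(plist.values()))
        for term, plist in sorted(postings.items())
    }
    return vocabulary, postings

vocabulary, postings = build_inverted_index(docs)
print(vocabulary["bus"])        # (1, 2): in 1 document, total frequency 2
print(dict(postings["the"]))    # {1: 2, 2: 1}
```

In a real system the `vocabulary` dict is the in-memory word list and `postings` lives on disk; here both are in memory for clarity.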
Organization of Index File
The index file is organized as a vocabulary (word list) whose entries point to inverted lists in the postings file, which in turn point to the documents.

Vocabulary (word list):
Term   No of Doc  Tot freq  Pointer to posting
Act    3          3         → inverted list
Bus    3          4         → inverted list
pen    1          1         → inverted list
total  2          3         → inverted list
30
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious. 31
Sorting the Vocabulary
• After all documents have been parsed, the inverted file is sorted by terms
– the inverted index may record term locations within each document during parsing

Term-Doc pairs in parse order:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sorted by term (then by document):
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, I 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
32
Remove stop-words, apply stemming
& compute term frequency
•Multiple term entries in a single document are merged and frequency information added
•Counting the number of occurrences of terms in the collection helps to compute TF

Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
capitol   1      1
caesar    1      1
caesar    2      2
enact     1      1
julius    1      1
kill      1      2
noble     2      1
33
Vocabulary and postings file
The file is commonly split into a dictionary and a postings file; each vocabulary entry keeps a pointer into the postings.

Vocabulary (term → pointer)    Postings (Doc #, TF)
ambition →                     (2, 1)
brutus →                       (1, 1), (2, 1)
capitol →                      (1, 1)
caesar →                       (1, 1), (2, 2)
enact →                        (1, 1)
julius →                       (1, 1)
kill →                         (1, 2)
noble →                        (2, 1)
Inverted index storage
•Separation of the inverted file into a vocabulary and a postings file is a good idea.
–Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• Example: from 1,000,000,000 documents, there may be only 1,000,000 distinct words.
–The postings file requires much more space.
• For each word appearing in the text, we keep statistical information related to its occurrence in documents.
• Each posting's pointer to a document requires extra space.
35
Suffix Tree
36
Suffix trie
• What is a suffix? A suffix is a substring that occurs at the end of a given string.
– Each position in the text is considered a text suffix
– If txt = t1t2...ti...tn is a string, then Ti = ti ti+1...tn is the suffix of txt that starts at position i
• Example: txt = mississippi txt = GOOGOL
T1 = mississippi; T1 = GOOGOL
T2 = ississippi; T2 = OOGOL
T3 = ssissippi; T3 = OGOL
T4 = sissippi; T4 = GOL
T5 = issippi; T5 = OL
T6 = ssippi; T6 = L
T7 = sippi;
T8 = ippi;
T9 = ppi;
T10 = pi;
T11 = i; 37
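The suffix lists above can be generated directly; note that the slides' 1-based T_i corresponds to the zero-based slice `txt[i-1:]` in Python.

```python
# Generate all suffixes T_1 ... T_n of a string, longest first.
def suffixes(txt):
    return [txt[i:] for i in range(len(txt))]

print(suffixes("GOOGOL"))
# ['GOOGOL', 'OOGOL', 'OGOL', 'GOL', 'OL', 'L']
```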
Suffix trie
A suffix trie is an ordinary trie whose input strings are all the suffixes of the text.
–Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1, the last symbol has index n, the number of symbols in the text).
–To build the suffix trie we use these indices instead of the actual objects.
The structure has several advantages:
–It requires less storage space.
Since the suffix trie stores suffixes, common prefixes among these suffixes are shared in the structure. For example, in "banana" the suffixes "anana", "ana", and "a" share the prefix "a".
–We do not have to store the same object twice (no duplicates).
For instance, in the word "banana", the string "ana" occurs more than once but is only stored once in the trie.
38
Suffix Trie
Construct a suffix trie for the following string: GOOGOL
We begin by giving a position to every suffix in the text, from left to right, according to each character's position in the string.
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
Build a suffix trie for all n suffixes of the text.
Note: the resulting tree has n leaves and height n.
This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
39
Suffix tree
• A suffix tree is a member of the trie family. It is a trie of all the proper suffixes of S
–The suffix tree is created by compacting unary nodes (nodes with only one child) of the suffix trie.
• We store pointers rather than words in the leaves.
–It is also possible to replace the string on every edge by a pair (a, b), where a & b are the beginning and end indices of the string, i.e.:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
40
Example: Suffix tree
Let s = abab; a suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
{ $, b$, ab$, bab$, abab$ }
We label each leaf with the starting position of the corresponding suffix.
[Figure: suffix tree of abab$ with leaves labeled 1 (abab$), 2 (bab$), 3 (ab$), 4 (b$), 5 ($)]
41
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S
To make the suffixes prefix-free we add a special character, $, at the end of each s.
To associate each suffix with a unique string in S, add a different special symbol to each s
– this avoids overlap and helps in distinguishing the end of one suffix from another
Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
Ex.: Let s1 = abab & s2 = aab; the suffix sets are
{ $, b$, ab$, bab$, abab$ } for s1 and { #, b#, ab#, aab# } for s2
[Figure: generalized suffix tree for abab$ and aab#, with leaves labeled by string and starting position]
42
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of S is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
–Start at the root
–Go down the tree, taking the corresponding branch at each step
–If S corresponds to a node, then return all leaves in its sub-tree
–If a NIL pointer is encountered before reaching the end of S, then S is not in the tree
Example (for the text GOOGOL$):
• If S = "GO" we take the GO path and return the suffixes: GOOGOL$, GOL$.
• If S = "OR" we take the O path and then hit a NIL pointer, so "OR" is not in the tree.
43
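The search procedure above can be sketched on a naive, uncompressed suffix trie built in O(n²) time; the text GOOGOL$ is the slides' example. The `"$pos"` marker key is a hypothetical convention of this sketch for storing each suffix's 1-based start position at its leaf.

```python
# A sketch of suffix-trie search: dicts as trie nodes, built naively.
def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})   # create child if missing
        node["$pos"] = i + 1                 # 1-based start of this suffix
    return root

def find(trie, pattern):
    # Go down the trie one character at a time; a missing child is the
    # "NIL pointer" case, so the pattern does not occur in the text.
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Pattern matched a path: collect all suffix positions below this node.
    positions, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "$pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

trie = build_suffix_trie("GOOGOL$")
print(find(trie, "GO"))   # [1, 4]: suffixes GOOGOL$ and GOL$
print(find(trie, "OR"))   # []: hit a NIL pointer
```

A real suffix tree would compact unary chains and store (a, b) edge index pairs as described above; the search logic is the same.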