Chapter 3 Part 1

The document discusses improving the effectiveness and efficiency of information retrieval systems. It describes the indexing and searching subsystems, with indexing involving organizing documents with extracted keywords offline and searching being the online process of finding relevant documents for a user's query. Various techniques for indexing texts, accessing texts, and processing indexes are also covered.


Indexing structure

Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more relevant documents for a user's query
–Effectiveness is measured in terms of precision, recall, …
–Main emphasis: stemming, stopword removal, weighting schemes, matching algorithms

• Improving the efficiency of the system
–The concern here is reducing storage space requirements and improving searching time, indexing time, access time, …
–Main emphasis: compression, indexing structures, space-time tradeoffs
Subsystems of an IR system
The two subsystems of an IR system:
–Indexing:
• an offline process of organizing documents using keywords extracted from the collection
• indexing is used to speed up access to desired information from the document collection as per the user's query

–Searching:
• an online process that scans the document corpus to find the relevant documents that match the user's query
Indexing Subsystem
The indexing pipeline (each step consumes the output of the previous one):
1. Documents: assign a document identifier to each document (document IDs)
2. Tokenization: split each document into tokens
3. Stopword removal: discard stoplist tokens, keeping the non-stoplist tokens
4. Stemming & normalization: reduce the tokens to stemmed terms
5. Term weighting: assign a weight to each index term
The weighted index terms are written to the index file.

Searching Subsystem
The searching pipeline:
1. Parse query: split the user's query into query tokens
2. Stopword removal: keep the non-stoplist tokens
3. Stemming & normalization: reduce the tokens to stemmed terms
4. Term weighting: weight the query terms
5. Similarity measure: compare the query terms against the index terms in the index file
6. Ranking: order the matching documents to produce the ranked, relevant document set
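The indexing pipeline above can be sketched in a few lines of Python. This is a minimal sketch: the stoplist and the suffix-stripping stemmer are illustrative stand-ins (a real system would use a full stopword list and e.g. the Porter stemmer), and raw term frequency stands in for a full weighting scheme.

```python
from collections import Counter

STOPWORDS = {"the", "is", "a", "of", "to", "in"}   # illustrative subset, not a full stoplist

def stem(token):
    # Crude suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def index_documents(docs):
    """docs: {doc_id: text}. Returns a weighted index {term: {doc_id: tf}}."""
    index = {}
    for doc_id, text in docs.items():                               # assign document identifier
        tokens = [t.strip(".,;:!?") for t in text.lower().split()]  # tokenization
        tokens = [t for t in tokens if t and t not in STOPWORDS]    # stopword removal
        terms = [stem(t) for t in tokens]                           # stemming & normalization
        for term, tf in Counter(terms).items():                     # term weighting (raw TF)
            index.setdefault(term, {})[doc_id] = tf
    return index

index = index_documents({1: "New home sales rise", 2: "Home sales rise in July"})
print(index["sale"])   # {1: 1, 2: 1}
```

The searching subsystem applies the same token/stopword/stemming steps to the query before probing this index, which is why the two pipelines share their middle stages.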
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search what was not first indexed in some manner or other
– indexing of documents or objects is done in order to make them searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language

Knowing searching is knowing indexing.


Implementation Issues
•Storage of text:
–The need for text compression: to reduce storage space

•Indexing text
–Organizing indexes
• What techniques to use? How to select them?
–Storage of indexes
• Is compression required? Do we store indexes in memory or on disk?

•Accessing text
–Accessing indexes
• How to access indexes? What data/file structure to use?
–Processing indexes
• How to search a given query in the index? How to update the index?
–Accessing documents
Text Compression
•Text compression is about finding ways to represent the text in fewer bits or bytes
•Advantages:
–saves storage space
–speeds up document transmission
–searching the compressed text can take less time
•Common compression methods
–Statistical methods: require statistical information about the frequency of occurrence of symbols in the document
E.g. Huffman coding
•estimate the probabilities of symbols and code them one at a time, giving shorter codes to higher-probability symbols
–Adaptive methods: construct the dictionary in the course of compression
E.g. Ziv-Lempel compression:
•replaces words or symbols with a pointer to a dictionary entry
Huffman coding
•Developed in the early 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission
•The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for a binary code) with the objective of reducing memory space and decoding time per symbol
•Huffman coding is constructed based on the frequency of occurrence of letters in text documents
•Example: a tree over four symbols D1–D4, with 0 on each left branch and 1 on each right branch, gives the codes:
D1 = 000
D2 = 001
D3 = 01
D4 = 1
How to construct Huffman coding
Step 1: Create a forest of single-node trees, one for each symbol t1, t2, … tn
Step 2: Sort the forest of trees in decreasing order of symbol occurrence probability
Step 3: WHILE more than one tree exists DO
– Merge the two trees t1 and t2 with the least probabilities p1 and p2
– Label their root with the sum p1 + p2
– Associate binary codes: 1 with the right branch and 0 with the left branch
Step 4: Create a unique codeword for each symbol by traversing the tree from the root to the leaf
– Concatenate all 0s and 1s encountered during the traversal

• The resulting tree has a probability of 1 at its root and the symbols at its leaf nodes.
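The four steps above can be sketched with a min-heap. This is a sketch, not the lecture's exact procedure: ties between equal probabilities are broken arbitrarily here, so the resulting codes may differ from a hand-built tree, while the codeword lengths remain optimal.

```python
import heapq

def huffman_codes(freqs):
    """freqs: {symbol: weight}. Returns {symbol: bitstring}."""
    # Step 1: a forest of single-node trees; Step 2: the heap keeps them ordered.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    # Step 3: repeatedly merge the two least-probable trees,
    # prefixing 0 on the left branch and 1 on the right branch.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    # Step 4: each symbol's codeword is its root-to-leaf path of 0s and 1s.
    return heap[0][2]

codes = huffman_codes({"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
                       "e": 0.3, "f": 0.2, "g": 0.1})
print(codes)
```

On the 7-symbol example alphabet from the next slide, the expected codeword length comes out to 2.6 bits per symbol, matching the hand construction.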
Example
• Consider the 7-symbol alphabet given in the following table and construct the Huffman coding.

Symbol  Probability
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

• The Huffman encoding algorithm picks, at each step, the two symbols (or subtrees) with the smallest probabilities and combines them.

Huffman code tree
• Merging order: a + b (0.1); (ab) + c (0.2); g + (abc) (0.3); d + f (0.4); e + (gabc) (0.6); finally 0.4 + 0.6 = 1 at the root.
• Labelling left branches 0 and right branches 1 gives one valid code table (Huffman codes are not unique; any assignment with these codeword lengths is optimal):
d = 00, f = 01, g = 100, c = 1010, a = 10110, b = 10111, e = 11
• Using the Huffman coding, a table can be constructed by working down the tree, left to right. This gives the binary equivalent for each symbol in terms of 1s and 0s.
• What is the Huffman binary representation for 'café'?
Exercise
1. Given the following character frequencies, apply the Huffman algorithm to find an optimal binary code:

Character:  a   b   c   d   e   t
Frequency: 16   5  12  17  10  25

2. Given the text: "for each rose, a rose is a rose"
– construct the Huffman coding
Ziv-Lempel compression
•The problem with Huffman coding is that it requires knowledge about the data before encoding takes place:
–Huffman coding requires the frequencies of symbol occurrence before codewords can be assigned to symbols

•Ziv-Lempel compression
–does not rely on previous knowledge about the data
–rather, it builds this knowledge in the course of data transmission/data storage
–the Ziv-Lempel algorithm (called LZ) uses a table of codewords created during data transmission;
•each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form C0C1 … Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1 … Cn-1

Lempel-Ziv output: there are three options in assigning a code to each pattern in the list:
• If a one-symbol pattern is not in the dictionary, assign (0, symbol)
• If a multi-symbol pattern is not in the dictionary, assign (dictionaryPrefixIndex, lastPatternSymbol)
• If the last pattern of the input is already in the dictionary, assign (dictionaryPrefixIndex, ) with no symbol
Example: LZ Compression
Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ algorithm.

The compressed message is: 0A0B2C3A2A4A6B (i.e., the pairs 0A 0B 2C 3A 2A 4A 6B)

Example: Decompression
Decode (i.e., decompress) the sequence: 0A0B2C3A2A4A6B

The decompressed message is: ABBCBCABABCAABCAAB
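The worked example above can be checked with a short LZ78-style sketch. Dictionary entries are numbered from 1 and index 0 denotes the empty prefix, matching the scheme described above; this is an illustrative sketch, not a production codec.

```python
def lz78_encode(text):
    dictionary = {}          # pattern -> index
    result, pattern = [], ""
    for ch in text:
        if pattern + ch in dictionary:
            pattern += ch    # keep extending a pattern already in the dictionary
        else:
            result.append((dictionary.get(pattern, 0), ch))
            dictionary[pattern + ch] = len(dictionary) + 1
            pattern = ""
    if pattern:              # input ended on a known pattern: emit its index alone
        result.append((dictionary[pattern], ""))
    return result

def lz78_decode(pairs):
    entries, out = [""], []
    for idx, ch in pairs:
        entry = entries[idx] + ch    # prefix entry plus the explicit symbol
        entries.append(entry)
        out.append(entry)
    return "".join(out)

pairs = lz78_encode("ABBCBCABABCAABCAAB")
print(pairs)   # [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
print(lz78_decode(pairs))   # ABBCBCABABCAABCAAB
```

Running the encoder on the example string reproduces the compressed message 0A 0B 2C 3A 2A 4A 6B, and decoding it restores the original text.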
Exercise
Encode (i.e., compress) the following strings using the Lempel-Ziv
algorithm.

1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
Indexing structures: Exercise
• Discuss in detail the theoretical and algorithmic concepts (including construction, various operations, complexity, etc.) of the following commonly used data structures:

1. Data structure vs. file structure
2. Arrays (fixed and dynamic arrays), sorted arrays
3. Records and linked lists
4. Trees (AVL tree, binary tree): balanced vs. unbalanced trees
5. B-tree and its variants (B+ tree, B++ tree, B* tree)
6. Hierarchical trees (like the quad tree and its variants)
7. PAT tree and its variants
8. Disjoint trees: balanced and degenerate trees
9. Graphs
10. Hashing
11. Tries and their variants
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information from the document collection as per the user's query, such that:
– it enhances efficiency in terms of retrieval time; relevant documents are searched and retrieved quickly
Example: the author catalog in a library
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
– Remember Heaps' law: in a 1 GB text collection, the size of the vocabulary is only about 5 MB (Baeza-Yates and Ribeiro-Neto, 2005)
– This size may be further reduced by linguistic pre-processing (like stemming and other normalization methods).
• The usual unit for indexing is the word
– Index terms are used to look up records in a file.
Major Steps in Index Construction
• Source file: the collection of text documents
–A document can be described by a set of representative keywords called index terms.
• Index term selection:
–Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes
–Stop words: remove high-frequency words
• a stop list of words is used for comparison with the input text
–Word stemming and normalization: reduce words with similar meaning to their stem/root word
• suffix stripping is the common method
–Term relevance weighting: different index terms have varying relevance when used to describe document contents.
• This effect is captured through the assignment of numerical weights to each index term of a document.
• There are different index term weighting methods: TF, TF*IDF, …

• Output: a set of index terms (the vocabulary), used to index the documents that each term occurs in.
Basic Indexing Process
• Documents to be indexed: "Friends, Romans, countrymen."
• Tokenizer → token stream: Friends | Romans | countrymen
• Linguistic preprocessing → modified tokens: friend | roman | countryman
• Indexer → index file (inverted file):
friend → 2, 4
roman → 1, 2
countryman → 13, 16
Building the Index File
•An index file of a document collection is a file consisting of a list of index terms and a link to one or more documents that contain each index term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword

•An index file usually holds the index terms in sorted order.
–The sort order of the terms in the index file provides an order on the physical file

•An index file is a list of search terms organized for associative look-up, i.e., to answer a user's query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)

•For organizing the index file for a collection of documents, various options are available:
–Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.
Index File Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (insertion time, deletion time, modification time, …)

• Space overhead
–Computer storage space consumed.

• Access types supported efficiently
–Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File
•A sequential file is the most primitive file structure.
•It has no vocabulary, nor linking pointers.
•The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field:
• a particular attribute is chosen as the primary key, whose value determines the order of the records
• when the first key fails to discriminate among records, a second key is chosen to give an order
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."

Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied to generate the index terms
• The index terms in the sequential file are then sorted in alphabetical order

Term sequence before processing (term, doc #): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Resulting sequential file (No., term, doc #):
1  ambition  2
2  brutus    1
3  brutus    2
4  caesar    1
5  caesar    2
6  caesar    2
7  capitol   1
8  enact     1
9  julius    1
10 kill      1
11 kill      1
12 noble     2
Sequential File
• Its main advantages:
– easy to implement;
– provides fast access to the next record using lexicographic order;
– instead of a linear-time search, one can search in logarithmic time using binary search.
• Its disadvantages:
– difficult to update: the index must be rebuilt if a new term is added, and inserting a new record may require moving a large proportion of the file;
– random access is extremely slow.
• The update problem can be solved:
– by ordering records by date of acquisition rather than by key value; the newest entries are then added at the end of the file and pose no difficulty for updating. But searching becomes very tough: it requires linear time
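The logarithmic-time lookup mentioned above can be sketched as a binary search over sorted (term, doc) records; the records here are illustrative, drawn from the Caesar example, and `bisect` supplies the O(log n) search.

```python
import bisect

# A sequential file: records sorted on the key field (the term).
records = [("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
           ("caesar", 2), ("capitol", 1), ("enact", 1), ("julius", 1),
           ("kill", 1), ("noble", 2)]

def lookup(records, term):
    """Return all doc ids whose record key equals `term`, via binary search."""
    lo = bisect.bisect_left(records, (term,))   # first record with this key
    docs = []
    while lo < len(records) and records[lo][0] == term:
        docs.append(records[lo][1])             # scan the run of equal keys
        lo += 1
    return docs

print(lookup(records, "caesar"))   # [1, 2]
```

Note how an insertion would have to shift every record after the insertion point, which is exactly the update weakness described above.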
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it
–Building and maintaining an inverted index is relatively low-cost and low-risk: on a text of n words, an inverted index can be built in O(n) time

• Content of the inverted file:
–The data held in the inverted file includes:
• the vocabulary (the list of terms)
• the occurrences (the location and frequency of terms in the document collection)
Inverted file
• The occurrences: one record per term, listing
–the frequency of each term in each document, i.e., a count of the occurrences of keywords in a document:
• TFij, the number of occurrences of term tj in document di
• DFj, the number of documents containing tj
• maxi, the maximum frequency of any term in di
• N, the total number of documents in the collection
• CFj, the collection frequency of tj, i.e., its total number of occurrences in the collection
• …

–the locations/positions of words in the text
Inverted file
•Why the vocabulary?
–Having information about the vocabulary (the list of terms) speeds up searching for relevant documents

•Why locations?
–Having information about the location of each term within the document helps with:
•user interface design: highlighting the location of search terms
•proximity-based ranking: adjacency and NEAR operators (in Boolean searching)

•Why frequencies?
–Having information about frequency is used for:
•calculating term weights (like TF, TF*IDF, …)
•optimizing query processing
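As one illustration of term weighting from these stored frequencies, here is a common TF*IDF variant, w = tf × log10(N / df). This is only one of several formulations; the numbers come from the Caesar example used in these notes.

```python
import math

def tf_idf(tf, df, n_docs):
    """Weight of a term with frequency tf in a document, appearing in df of n_docs."""
    return tf * math.log10(n_docs / df)

# Term "caesar": tf = 2 in Doc 2, df = 2, collection of N = 2 documents.
print(tf_idf(2, 2, 2))              # 0.0 -- a term in every document carries no weight
# Term "julius": tf = 1 in Doc 1, df = 1, N = 2.
print(round(tf_idf(1, 1, 2), 3))    # 0.301
```

The example shows why DF is stored in the vocabulary: a term that occurs in every document discriminates nothing and gets zero weight, however high its TF.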
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term    CF   Doc ID   TF   Location
auto     3        2    1   66
                 19    1   213
                 29    1   45
bus      4        3    1   94
                 19    2   7, 212
                 22    1   56
taxi     1        5    1   43
train    3       11    2   3, 70
                 34    1   40
Construction of the Inverted File
•An inverted index consists of two files:
–a vocabulary file
–a postings file
Advantage of dividing the inverted file:
•Keeping a pointer in the vocabulary to the list in the postings file allows:
– the vocabulary to be kept in memory at search time, even for a large text collection, and
– the postings file to be kept on disk for accessing the documents
Inverted index storage
•Separating the inverted file into a vocabulary and a postings file is a good idea.
–Vocabulary: for searching we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(nβ), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents there may be 1,000,000 distinct words; hence the size of the index is about 100 MB, which can easily be held in the memory of a dedicated computer.

–The postings file requires much more space.
• For each word appearing in the text we keep statistical information related to its occurrences in documents.
• Each posting (a pointer to a document) requires extra space, so the postings take O(n) space overall.
•How do we speed up access to the inverted file?
Vocabulary file
•A vocabulary file (word list):
–stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order, and
–for each word, a pointer to the postings file
•The record kept for each term j in the word list contains:
–the term j
–the number of documents in which term j occurs (DFj)
–the total frequency of term j (CFj)
–a pointer to the postings (inverted) list for term j
Postings File (Inverted List)
•For each distinct term in the vocabulary, it stores a list of pointers to the documents that contain that term.
•Each element in an inverted list is called a posting, i.e., one occurrence of a term in a document.
•It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file.
–Each list consists of one or many individual postings holding the document ID, TF and location information for the given term
Organization of the Index File
Vocabulary (word list) → Postings (inverted lists) → Actual documents

Term    No of docs   Total freq   Pointer to postings
Act          3            3       → inverted list
Bus          3            4       → inverted list
pen          1            1       → inverted list
total        2            3       → inverted list
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me."

Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms

Before sorting (term, doc #): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (term, doc #): ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, I 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove stopwords, apply stemming & compute term frequency
•Multiple entries of a term for a single document are merged, and frequency information is added
•Counting the number of occurrences of terms in the collection helps to compute TF

After stopword removal and stemming (term, doc #): ambition 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, enact 1, julius 1, kill 1, kill 1, noble 2

After merging, with term frequency (term, doc #, TF):
ambition  2  1
brutus    1  1
brutus    2  1
caesar    1  1
caesar    2  2
capitol   1  1
enact     1  1
julius    1  1
kill      1  2
noble     2  1
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a postings file, with a pointer from each vocabulary entry to its posting list:

Vocabulary (term, DF, CF):
ambition  1  1
brutus    2  2
caesar    2  3
capitol   1  1
enact     1  1
julius    1  1
kill      1  2
noble     1  1

Postings (doc #, TF):
ambition → (2, 1)
brutus   → (1, 1), (2, 1)
caesar   → (1, 1), (2, 2)
capitol  → (1, 1)
enact    → (1, 1)
julius   → (1, 1)
kill     → (1, 2)
noble    → (2, 1)
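The vocabulary/postings split can be sketched as follows. This is a sketch assuming the merged (term, doc #, TF) triples produced in the previous step; the "pointer" is simply an offset into the postings list, standing in for a disk offset.

```python
def build_inverted_file(term_doc_tf):
    """term_doc_tf: list of (term, doc_id, tf) triples from the merge step.
    Returns (vocabulary, postings): vocabulary maps term -> {df, cf, ptr},
    where ptr is the offset of the term's posting list in `postings`."""
    postings, vocabulary = [], {}
    for term, doc_id, tf in sorted(term_doc_tf):
        if term not in vocabulary:
            vocabulary[term] = {"df": 0, "cf": 0, "ptr": len(postings)}
        vocabulary[term]["df"] += 1      # one more document contains the term
        vocabulary[term]["cf"] += tf     # collection frequency accumulates TF
        postings.append((doc_id, tf))
    return vocabulary, postings

triples = [("ambition", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1),
           ("capitol", 1, 1), ("caesar", 1, 1), ("caesar", 2, 2),
           ("enact", 1, 1), ("julius", 1, 1), ("kill", 1, 2), ("noble", 2, 1)]
vocab, post = build_inverted_file(triples)
print(vocab["caesar"])                  # {'df': 2, 'cf': 3, 'ptr': 3}
p = vocab["caesar"]
print(post[p["ptr"]: p["ptr"] + p["df"]])   # [(1, 1), (2, 2)]
```

At search time only `vocab` needs to live in memory; each lookup follows one pointer into the (possibly disk-resident) postings, which is exactly the design motivated above.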
Complexity Analysis
• The inverted index can be built in O(n) time, where n is the number of words in the collection
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time
• To update the inverted index, incremental indexing can be applied, which requires O(k) time, where k is the number of new index terms
Exercises
• Construct the inverted index for the following document collection:
Doc 1: New home to home sales forecasts
Doc 2: Rise in home sales in July
Doc 3: Home sales rise in July for new homes
Doc 4: July new home sales rise
