Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Simple Tokenizing
Analyze text into a sequence of discrete tokens (words). Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token, but usually they are not. The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
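The simplest approach described above can be sketched in a few lines of Java. This is a minimal illustration, not VSR's own tokenizer: the class name SimpleTokenizer and the method tokenize are hypothetical, and the sketch assumes ASCII English text, so accented letters would be treated as separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of VSR): lowercase, alphabetic-only tokens.
class SimpleTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase first so tokens are case-insensitive, then treat every run of
        // non-letters (digits, punctuation, whitespace) as a separator.
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!piece.isEmpty()) {   // split() yields "" when text starts with a separator
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

Note that this policy splits "e-mail" into "e" and "mail" and drops "1999" entirely, which is exactly the loss of information the slide warns about.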
Tokenizing HTML
Should text in HTML commands not typically seen by the user be included as tokens?
Words appearing in URLs. Words appearing in meta text of images.
The simplest approach, used in VSR, is to exclude all HTML tag information from tokenization.
VSR parses HTML using utilities in the Java Swing package and collects all raw text.
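As a rough sketch of how raw text can be collected with the Swing HTML parser, the example below uses javax.swing.text.html.parser.ParserDelegator with a callback that keeps only text nodes. The class name HtmlTextExtractor is hypothetical, and this is not necessarily how VSR's HTMLFileDocument is written.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not VSR code): collects the text between HTML tags.
class HtmlTextExtractor {
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for text content only; tag names and attribute
                // values (URLs, image alt text) never reach this method.
                text.append(data).append(' ');
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

The returned string can then be passed to the same tokenizer used for plain text files.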
Documents in VSR
Document
    TextStringDocument (used for typed queries)
    FileDocument
        TextFileDocument (used for ASCII files)
        HTMLFileDocument (used for HTML files)
Stopwords
It is typical to exclude high-frequency words (e.g. function words: a, the, in, to; pronouns: I, he, she, it). Stopwords are language dependent. VSR uses a standard set of about 500 for English. For efficiency, store strings for stopwords in a hashtable to recognize them in constant time.
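A minimal sketch of constant-time stopword recognition with a hash-based set follows. The class name StopwordFilter and the eight-word list are illustrative assumptions only; VSR's actual list has about 500 entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not VSR code): filters tokens against a stopword set.
class StopwordFilter {
    // Tiny illustrative list taken from the examples in the text.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "the", "in", "to", "i", "he", "she", "it"));

    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // HashSet.contains is an expected constant-time lookup.
            if (!STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Tokens are assumed to be lowercased already, so the set stores lowercase forms only.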
Stemming
Reduce tokens to root form of words to recognize morphological variation.
computer, computational, computation all reduced to same token compute
Correct morphological analysis is language specific and can be complex. Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
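To make "blindly strips off known affixes in an iterative fashion" concrete, here is a toy suffix stripper. It is not the Porter stemmer: the class name ToyStemmer, the four-suffix list, and the minimum stem length are all assumptions chosen for illustration, and prefixes are not handled.

```java
// Toy illustration only (not the Porter stemmer, not VSR code).
class ToyStemmer {
    // Assumed, deliberately tiny suffix list.
    static final String[] SUFFIXES = {"ation", "al", "er", "s"};
    // Assumed guard so that short words are not stripped to nothing.
    static final int MIN_STEM_LENGTH = 3;

    static String stem(String token) {
        String stem = token;
        boolean stripped = true;
        // Keep stripping until no known suffix applies any more.
        while (stripped) {
            stripped = false;
            for (String suffix : SUFFIXES) {
                int remaining = stem.length() - suffix.length();
                if (stem.endsWith(suffix) && remaining >= MIN_STEM_LENGTH) {
                    stem = stem.substring(0, remaining);
                    stripped = true;
                    break;   // restart the scan on the shortened stem
                }
            }
        }
        return stem;
    }
}
```

With this list, "computational" loses "al" and then "ation", while "computer" loses "er", so all three example words end up as the same token. Because the stripping is blind, the result need not be an English word.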
Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary. Can produce unusual stems that are not English words:
computer, computational, computation all reduced to same token comput
May conflate (reduce to the same token) words that are actually distinct. May not recognize all morphological derivations.
Errors of omission:
cylinder, cylindrical
create, creation
Europe, European
Sparse Vectors
Vocabulary and therefore dimensionality of vectors can be very large, ~10^4. However, most documents and queries do not contain most words, so vectors are sparse (i.e. most entries are 0). Need efficient methods for storing and computing with sparse vectors.
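One common way to store a sparse vector is a hash map from token to weight that holds only the non-zero entries. The sketch below assumes that representation; the class name SparseVectors and its methods are hypothetical and are not VSR's own vector class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not VSR code): sparse vectors as token -> weight maps.
class SparseVectors {
    // Build a raw term-frequency vector; absent tokens implicitly have weight 0.
    static Map<String, Double> toVector(Iterable<String> tokens) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : tokens) {
            vector.merge(token, 1.0, Double::sum);
        }
        return vector;
    }

    // Dot product that touches only the entries of the smaller vector.
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : small.entrySet()) {
            Double other = large.get(entry.getKey());
            if (other != null) {
                sum += entry.getValue() * other;
            }
        }
        return sum;
    }
}
```

Storage is proportional to the number of distinct tokens in the document rather than to the vocabulary size, and the dot product costs time proportional to the smaller vector.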
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2)$$
Inverted Index
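[Figure: an index file lists each term (e.g. science, system) with its document frequency (df); each term points to a postings list of (Dj, tfj) pairs, e.g. (D7, 4), (D1, 3), (D2, 4), (D5, 2).]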
HashMap tokenHash: maps each token to an entry holding
    double idf
    ArrayList occList
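The structure above can be sketched as follows. The field names tokenHash, idf, and occList come from the slide; the class names TinyInvertedIndex, TokenEntry, and Occurrence, and the choice to identify a document by its name, are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not VSR code) of an in-memory inverted index.
class TinyInvertedIndex {
    // One posting: a document and the token's count (tf) in it.
    static class Occurrence {
        final String docName;
        final int count;
        Occurrence(String docName, int count) {
            this.docName = docName;
            this.count = count;
        }
    }

    // Per-token entry: idf (filled in later) plus the postings list.
    static class TokenEntry {
        double idf;
        final List<Occurrence> occList = new ArrayList<>();
    }

    final Map<String, TokenEntry> tokenHash = new HashMap<>();

    void addDocument(String docName, List<String> tokens) {
        // Count each token once per document, then append a single posting.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tokenHash.computeIfAbsent(e.getKey(), k -> new TokenEntry())
                     .occList.add(new Occurrence(docName, e.getValue()));
        }
    }
}
```

The length of a token's occList is its document frequency, so no separate df field is needed in this sketch.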
Computing IDF
Let N be the total number of Documents;
For each token, T, in H:
    Determine the total number of documents, M, in which T occurs (the length of T's occList);
    Set the IDF for T to log(N/M);
Note this requires a second pass through all the tokens after all documents have been indexed.
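The second pass can be sketched as below. The class name IdfCalculator and the representation of postings as lists of document names are assumptions. The slide does not state the base of the logarithm, so this sketch uses the natural log; a different base only rescales every IDF by the same constant.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not VSR code): second pass that assigns IDF values.
class IdfCalculator {
    // occLists maps each token T to the list of documents containing T.
    static Map<String, Double> computeIdf(Map<String, List<String>> occLists, int totalDocs) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : occLists.entrySet()) {
            int m = entry.getValue().size();   // M = length of T's occList
            // Natural log is an assumption; see the note above.
            idf.put(entry.getKey(), Math.log((double) totalDocs / m));
        }
        return idf;
    }
}
```

A token that occurs in every document gets IDF log(1) = 0, so it contributes nothing to any similarity score.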
Usually the query is fairly short, and therefore its vector is extremely sparse. Use inverted index to find the limited set of documents that contain at least one of the query words.
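Assume the query Q consists of tokens q1, q2, ..., qn, and that each token qi has a postings list Di1 ... DiB of about B documents.
Then retrieval time is O(|Q| B), which is typically much better than naive retrieval that examines all N documents, O(|V| N), because |Q| << |V| and B << N.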
Let Y be the length of D as stored in its DocumentReference;
Normalize D's final score to S/(L * Y);
Sort retrieved documents in R by final score and return results in an array.
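The sketch below puts the retrieval steps together under stated assumptions: S is taken to be the score accumulated for a document over the query tokens, L the length of the query vector, and both query and document weights are tf * idf. The class name CosineRetrieval and the map-based inputs are hypothetical stand-ins for VSR's own classes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not VSR code) of cosine retrieval over an inverted index.
class CosineRetrieval {
    static List<Map.Entry<String, Double>> retrieve(
            Map<String, Integer> queryCounts,            // token -> count in query
            Map<String, Map<String, Integer>> postings,  // token -> (doc -> tf)
            Map<String, Double> idf,                     // token -> IDF
            Map<String, Double> docLengths) {            // doc -> vector length Y
        Map<String, Double> scores = new HashMap<>();
        double queryLengthSquared = 0.0;
        for (Map.Entry<String, Integer> q : queryCounts.entrySet()) {
            Double tokenIdf = idf.get(q.getKey());
            if (tokenIdf == null) continue;              // token not in the index
            double w = q.getValue() * tokenIdf;          // weight of token in query
            queryLengthSquared += w * w;                 // length computed in the same loop
            Map<String, Integer> occ =
                    postings.getOrDefault(q.getKey(), Collections.emptyMap());
            for (Map.Entry<String, Integer> o : occ.entrySet()) {
                // Add (query weight) * (document weight) to the document's score S.
                scores.merge(o.getKey(), w * tokenIdf * o.getValue(), Double::sum);
            }
        }
        double l = Math.sqrt(queryLengthSquared);
        List<Map.Entry<String, Double>> results = new ArrayList<>();
        if (l == 0.0) return results;                    // no query token carries weight
        for (Map.Entry<String, Double> s : scores.entrySet()) {
            Double y = docLengths.get(s.getKey());
            if (y == null || y == 0.0) continue;         // skip documents with no length
            results.add(new AbstractMap.SimpleEntry<>(s.getKey(), s.getValue() / (l * y)));
        }
        results.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return results;
    }
}
```

Only documents that appear in the postings list of at least one query token are ever scored, which is where the efficiency gain over scanning every document comes from.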
Efficiency Note
To save computation and an extra iteration through the tokens in the query, in VSR, the computation of the length of the query vector is integrated with the processing of query tokens during retrieval.
User Interface
Until user terminates with an empty query:
    Prompt user to type a query, Q.
    Compute the ranked array of retrievals R for Q;
    Print the name of top N documents in R;
    Until user terminates with an empty command:
        Prompt user for a command for this query result:
            1) Show next N retrievals;
            2) Show the Mth retrieved document (document shown in Firefox window)
Running VSR
Invoke the system using the main method of InvertedIndex.
java ir.vsr.InvertedIndex <corpus-directory>
Make sure your CLASSPATH has /u/mooney/ir-code
Will index all files in a directory and then process queries interactively. Optional flags include:
-html: Strips HTML tags from files
-stem: Stems tokens with Porter stemmer