Vector Space Model
The vector space model represents documents and queries as vectors of identifiers such as index terms. Each dimension corresponds to a separate term, with non-zero values indicating the term's presence in the document. Documents are compared to queries by calculating the cosine similarity between their vectors. This model allows ranking documents by relevance to the query and supports partial matching. It was an improvement over the Boolean model but has limitations, such as not capturing term order or semantics.
Vector space model
Vector space model or term vector model is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

Definitions
Documents and queries are represented as vectors.
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below). The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). Vector operations can be used to compare documents with queries.
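As an illustration of these definitions, here is a minimal sketch of how a small corpus might be mapped to term vectors, assuming whitespace tokenization and raw term counts as the weights. The example documents and helper names are purely illustrative, not part of any particular system.

```python
from collections import Counter

def build_vocabulary(documents):
    # One vector dimension per distinct term occurring in the corpus.
    terms = {term for doc in documents for term in doc.lower().split()}
    return sorted(terms)

def to_vector(text, vocabulary):
    # Represent a document (or query) as a vector of raw term counts.
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocabulary]

documents = [
    "Shipment of gold damaged in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
]
vocabulary = build_vocabulary(documents)
doc_vectors = [to_vector(d, vocabulary) for d in documents]
query_vector = to_vector("gold silver truck", vocabulary)
```

With tf-idf weighting (described below), the raw counts would simply be rescaled by how rare each term is across the corpus.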
Applications
Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors instead of the angle itself:

\[ \cos\theta = \frac{d_2 \cdot q}{\|d_2\|\,\|q\|} \]

where \( d_2 \cdot q \) is the intersection (i.e. the dot product) of the document vector \( d_2 \) and the query vector \( q \), \( \|d_2\| \) is the norm of vector \( d_2 \), and \( \|q\| \) is the norm of vector \( q \). The norm of a vector is calculated as:

\[ \|d_2\| = \sqrt{\sum_i d_{2,i}^2} \]
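A small, self-contained sketch of the cosine computation above; the vectors are illustrative, and any non-negative weighting scheme could fill them.

```python
import math

def cosine_similarity(d, q):
    # cos(theta) = (d . q) / (||d|| * ||q||); define it as 0.0 when either norm is zero.
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

d2 = [0.0, 0.7, 1.0, 0.5]   # document vector over a 4-term vocabulary
q  = [0.0, 0.0, 1.0, 1.0]   # query vector over the same vocabulary
print(cosine_similarity(d2, q))   # roughly 0.80
```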
As all vectors under consideration by this model are elementwise nonnegative, a cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered). See cosine similarity for further information.

Example: tf-idf weights
In the classic vector space model proposed by Salton, Wong and Yang [1], the term-specific weights in the document vectors are products of local and global parameters. The model is known as the term frequency-inverse document frequency (tf-idf) model. The weight vector for document d is

\[ v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T, \quad \text{where} \quad w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|} , \]

\( \mathrm{tf}_{t,d} \) is the term frequency of term t in document d (a local parameter) and \( \log \frac{|D|}{|\{d' \in D : t \in d'\}|} \) is the inverse document frequency (a global parameter); \( |D| \) is the total number of documents in the document set and \( |\{d' \in D : t \in d'\}| \) is the number of documents containing the term t. Using the cosine, the similarity between document \( d_j \) and query q can be calculated as

\[ \mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\|d_j\|\,\|q\|} . \]
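The weighting just described can be sketched in a few lines. This is only an illustration of the formula, assuming log base 10 (matching the examples later in this document), not the implementation used by any particular retrieval system.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    # Weight each term t in document d by tf(t, d) * log(|D| / df(t)).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({t for doc in tokenized for t in doc})
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocabulary}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log10(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = tfidf_vectors([
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
])
```

Note that a term occurring in every document (such as "a", "of" or "in" above) gets idf = log(1) = 0 and therefore contributes nothing to any document vector.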
Advantages
The vector space model has the following advantages over the standard Boolean model:
1. Simple model based on linear algebra
2. Term weights are not binary
3. Allows computing a continuous degree of similarity between queries and documents
4. Allows ranking documents according to their possible relevance
5. Allows partial matching

Limitations
The vector space model has the following limitations:
1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
2. Search keywords must precisely match document terms; word substrings might result in a "false positive match".
3. Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
4. The order in which the terms appear in the document is lost in the vector space representation.
5. The model theoretically assumes terms are statistically independent.
6. Weighting is intuitive but not very formal.

Many of these difficulties can, however, be overcome by the integration of various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.

Models based on and extending the vector space model
Models based on and extending the vector space model include:
Generalized vector space model
Latent semantic analysis
Term Discrimination
Rocchio Classification
Software that implements the vector space model
The following software packages may be of interest to those wishing to experiment with vector models and implement search services based upon them.

Free open source software
Apache Lucene: a high-performance, full-featured text search engine library written entirely in Java.
SemanticVectors: semantic vector indexes, created by applying a Random Projection algorithm (similar to latent semantic analysis) to term-document matrices created using Apache Lucene.
Gensim: a Python+NumPy framework for vector space modelling. It contains incremental (memory-efficient) algorithms for tf-idf, Latent Semantic Indexing, Random Projections and Latent Dirichlet Allocation.
Weka: a popular data mining package for Java including WordVectors and Bag Of Words models.
Compressed vector space in C++ by Antonio Gulli.
Text to Matrix Generator (TMG): a MATLAB toolbox that can be used for various tasks in text mining, specifically (i) indexing, (ii) retrieval, (iii) dimensionality reduction, (iv) clustering, (v) classification. Most of TMG is written in MATLAB and parts in Perl. It contains implementations of LSI, clustered LSI, NMF and other methods.
SenseClusters: an open source package that supports context and word clustering using Latent Semantic Analysis and word co-occurrence matrices.
S-Space Package: a collection of algorithms for exploring and working with statistical semantics.

Further reading
G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pages 613-620. (Article in which a vector space model was presented)
David Dubin (2004), "The Most Influential Paper Gerard Salton Never Wrote". (Explains the history of the vector space model and the non-existence of a frequently cited publication)
Description of the vector space model
Description of the classic vector space model by Dr E. Garcia
Relationship of vector space search to the "k-Nearest Neighbor" search

Scoring, term weighting and the vector space model
http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html

Thus far we have dealt with indexes that support Boolean queries: a document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.
1. We introduce parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
2. Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.
3. In Section 6.3 we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.
Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7 develops computational aspects of vector space scoring and related topics. As we develop these ideas, the notion of a query will assume multiple nuances. In Section 6.1 we consider queries in which specific query terms occur in specified regions of a matching document. Beginning with Section 6.2 we will in fact relax the requirement of matching specific regions of a document; instead, we will look at so-called free text queries that simply consist of query terms with no specification on their relative order, importance or where in a document they should be found. The bulk of our study of scoring will be in this latter notion of a query being such a set of terms.
Parametric and zone indexes
We have thus far viewed a document as a sequence of terms. In fact, most documents have additional structure. Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. The possible values of a field should be thought of as finite - for instance, the set of all dates of authorship.

Consider queries of the form "find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick". Query processing then consists as usual of postings intersections, except that we may merge postings from standard inverted as well as parametric indexes. There is one parametric index for each field (say, date of creation); it allows us to select only the documents matching a date specified in the query. Figure 6.1 illustrates the user's view of such a parametric search. Some of the fields may assume ordered values, such as dates; in the example query above, the year 1601 is one such field value. The search engine may support querying ranges on such ordered values; to this end, a structure like a B-tree may be used for the field's dictionary.

Figure 6.1: Parametric search. In this example we have a collection with fields allowing us to select publications by zones such as Author and fields such as Language.

Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text. For instance, document titles and abstracts are generally treated as zones. We may build a separate inverted index for each zone of a document, to support queries such as "find documents with merchant in the title and william in the author list and the phrase gentle rain in the body", as sketched below. This has the effect of building an index that looks like Figure 6.2.
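As a rough sketch of the zone-index idea just described, one inverted index per zone might look like this. The document IDs and postings are invented for illustration, and the phrase query is reduced to a simple conjunction of terms for brevity.

```python
# One inverted index per zone; each maps a term to the documents containing it in that zone.
zone_index = {
    "title":  {"merchant": [2, 4, 11], "gentle": [7]},
    "author": {"william": [2, 3, 11], "shakespeare": [2, 11]},
    "body":   {"gentle": [2, 5, 11], "rain": [2, 11, 19]},
}

def docs_with(zone, term):
    return set(zone_index.get(zone, {}).get(term, []))

# "merchant in the title AND william in the author list AND gentle rain in the body"
hits = (docs_with("title", "merchant")
        & docs_with("author", "william")
        & docs_with("body", "gentle")
        & docs_with("body", "rain"))
print(sorted(hits))   # [2, 11]
```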
Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.
In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings. In Figure 6.3, for instance, we show how occurrences of william in the title and author zones of various documents are encoded. Such an encoding is useful when the size of the dictionary is a concern (because we require the dictionary to fit in main memory). But there is another important reason why the encoding of Figure 6.3 is useful: the efficient computation of scores using a technique we will call weighted zone scoring.
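A complementary sketch of the alternative just mentioned: the zone is carried in the postings rather than in the dictionary, and a weighted zone score sums the weights of the zones in which a term occurs. The postings and zone weights below are illustrative, not drawn from the text.

```python
# Single dictionary entry per term; each posting records the zones in which the term occurs.
postings = {
    "william": {2: {"author", "title"}, 3: {"author"}, 5: {"title"}},
}

# Illustrative zone weights; in practice they would be tuned (they sum to 1 here).
zone_weights = {"author": 0.2, "title": 0.3, "body": 0.5}

def weighted_zone_score(term, doc_id):
    # Sum the weights of the zones of doc_id that contain the term.
    zones = postings.get(term, {}).get(doc_id, set())
    return sum(zone_weights[z] for z in zones)

print(weighted_zone_score("william", 2))   # 0.2 + 0.3 = 0.5
print(weighted_zone_score("william", 3))   # 0.2
```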
Figure 6.3: Zone index in which the zone is encoded in the postings rather than the dictionary.

Term frequency and weighting
Thus far, scoring has hinged on whether or not a query term is present in a zone within a document. We take the next logical step: a document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score. To motivate this, we recall the notion of a free text query introduced in Section 1.4: a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators (such as Boolean operators). This query style, which is extremely popular on the web, views the query as simply a set of words. A plausible scoring mechanism then is to compute a score that is the sum, over the query terms, of the match scores between each query term and the document. Towards this end, we assign to each term in a document a weight for that term that depends on the number of occurrences of the term in the document. We would like to compute a score between a query term t and a document d, based on the weight of t in d. The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d. This weighting scheme is referred to as term frequency and is denoted \( \mathrm{tf}_{t,d} \), with the subscripts denoting the term and the document in order.

For a document d, the set of weights determined by the tf weights above (or indeed any weighting function that maps the number of occurrences of t in d to a positive real value) may be viewed as a quantitative digest of that document. In this view of a document, known in the literature as the bag of words model, the exact ordering of the terms in a document is ignored but the number of occurrences of each term is material (in contrast to Boolean retrieval). We only retain information on the number of occurrences of each term. Thus, the document "Mary is quicker than John" is, in this view, identical to the document "John is quicker than Mary". Nevertheless, it seems intuitive that two documents with similar bag of words representations are similar in content. We will develop this intuition further in Section 6.3. Before doing so we first study the question: are all words in a document equally important? Clearly not; in Section 2.2.2 we looked at the idea of stop words - words that we decide not to index at all, and therefore do not contribute in any way to retrieval and scoring.

Inverse document frequency
Raw term frequency as above suffers from a critical problem: all terms are considered equally important when it comes to assessing relevancy on a query. In fact certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the auto industry is likely to have the term auto in almost every document. To this end, we introduce a mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance determination. An immediate idea is to scale down the term weights of terms with high collection frequency, defined to be the total number of occurrences of a term in the collection. The idea would be to reduce the weight of a term by a factor that grows with its collection frequency. Instead, it is more commonplace to use for this purpose the document frequency \( \mathrm{df}_t \), defined to be the number of documents in the collection that contain a term t. This is because in trying to discriminate between documents for the purpose of scoring, it is better to use a document-level statistic (such as the number of documents containing a term) than to use a collection-wide statistic for the term.
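The distinction between the two statistics is easy to state in code. A toy sketch follows; the documents are invented purely to show that cf and df can diverge.

```python
from collections import Counter

def cf_and_df(tokenized_docs):
    # cf: total occurrences of each term in the collection.
    # df: number of documents in which each term occurs at least once.
    cf, df = Counter(), Counter()
    for doc in tokenized_docs:
        cf.update(doc)
        df.update(set(doc))
    return cf, df

docs = [
    ["try", "try", "insurance"],
    ["try"],
    ["try", "try"],
    ["insurance", "insurance", "insurance"],
]
cf, df = cf_and_df(docs)
print(cf["try"], df["try"])              # 5 occurrences spread over 3 documents
print(cf["insurance"], df["insurance"])  # 4 occurrences concentrated in 2 documents
```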
Figure 6.7: Collection frequency (cf) and document frequency (df) behave differently, as in this example from the Reuters collection.
The reason to prefer df to cf is illustrated in Figure 6.7, where a simple example shows that collection frequency (cf) and document frequency (df) can behave rather differently. In particular, the cf values for both try and insurance are roughly equal, but their df values differ significantly. Intuitively, we want the few documents that contain insurance to get a higher boost for a query on insurance than the many documents containing try get from a query on try. How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows:
\[ \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} \qquad (21) \]
Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.8 gives an example of idfs in the Reuters collection of 806,791 documents; in this example logarithms are to the base 10. In fact, as we will see in Exercise 6.2.2, the precise base of the logarithm is not material to ranking. We will give in Section 11.3.3 a justification of the particular form in Equation 21.
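A sketch of Equation 21 with the collection size quoted above; the document frequencies below are made up for illustration (they are not the Figure 6.8 values), but they show the qualitative behaviour: the rarer the term, the higher its idf, and a term present in every document gets idf 0.

```python
import math

N = 806_791  # number of documents in the Reuters collection mentioned above

def idf(df_t, n_docs=N, base=10):
    # Equation (21): idf_t = log(N / df_t)
    return math.log(n_docs / df_t, base)

# Illustrative document frequencies, not the actual Reuters figures.
for term, df_t in [("insurance", 3_997), ("car", 18_165), ("try", 76_403), ("the", 806_791)]:
    print(f"{term:10s} df={df_t:7d} idf={idf(df_t):.2f}")
```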
Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document. The tf-idf weighting scheme assigns to term t a weight in document d given by

\[ \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t . \qquad (22) \]
In other words, \( \text{tf-idf}_{t,d} \) assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.

At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (22). For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term in d, but instead the tf-idf weight of each term in d:
\[ \mathrm{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d} . \qquad (23) \]
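A minimal sketch of Equation 23, again assuming log base 10 and hypothetical collection statistics; the refinement from raw overlap counts to tf-idf weights is the only change from the overlap score measure described above.

```python
import math
from collections import Counter

def score(query_terms, doc_terms, df, n_docs):
    # Equation (23): Score(q, d) = sum over query terms t of tf-idf(t, d).
    tf = Counter(doc_terms)
    total = 0.0
    for t in query_terms:
        if df.get(t, 0) == 0:
            continue  # a term absent from the collection contributes nothing
        total += tf[t] * math.log10(n_docs / df[t])
    return total

# Hypothetical document frequencies for a 3-document collection.
df = {"gold": 2, "silver": 1, "truck": 2}
print(score(["gold", "silver", "truck"],
            "shipment of gold arrived in a truck".split(), df, n_docs=3))
```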
In Section 6.3 we will develop a more rigorous form of Equation 23.

Exercises
1. Why is the idf of a term always finite?
2. What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.
3. Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values from Figure 6.8.

Figure 6.9: Table of tf values for Exercise 6.2.2.

4. Can the tf-idf weight of a term in a document exceed 1?
5. How does the base of the logarithm in (21) affect the score calculation in (23)? How does the base of the logarithm affect the relative scores of two documents on a given query?
6. If the logarithm in (21) is computed base 2, suggest a simple approximation to the idf of a term.