Vector Space Model

The vector space model represents documents and queries as vectors of identifiers such as index terms. Each dimension corresponds to a separate term, with non-zero values indicating the term's presence in the document. Documents are compared to queries by calculating the cosine similarity between their vectors. This model allows ranking documents by relevance to a query and supports partial matching. It was an improvement over the Boolean model, but it has limitations such as not capturing term order or semantics.

Vector space model


Vector space model or term vector model is an algebraic model for representing text
documents (and objects in general) as vectors of identifiers, such as index terms. It is used in
information filtering, information retrieval, indexing and relevancy rankings.
Its first use was in the SMART Information Retrieval System.
Definitions
Documents and queries are represented as vectors.


Each dimension corresponds to a separate term. If a term occurs in the document, its value in the
vector is non-zero. Several different ways of computing these values, also known as (term)
weights, have been developed. One of the best known schemes is tf-idf weighting (see the
example below).
The definition of term depends on the application. Typically terms are single words, keywords,
or longer phrases. If the words are chosen to be the terms, the dimensionality of the vector is the
number of words in the vocabulary (the number of distinct words occurring in the corpus).
Vector operations can be used to compare documents with queries.
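As an illustration of these definitions, the sketch below maps a toy corpus and a query onto term-count vectors over a shared vocabulary; the corpus, the whitespace tokenization and all names are assumptions made for this example, not part of any particular system.

```python
# Minimal sketch: documents and a query as term-count vectors.
# The toy corpus and whitespace tokenization are illustrative assumptions.
docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
query = "gold silver truck"

# Vocabulary: one dimension per distinct word occurring in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(text, vocab):
    """Return a raw term-frequency vector over the given vocabulary."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(term, 0) for term in vocab]

doc_vectors = [to_vector(d, vocab) for d in docs]
query_vector = to_vector(query, vocab)
print(vocab)
print(doc_vectors[0])  # non-zero entries mark terms present in the document
```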
Applications

Relevance rankings of documents in a keyword search can be calculated, using the assumptions
of document similarities theory, by comparing the deviation of angles between each document
vector and the original query vector where the query is represented as the same kind of vector as
the documents.
In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the
angle itself:

$$\cos\theta = \frac{\mathbf{d_2} \cdot \mathbf{q}}{\lVert \mathbf{d_2} \rVert \, \lVert \mathbf{q} \rVert}$$

where $\mathbf{d_2} \cdot \mathbf{q}$ is the intersection (i.e. the dot product) of the document vector $\mathbf{d_2}$
and the query vector $\mathbf{q}$, $\lVert \mathbf{d_2} \rVert$ is the norm of vector $\mathbf{d_2}$, and $\lVert \mathbf{q} \rVert$ is the norm
of vector $\mathbf{q}$. The norm of a vector is calculated as:

$$\lVert \mathbf{d} \rVert = \sqrt{\sum_{i=1}^{n} d_i^2}$$

As all vectors under consideration by this model are elementwise nonnegative, a cosine value of
zero means that the query and document vector are orthogonal and have no match (i.e. the query
term does not exist in the document being considered). See cosine similarity for further
information.
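The cosine and the norm translate directly into code; a minimal sketch, assuming plain Python lists as vectors (the function names are illustrative):

```python
import math

def norm(v):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def cosine(d, q):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(di * qi for di, qi in zip(d, q))
    denominator = norm(d) * norm(q)
    return dot / denominator if denominator else 0.0

# With nonnegative weights, a cosine of 0 means no query term occurs in the document.
print(cosine([1, 1, 0], [1, 0, 1]))  # 0.5
```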
Example: tf-idf weights
In the classic vector space model proposed by Salton, Wong and Yang [1], the term-specific
weights in the document vectors are products of local and global parameters. The model is
known as the term frequency-inverse document frequency model. The weight vector for document d
is $\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, where

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

and

$\mathrm{tf}_{t,d}$ is the term frequency of term t in document d (a local parameter);
$\log \frac{|D|}{|\{d' \in D : t \in d'\}|}$ is the inverse document frequency (a global parameter), where $|D|$ is
the total number of documents in the document set and $|\{d' \in D : t \in d'\}|$ is the number
of documents containing the term t.
Using the cosine, the similarity between document $d_j$ and query $q$ can be calculated as:

$$\mathrm{sim}(d_j, q) = \frac{\mathbf{d_j} \cdot \mathbf{q}}{\lVert \mathbf{d_j} \rVert \, \lVert \mathbf{q} \rVert} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$$


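Putting the weight formula and the cosine together, a hedged sketch of tf-idf ranking over a toy corpus might look as follows; the corpus, the whitespace tokenization and the use of the natural logarithm are assumptions for illustration only:

```python
import math

# Toy corpus and query; whitespace tokenization is an illustrative assumption.
docs = [
    "shipment of gold damaged in a fire".split(),
    "delivery of silver arrived in a silver truck".split(),
    "shipment of gold arrived in a truck".split(),
]
query = "gold silver truck".split()

vocab = sorted({t for d in docs for t in d})
D = len(docs)                                          # |D|, total number of documents
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def tfidf_vector(tokens):
    """Component w_{t,d} = tf_{t,d} * log(|D| / df_t) for every vocabulary term."""
    return [tokens.count(t) * math.log(D / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tfidf_vector(query)
scores = {j: cosine(tfidf_vector(docs[j]), q_vec) for j in range(D)}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))  # documents ranked by similarity
```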
Advantages
The vector space model has the following advantages over the Standard Boolean model:
1. Simple model based on linear algebra
2. Term weights not binary
3. Allows computing a continuous degree of similarity between queries and documents
4. Allows ranking documents according to their possible relevance
5. Allows partial matching
Limitations
The vector space model has the following limitations:
1. Long documents are poorly represented because they have poor similarity values (a small
scalar product and a large dimensionality)
2. Search keywords must precisely match document terms; word substrings might result in a
"false positive match"
3. Semantic sensitivity; documents with similar context but different term vocabulary won't
be associated, resulting in a "false negative match".
4. The order in which the terms appear in the document is lost in the vector space
representation.
5. Theoretically assumes terms are statistically independent.
6. Weighting is intuitive but not very formal.
Many of these difficulties can, however, be overcome by the integration of various tools,
including mathematical techniques such as singular value decomposition and lexical databases
such as WordNet.
Models based on and extending the vector space model
Models based on and extending the vector space model include:
Generalized vector space model
Latent semantic analysis
Term Discrimination
Rocchio Classification

Software that implements the vector space model
The following software packages may be of interest to those wishing to experiment with vector
models and implement search services based upon them.
Free open source software
Apache Lucene. Apache Lucene is a high-performance, full-featured text search engine library
written entirely in Java.
SemanticVectors. Semantic Vector indexes, created by applying a Random Projection
algorithm (similar to Latent semantic analysis) to term-document matrices created using
Apache Lucene.
Gensim is a Python+NumPy framework for Vector Space modelling. It contains
incremental (memory-efficient) algorithms for Tfidf, Latent Semantic Indexing,
Random Projections and Latent Dirichlet Allocation.
Weka. Weka is a popular data mining package for Java including WordVectors and Bag Of
Words models.
Compressed vector space in C++ by Antonio Gulli
Text to Matrix Generator (TMG), a MATLAB toolbox that can be used for various tasks in
text mining, specifically (i) indexing, (ii) retrieval, (iii) dimensionality reduction, (iv)
clustering, and (v) classification. Most of TMG is written in MATLAB and parts in Perl. It
contains implementations of LSI, clustered LSI, NMF and other methods.
SenseClusters, an open source package that supports context and word clustering using
Latent Semantic Analysis and word co-occurrence matrices.
S-Space Package, a collection of algorithms for exploring and working with statistical
semantics.
Further reading
G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing,"
Communications of the ACM, vol. 18, no. 11, pages 613–620. (Article in which a vector space
model was presented)
David Dubin (2004), The Most Influential Paper Gerard Salton Never Wrote (Explains
the history of the Vector Space Model and the non-existence of a frequently cited
publication)
Description of the vector space model

Description of the classic vector space model by Dr E. Garcia
Relationship of vector space search to the "k-Nearest Neighbor" search
Scoring, term weighting and the vector space model:
http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html
Thus far we have dealt with indexes that support Boolean queries: a document either matches or
does not match a query. In the case of large document collections, the resulting number of
matching documents can far exceed the number a human user could possibly sift through.
Accordingly, it is essential for a search engine to rank-order the documents matching a query. To
do this, the search engine computes, for each matching document, a score with respect to the
query at hand. In this chapter we initiate the study of assigning a score to a (query, document)
pair. This chapter consists of three main ideas.
1. We introduce parametric and zone indexes in Section 6.1 , which serve two purposes.
First, they allow us to index and retrieve documents by metadata such as the language in
which a document is written. Second, they give us a simple means for scoring (and
thereby ranking) documents in response to a query.
2. Next, in Section 6.2 we develop the idea of weighting the importance of a term in a
document, based on the statistics of occurrence of the term.
3. In Section 6.3 we show that by viewing each document as a vector of such weights, we
can compute a score between a query and each document. This view is known as vector
space scoring.
Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7
develops computational aspects of vector space scoring, and related topics.
As we develop these ideas, the notion of a query will assume multiple nuances. In Section 6.1 we
consider queries in which specific query terms occur in specified regions of a matching
document. Beginning Section 6.2 we will in fact relax the requirement of matching specific
regions of a document; instead, we will look at so-called free text queries that simply consist of
query terms with no specification on their relative order, importance or where in a document they
should be found. The bulk of our study of scoring will be in this latter notion of a query being
such a set of terms.

Parametric and zone indexes
We have thus far viewed a document as a sequence of terms. In fact, most documents have
additional structure. Digital documents generally encode, in machine-recognizable form, certain
metadata associated with each document. By metadata, we mean specific forms of data about a
document, such as its author(s), title and date of publication. This metadata would generally
include fields such as the date of creation and the format of the document, as well the author and
possibly the title of the document. The possible values of a field should be thought of as finite -
for instance, the set of all dates of authorship.
Consider queries of the form ``find documents authored by William Shakespeare in 1601,
containing the phrase alas poor Yorick''. Query processing then consists as usual of postings
intersections, except that we may merge postings from standard inverted as well as parametric
indexes . There is one parametric index for each field (say, date of creation); it allows us to select
only the documents matching a date specified in the query. Figure 6.1 illustrates the user's view
of such a parametric search. Some of the fields may assume ordered values, such as dates; in the
example query above, the year 1601 is one such field value. The search engine may support
querying ranges on such ordered values; to this end, a structure like a B-tree may be used for the
field's dictionary.
Figure 6.1: Parametric search. In this example we have a collection with fields allowing us to select
publications by zones such as Author and fields such as Language.
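A rough sketch of the postings intersection described above, assuming small in-memory dictionaries as indexes; the document IDs and field values are invented, and the phrase "alas poor Yorick" is reduced to a plain conjunction of its terms for brevity (a real engine would also verify the phrase and might use a B-tree for range queries over ordered field values):

```python
# Hypothetical in-memory indexes mapping to sorted document IDs (illustrative only).
inverted_index = {           # standard inverted index: term -> doc IDs
    "alas": [1, 4, 7],
    "poor": [1, 4],
    "yorick": [4, 9],
}
author_index = {"william shakespeare": [1, 3, 4, 9]}   # parametric index on Author
year_index = {1601: [4, 5], 1602: [6, 9]}               # parametric index on Date

def intersect(*postings):
    """Intersect several sorted postings lists, starting with the shortest."""
    lists = sorted(postings, key=len)
    result = set(lists[0])
    for p in lists[1:]:
        result &= set(p)
    return sorted(result)

# "documents authored by William Shakespeare in 1601, containing alas, poor, yorick"
hits = intersect(
    inverted_index["alas"], inverted_index["poor"], inverted_index["yorick"],
    author_index["william shakespeare"], year_index[1601],
)
print(hits)  # -> [4]
```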
Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a
field may take on a relatively small set of values, a zone can be thought of as an arbitrary,
unbounded amount of text. For instance, document titles and abstracts are generally treated as
zones. We may build a separate inverted index for each zone of a document, to support queries
such as ``find documents with merchant in the title and william in the author list and the phrase
gentle rain in the body''. This has the effect of building an index that looks like Figure 6.2.

Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of
languages, or the set of dates), the dictionary for a zone index must structure whatever
vocabulary stems from the text of that zone.

In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in
the postings. In Figure 6.3 for instance, we show how occurrences of william in the title and
author zones of various documents are encoded. Such an encoding is useful when the size of the
dictionary is a concern (because we require the dictionary to fit in main memory). But there is
another important reason why the encoding of Figure 6.3 is useful: the efficient computation of
scores using a technique we will call weighted zone scoring .

Figure 6.3: Zone index in which the zone is encoded in the postings rather than the dictionary.
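To make the idea concrete, here is a hedged sketch of postings that record the zones in which a term occurs, together with a simple weighted zone score; the zone weights and the toy postings are invented for illustration:

```python
# Postings that carry, for each document, the zones in which the term occurs (invented data).
postings = {
    "william": {2: {"title", "author"}, 3: {"author"}, 5: {"body"}},
    "merchant": {2: {"title", "body"}, 5: {"body"}},
}

# Illustrative zone weights summing to 1 (weighted zone scoring).
zone_weights = {"title": 0.5, "author": 0.3, "body": 0.2}

def weighted_zone_score(doc_id, query_terms):
    """Sum, over zones, of the zone weight when every query term occurs in that zone."""
    score = 0.0
    for zone, weight in zone_weights.items():
        if all(zone in postings.get(t, {}).get(doc_id, set()) for t in query_terms):
            score += weight
    return score

for doc in (2, 3, 5):
    print(doc, weighted_zone_score(doc, ["william", "merchant"]))
# doc 2 scores 0.5 (both terms in the title), doc 5 scores 0.2 (both in the body), doc 3 scores 0.0
```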
Term frequency and weighting
Thus far, scoring has hinged on whether or not a query term is present in a zone within a
document. We take the next logical step: a document or zone that mentions a query term more
often has more to do with that query and therefore should receive a higher score. To motivate
this, we recall the notion of a free text query introduced in Section 1.4 : a query in which the
terms of the query are typed freeform into the search interface, without any connecting search
operators (such as Boolean operators). This query style, which is extremely popular on the web,
views the query as simply a set of words. A plausible scoring mechanism then is to compute a
score that is the sum, over the query terms, of the match scores between each query term and the
document.
Towards this end, we assign to each term in a document a weight for that term, that depends on
the number of occurrences of the term in the document. We would like to compute a score
between a query term $t$ and a document $d$, based on the weight of $t$ in $d$. The simplest
approach is to assign the weight to be equal to the number of occurrences of term $t$ in
document $d$. This weighting scheme is referred to as term frequency and is denoted $\mathrm{tf}_{t,d}$, with
the subscripts denoting the term and the document in order.
For a document $d$, the set of weights determined by the tf weights above (or indeed any
weighting function that maps the number of occurrences of $t$ in $d$ to a positive real value) may
be viewed as a quantitative digest of that document. In this view of a document, known in the
literature as the bag of words model , the exact ordering of the terms in a document is ignored but
the number of occurrences of each term is material (in contrast to Boolean retrieval). We only
retain information on the number of occurrences of each term. Thus, the document ``Mary is
quicker than John'' is, in this view, identical to the document ``John is quicker than Mary''.
Nevertheless, it seems intuitive that two documents with similar bag of words representations are
similar in content. We will develop this intuition further in Section 6.3 .
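The bag of words view is easy to demonstrate in code; a minimal sketch using Python's Counter on the two example sentences from the text:

```python
from collections import Counter

doc_a = "Mary is quicker than John"
doc_b = "John is quicker than Mary"

bag_a = Counter(doc_a.lower().split())
bag_b = Counter(doc_b.lower().split())

print(bag_a == bag_b)  # True: term order is lost, only occurrence counts remain
```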
Before doing so we first study the question: are all words in a document equally important?
Clearly not; in Section 2.2.2 we looked at the idea of stop words - words that we
decide not to index at all, and therefore do not contribute in any way to retrieval and scoring.
Inverse document frequency
Raw term frequency as above suffers from a critical problem: all terms are considered equally
important when it comes to assessing relevancy on a query. In fact certain terms have little or no
discriminating power in determining relevance. For instance, a collection of documents on the
auto industry is likely to have the term auto in almost every document. To this end, we introduce
a mechanism for attenuating the effect of terms that occur too often in the collection to be
meaningful for relevance determination. An immediate idea is to scale down the term weights of
terms with high collection frequency, defined to be the total number of occurrences of a term in
the collection. The idea would be to reduce the weight of a term by a factor that grows with its
collection frequency.
Instead, it is more commonplace to use for this purpose the document frequency $\mathrm{df}_t$, defined to
be the number of documents in the collection that contain a term $t$. This is because in trying to
discriminate between documents for the purpose of scoring it is better to use a document-level
statistic (such as the number of documents containing a term) than to use a collection-wide
statistic for the term.

Figure 6.7: Collection frequency (cf) and document frequency (df) behave differently, as in this
example from the Reuters collection.

The reason to prefer df to cf is illustrated in Figure 6.7 , where a simple example shows that
collection frequency (cf) and document frequency (df) can behave rather differently. In
particular, the cf values for both try and insurance are roughly equal, but their df values differ
significantly. Intuitively, we want the few documents that contain insurance to get a higher boost
for a query on insurance than the many documents containing try get from a query on try.
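The distinction between cf and df can be reproduced on a toy corpus; in the sketch below the documents are invented solely to show that two terms can have similar collection frequencies while their document frequencies differ:

```python
from collections import Counter

# Invented toy corpus: "try" appears a few times across many documents,
# "insurance" appears many times in a single document.
docs = [
    "we try and try to improve".split(),
    "please try again later".split(),
    "try the other option".split(),
    "insurance insurance insurance insurance policy".split(),
]

cf = Counter(t for d in docs for t in d)        # total occurrences in the collection
df = Counter(t for d in docs for t in set(d))   # number of documents containing the term

for term in ("try", "insurance"):
    print(term, "cf =", cf[term], "df =", df[term])
# try: cf = 4, df = 3   insurance: cf = 4, df = 1  (similar cf, very different df)
```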
How is the document frequency df of a term used to scale its weight? Denoting as usual the total
number of documents in a collection by $N$, we define the inverse document frequency $\mathrm{idf}_t$ of a
term $t$ as follows:

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} \qquad (21)$$


Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure
6.8 gives an example of idf's in the Reuters collection of 806,791 documents; in this example
logarithms are to the base 10. In fact, as we will see in Exercise 6.2.2 , the precise base of the
logarithm is not material to ranking. We will give on page 11.3.3 a justification of the particular
form in Equation 21.
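Equation 21 translates directly into code; a small sketch using base-10 logarithms as in the Reuters example (the document frequencies below are invented placeholders, not the values from Figure 6.8):

```python
import math

N = 806_791  # collection size used in the book's Reuters example

# Invented document frequencies, only to show how idf behaves.
df = {"rare-term": 120, "car": 20_000, "try": 80_000, "the": 700_000}

idf = {t: math.log10(N / df_t) for t, df_t in df.items()}
for term, value in sorted(idf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} idf = {value:.2f}")
# A rare term gets a high idf; a term occurring in nearly every document gets an idf near 0.
```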

Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency, to produce
a composite weight for each term in each document. The tf-idf weighting scheme assigns to term
$t$ a weight in document $d$ given by

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t \qquad (22)$$



In other words, $\text{tf-idf}_{t,d}$ assigns to term $t$ a weight in document $d$ that is
1. highest when $t$ occurs many times within a small number of documents (thus lending
high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents
(thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
At this point, we may view each document as a vector with one component corresponding to
each term in the dictionary, together with a weight for each component that is given by (22). For
dictionary terms that do not occur in a document, this weight is zero. This vector form will prove
to be crucial to scoring and ranking; we will develop these ideas in Section 6.3 . As a first step,
we introduce the overlap score measure: the score of a document $d$ is the sum, over all query
terms, of the number of times each of the query terms occurs in $d$. We can refine this idea so
that we add up not the number of occurrences of each query term in $d$, but instead the tf-idf
weight of each term in $d$.

$$\text{Score}(q, d) = \sum_{t \in q} \text{tf-idf}_{t,d} \qquad (23)$$


In Section 6.3 we will develop a more rigorous form of Equation 23.
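A hedged sketch of the score of Equation 23, computed over a toy two-document collection with base-10 logarithms (the corpus and tokenization are assumptions for illustration):

```python
import math

docs = {
    "d1": "the car is driven on the road".split(),
    "d2": "the truck is driven on the highway".split(),
}
query = "car insurance".split()

N = len(docs)
df = {t: sum(1 for toks in docs.values() if t in toks)
      for tokens in docs.values() for t in tokens}

def tf_idf(t, tokens):
    """tf-idf_{t,d} = tf_{t,d} * log10(N / df_t); zero for terms absent from the collection."""
    if t not in df or df[t] == 0:
        return 0.0
    return tokens.count(t) * math.log10(N / df[t])

def score(query_terms, tokens):
    """Equation 23: sum of the tf-idf weights of the query terms in the document."""
    return sum(tf_idf(t, tokens) for t in query_terms)

for name, tokens in docs.items():
    print(name, round(score(query, tokens), 3))
# "insurance" is absent from the collection and contributes 0; only d1 matches "car".
```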
Exercises.
Why is the idf of a term always finite?
What is the idf of a term that occurs in every document? Compare this with the use of
stop word lists.
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in
Figure 6.9 .


Figure 6.9: Table of tf values for Exercise 6.2.2.
Compute the tf-idf weights for the terms car, auto, insurance, best, for each document,
using the idf values from Figure 6.8 .
Can the tf-idf weight of a term in a document exceed 1?
How does the base of the logarithm in (21) affect the score calculation in (23)? How does
the base of the logarithm affect the relative scores of two documents on a given query?
If the logarithm in (21) is computed base 2, suggest a simple approximation to the idf of a
term.
