Boolean and Vector Space Retrieval Models
CS 293S, 2017
Some slides adapted from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Retrieval Models
• A retrieval model specifies the details of:
§ 1) Document representation
§ 2) Query representation
§ 3) Retrieval function: how to find relevant results
§ Determines a notion of relevance.
• Classical models
§ Boolean models (set theoretic)
– Extended Boolean
§ Vector space models (statistical/algebraic)
– Generalized VS
– Latent Semantic Indexing
§ Probabilistic models
Boolean Model
• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords,
connected by AND, OR, and NOT, including the use
of brackets to indicate scope.
§ Rio & Brazil | Hilo & Hawaii
§ hotel & !Hilton
• Output: Document is relevant or not. No partial
matches or ranking.
§ Can be extended to include ranking.
• Historically popular retrieval model:
§ Easy to understand. Clean formalism.
§ But still too complex for typical web users.
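A minimal sketch of the Boolean model in Python, with an invented three-document collection; each document is a keyword set and each query is a predicate over that set:

```python
# Boolean retrieval sketch: documents are keyword sets, queries are
# predicates over those sets. Collection invented for illustration.
docs = {
    1: {"rio", "brazil", "hotel"},
    2: {"hilo", "hawaii", "hotel", "hilton"},
    3: {"hotel", "hawaii"},
}

def matching(pred):
    """Return IDs of documents whose keyword set satisfies the predicate."""
    return {doc_id for doc_id, terms in docs.items() if pred(terms)}

# Rio & Brazil | Hilo & Hawaii
print(matching(lambda t: ("rio" in t and "brazil" in t)
                         or ("hilo" in t and "hawaii" in t)))  # {1, 2}

# hotel & !Hilton
print(matching(lambda t: "hotel" in t and "hilton" not in t))  # {1, 3}
```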
Query example: Shakespeare plays
Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

            Antony&Cleopatra  JuliusCaesar  Tempest  Hamlet  Othello  Macbeth
Antony             1               1           0        0       0        1
Brutus             1               1           0        1       0        0
Caesar             1               1           0        1       1        1
Calpurnia          0               1           0        0       0        0
Cleopatra          1               0           0        0       0        0
mercy              1               0           1        1       1        1
worser             1               0           1        1       1        0
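The incidence matrix answers Boolean queries directly: to evaluate Brutus AND Caesar AND NOT Calpurnia, combine the corresponding rows bitwise. A small sketch over the rows above:

```python
# Evaluate "Brutus AND Caesar AND NOT Calpurnia" on the incidence rows:
# AND the first two bit vectors, then clear the plays where Calpurnia is 1.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
brutus    = [1, 1, 0, 1, 0, 0]
caesar    = [1, 1, 0, 1, 1, 1]
calpurnia = [0, 1, 0, 0, 0, 0]

hits = [play for play, b, c, n in zip(plays, brutus, caesar, calpurnia)
        if b and c and not n]
print(hits)  # ['Antony and Cleopatra', 'Hamlet']
```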
Inverted index (dictionary → postings):
Brutus → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar → 13 16
Postings are sorted by docID (more later on why).
Possible Document Preprocessing Steps
• Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
• Break into tokens (keywords) on whitespace.
• Possible linguistic processing (used in some applications, but dangerous for general web search; a toy sketch follows below):
§ Stemming (cards → card).
§ Remove common stopwords (e.g. a, the, it, etc.).
§ Sometimes used, but dangerous.
• Build inverted index:
§ keyword → list of docs containing it.
§ Common phrases may be detected first using a domain-specific dictionary.
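A toy version of these steps; the stopword list and the plural-stripping "stemmer" are deliberately simplistic stand-ins for real components such as the Porter stemmer:

```python
import re

STOPWORDS = {"a", "an", "the", "it", "of", "to"}  # tiny illustrative list

def preprocess(text):
    """Toy pipeline: strip markup, tokenize, drop stopwords, 'stem'."""
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # letters only: drops
                                                   # punctuation and numbers
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive plural stripping; a real system would use e.g. Porter stemming.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

print(preprocess("<b>The cards</b> of 52 kinds"))  # ['card', 'kind']
```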
Inverted index construction
• Tokenizer: text → token stream (Friends Romans Countrymen).
• Linguistic modules (more on these later): token stream → modified tokens (friend roman countryman).
• Indexer: modified tokens → inverted index:
friend → 2 4
roman → 1 2
countryman → 13 16
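A minimal indexer matching the pipeline above, assuming documents arrive already tokenized; the docIDs here are chosen to reproduce the postings shown:

```python
from collections import defaultdict

def build_index(docs):
    """docs maps docID -> token list; returns term -> sorted docID list."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    # Sort postings by docID so two lists can later be merged linearly.
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: ["roman"], 2: ["friend", "roman"], 4: ["friend"],
                     13: ["countryman"], 16: ["countryman"]})
print(index["friend"], index["roman"], index["countryman"])
# [2, 4] [1, 2] [13, 16]
```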
Discussions
• Index construction
§ Stemming?
§ Which terms in a doc do we index?
– All words or only “important” ones?
– Stopword list: terms so common that they MAY BE ignored for indexing.
§ e.g., the, a, an, of, to …
§ Language-specific.
§ May have to be included for general web search.
The merge
Intersect two postings lists, both sorted by docID:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Walk the two lists in step, always advancing the pointer at the smaller docID and emitting a docID when it appears in both lists. Result: 2, 8.
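A sketch of the merge: because both lists are sorted by docID, one linear scan over each suffices:

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists in O(len(p1) + len(p2))."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer sitting at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```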
Boolean Models - Problems
• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
§ Still too complex for general web users
• Difficult to control the number of documents
retrieved.
§ All matched documents will be returned.
• Difficult to rank output.
§ All matched documents logically satisfy the query.
• Difficult to perform relevance feedback.
§ If a document is identified by the user as relevant or
irrelevant, how should the query be modified?
Statistical Retrieval Models
• A document is typically represented by a bag of
words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the
same element.
• User specifies a set of desired terms with optional
weights:
§ Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >
§ Unweighted query terms:
Q = < database; text; information >
§ No Boolean conditions specified in the query.
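A bag of words drops order but keeps counts, so a counting map captures it; a small sketch (document text and query weights invented for illustration):

```python
from collections import Counter

# A document as a bag of words: unordered tokens with their frequencies.
doc = "the database stores text and more text about information"
bag = Counter(doc.split())
print(bag["text"], bag["database"])  # 2 1

# A weighted query as on this slide (weights given by the user, not derived):
query = {"database": 0.5, "text": 0.8, "information": 0.2}
```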
Statistical Retrieval
• Retrieval is based on the similarity between the query and documents.
• Output documents are ranked according to similarity to the query.
• Similarity is based on occurrence frequencies of keywords in the query and the document.
• Automatic relevance feedback can be supported, as sketched below:
§ Relevant documents “added” to query.
§ Irrelevant documents “subtracted” from query.
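This add/subtract loop is what the classic Rocchio algorithm does. The slides do not give a formula, so the update rule and the weights alpha, beta, gamma below are illustrative assumptions:

```python
# Rocchio-style update (hypothetical helper; alpha/beta/gamma assumed):
# nudge the query vector toward relevant documents, away from irrelevant.
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query) | {t for d in relevant + irrelevant for t in d}
    updated = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in irrelevant) / max(len(irrelevant), 1)
        updated[t] = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
    return updated

q = {"database": 0.5, "text": 0.8}
print(rocchio(q, relevant=[{"text": 1.0, "sql": 0.6}], irrelevant=[]))
# {'database': 0.5, 'text': 1.55, 'sql': 0.45} (key order may vary)
```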
The Vector-Space Model
• Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
• Each term, i, in a document or query, j, is given a real-
valued weight, wij.
• Both documents and queries are expressed as t-
dimensional vectors:
dj = (w1j, w2j, …, wtj)
      T1   T2  …  Tt
D1    w11  w21 …  wt1
D2    w12  w22 …  wt2
 :     :    :      :
Dn    w1n  w2n …  wtn
Graphic Representation
Example (documents and query as vectors in 3-D term space with axes T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Issues for Vector Space Model
• How to determine the term weights wij? (Addressed by tf-idf below.)
• How to measure the similarity between a document vector and the query vector?
Term Weights: Term Frequency
• More frequent terms in a document are more indicative of its topic.
tfij = frequency of term i in document j
• Raw counts are often normalized, e.g. by the maximum term frequency in the document.
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N / dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log is used to dampen the effect relative to tf.
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work
well.
Computing TF-IDF -- An Example
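A minimal sketch of the computation, using the tf and idf definitions above over an invented three-document corpus:

```python
import math

# Toy corpus invented for illustration; tf here is the raw count.
docs = {
    "d1": ["database", "text", "database"],
    "d2": ["text", "information"],
    "d3": ["information", "retrieval"],
}
N = len(docs)  # total number of documents

def df(term):
    """Number of documents containing the term."""
    return sum(term in tokens for tokens in docs.values())

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)      # term frequency in this document
    idf = math.log2(N / df(term))      # idfi = log2(N / dfi)
    return tf * idf

print(round(tf_idf("database", "d1"), 3))  # 2 * log2(3/1) = 3.17
print(round(tf_idf("text", "d1"), 3))      # 1 * log2(3/2) = 0.585
```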
Similarity Measure
• A similarity measure is a function that computes the degree of similarity between two vectors.
• A similarity measure between the query and each document lets us rank the results. The simplest is the inner product:
Binary:
§ D = 1, 1, 1, 0, 1, 1, 0
§ Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3
Weighted:
D1 = 2T1 + 3T2 + 5T3    D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 5 · 2 = 10    sim(D2, Q) = 1 · 2 = 2
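The binary case just counts the positions where both vectors have a 1; a quick check of sim(D, Q) = 3:

```python
# Count the positions where both binary vectors are 1.
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(sum(d & q for d, q in zip(D, Q)))  # 3
```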
Properties of Inner Product
• Unbounded: scores grow with document length and term weights.
• Favors long documents with a large number of distinct matching terms.
• Measures how many query terms are matched, but not how many are missing.
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:

CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / (√(Σi=1..t wij²) · √(Σi=1..t wiq²))

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
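A short sketch reproducing both the inner-product scores (10 vs. 2) and the cosine scores (0.81 vs. 0.13) for the example vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cos_sim(a, b):
    """Inner product normalized by the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(dot(D1, Q), dot(D2, Q))  # 10 2  (inner product favors D1 by 5x)
print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13
```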
Comments on Vector Space Models
• Simple, practical, and mathematically based
approach
• Provides partial matching and ranked results.
• Problems
§ Missing syntactic information (e.g. phrase structure,
word order, proximity information).
§ Missing semantic information:
– word sense
– Assumption of term independence; ignores synonymy.
§ Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query “A B”, it may prefer a document containing A frequently but not B over a document that contains both A and B, but each less frequently.