Lecture 5 - Scoring, Term Weighting, Vector Space Model - Part 1
Vasily Sidorov
Recap of last lecture
▪ Collection and vocabulary statistics: Heaps’ and Zipf’s laws
▪ Dictionary compression for Boolean indexes
▪ Dictionary string, blocks, front coding
▪ Postings compression: Gap encoding, prefix-unique codes
▪ Variable-Byte and Gamma codes
Index size comparison:
                                           size (MB)
collection (text, XML markup, etc.)          3,600.0
collection (text)                              960.0
term-document incidence matrix              40,000.0
postings, uncompressed (32-bit words)          400.0
postings, uncompressed (20 bits)               250.0
postings, variable-byte encoded                116.0
postings, γ-encoded                            101.0
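To recall where the variable-byte number in the table comes from, here is a minimal sketch of gap encoding plus variable-byte (VB) compression of a postings list, as covered last lecture. The function names and the example postings list are illustrative.

```python
def gaps(postings):
    """Convert a sorted docID list to gaps (first entry kept as-is)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vb_encode_number(n):
    """Encode one integer as variable-byte: 7 payload bits per byte,
    continuation signalled by leaving the high bit clear; the high bit
    is set on the final (least significant) byte."""
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128  # mark the last byte of this number
    return bytes(bytes_)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

postings = [824, 829, 215406]
encoded = vb_encode(gaps(postings))   # gaps: [824, 5, 214577]
print(len(encoded))                   # 6 bytes vs 12 with 32-bit words
```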
Agenda
▪ Ranked retrieval
▪ Scoring documents
▪ Term frequency
▪ Collection statistics
▪ Weighting schemes
▪ Vector space scoring
Ranked retrieval
▪ Thus far, our queries have all been Boolean
▪ Documents either match or don’t
▪ Good for expert users with precise understanding of
their needs and the collection
▪ Also good for applications: programs can easily consume
1000s of results
▪ Not good for the majority of users
▪ Most users incapable of writing Boolean queries (or they
are, but it’s too much work)
▪ Most users don’t want to go through 1000s of results
▪ This is particularly true of web search
Ch. 6
Ranked retrieval models
▪ Rather than a set of documents satisfying a query
expression, in ranked retrieval, the system returns an
ordering over the (top) documents in the collection
for a query
▪ Free text queries: Rather than a query language of
operators and expressions, the user’s query is just
one or more words in a human language
▪ In principle, these are two separate choices, but in
practice, ranked retrieval has normally been
associated with free text queries and vice versa
Sec. 6.2
Term counts (tf) in six Shakespeare plays:

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony         157        73      0         0       0        0
Brutus           4       157      0         1       0        0
Caesar         232       227      0         2       1        1
Calpurnia        0        10      0         0       0        0
Cleopatra       57         0      0         0       0        0
mercy            2         0      3         5       5        1
worser           2         0      1         1       1        0
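A count matrix like the one above can be built from raw text with a few lines of Python. The toy documents below are stand-ins, not the actual play texts:

```python
from collections import Counter

# Illustrative mini-collection (document names and contents are made up)
docs = {
    "Doc1": "brutus caesar brutus kill",
    "Doc2": "caesar caesar rome",
}

# Vocabulary = union of all terms; rows of the matrix are terms
vocab = sorted({t for text in docs.values() for t in text.split()})
counts = {name: Counter(text.split()) for name, text in docs.items()}
matrix = {t: [counts[d][t] for d in docs] for t in vocab}

print(matrix["brutus"])   # [2, 0]
```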
Bag of words model
▪ Vector representation doesn’t consider the ordering
of words in a document
▪ John is quicker than Mary and Mary is quicker than
John have the same vectors
▪ This is called the bag of words model.
▪ In a sense, this is a step back: The positional index
was able to distinguish these two documents
▪ We will look at “recovering” positional information
later in this course
▪ For now: bag of words model
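The John/Mary example can be checked directly: once word order is discarded, the two sentences produce identical representations.

```python
from collections import Counter

s1 = "john is quicker than mary"
s2 = "mary is quicker than john"

# Bag of words: only term counts matter, not positions
print(Counter(s1.split()) == Counter(s2.split()))  # True
```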
Term frequency tf
▪ The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d
▪ We want to use tf when computing query-document
match scores. But how?
▪ Raw term frequency is not what we want:
▪ A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term
▪ But not 10 times more relevant
▪ Relevance does not increase proportionally with
term frequency
NB: in IR, “frequency” just means raw count
Log-frequency weighting
▪ The log-frequency weight of term t in d:
wt,d = 1 + log₁₀(tft,d) if tft,d > 0, and 0 otherwise
▪ So: tf 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, …
▪ Score for a document-query pair: sum over terms t appearing in both q and d:
score(q, d) = Σ t∈q∩d (1 + log₁₀ tft,d)
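As a quick sketch, the log-frequency weight is easy to compute and shows the intended dampening (the function name is illustrative):

```python
import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

print(log_tf_weight(0))     # 0
print(log_tf_weight(1))     # 1.0
print(log_tf_weight(1000))  # 4.0 -- 1000x the occurrences, only 4x the weight
```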
Sec. 6.2.1
Document frequency
▪ Rare terms are more informative than frequent terms
▪ Recall stop words
▪ Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
▪ A document containing this term is very likely to be
relevant to the query arachnocentric
▪ → We want a high weight for rare terms like
arachnocentric.
idf weight
▪ dft is the document frequency of t: the number of
documents that contain t
▪ dft is an inverse measure of the informativeness of t
▪ dft ≤ N
▪ We define the idf (inverse document frequency) of t
by
idft = log₁₀(N / dft)
▪ We use log (N/dft) instead of N/dft to “dampen” the effect
of idf.
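To see the dampening at work, here is a small sketch computing idf for a range of document frequencies. The collection size N = 1,000,000 is an illustrative value:

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

N = 1_000_000  # assumed collection size, for illustration
for df in (1, 100, 10_000, 1_000_000):
    # idf drops from 6.0 (a term in one doc) to 0.0 (a term in every doc)
    print(df, idf(N, df))
```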
tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf
weight and its idf weight.
wt,d = log₁₀(1 + tft,d) × log₁₀(N / dft)
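A sketch of this weighting, using the log(1 + tf) variant shown above; the numbers in the example are illustrative:

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight with the log10(1 + tf) variant of the tf component."""
    return math.log10(1 + tf) * math.log10(N / df)

# e.g. tf = 9 occurrences, df = 10 docs, N = 1000 docs (made-up numbers):
print(tf_idf(9, 10, 1000))  # log10(10) * log10(100) = 1 * 2 = 2.0
```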
Sec. 6.3
Documents as vectors
▪ So we have a |V|-dimensional vector space
▪ Terms are axes of the space
▪ Documents are points or vectors in this space
▪ Very high-dimensional: tens of millions of
dimensions when you apply this to a web search
engine
▪ These are very sparse vectors - most entries are zero.
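Because most entries are zero, these vectors are not stored as dense |V|-length arrays in practice; a sparse {term: weight} mapping is a natural representation. The terms and weights below are made up for illustration:

```python
# Sparse document vector: only nonzero entries are stored
doc_vector = {"antony": 5.25, "brutus": 1.21, "caesar": 8.59}

# A term absent from the mapping implicitly has weight 0
print(doc_vector.get("mercy", 0.0))  # 0.0
```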
Queries as vectors
▪ Key idea 1: Do the same for queries: represent them
as vectors in the space
▪ Key idea 2: Rank documents according to their
proximity to the query in this space
▪ proximity = similarity of vectors
▪ proximity ≈ inverse of distance
▪ Recall: We do this because we want to get away from
the you’re-either-in-or-out Boolean model.
▪ Instead: rank more relevant documents higher than
less relevant documents
Length normalization
▪ A vector can be (length-)normalized by dividing each
of its components by its length – for this we use the
L2 norm:
‖x‖₂ = √( Σᵢ xᵢ² )
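A minimal sketch of L2 normalization (the function name is illustrative):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm (length)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8] -- a unit vector
```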
cosine(query, document)
cos(q, d) = (q · d) / (‖q‖ ‖d‖)           (dot product divided by vector lengths)
          = (q/‖q‖) · (d/‖d‖)             (dot product of unit vectors)
          = Σᵢ₌₁^|V| qᵢdᵢ / ( √(Σᵢ₌₁^|V| qᵢ²) · √(Σᵢ₌₁^|V| dᵢ²) )

For length-normalized q, d this simplifies to:
cos(q, d) = q · d = Σᵢ₌₁^|V| qᵢdᵢ
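The full (non-normalized) form of the formula maps directly to code; a minimal sketch over dense vectors:

```python
import math

def cosine(q, d):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    qn = math.sqrt(sum(x * x for x in q))
    dn = math.sqrt(sum(x * x for x in d))
    return dot / (qn * dn)

# Two toy vectors sharing one of two nonzero dimensions:
print(cosine([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))  # ≈ 0.707
```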
Cosine similarity illustrated
Sec. 6.4
Summary: tf-idf weighting
▪ The tf-idf weight of a term is the product of its tf
weight and its idf weight.
wt,d = (1 + log₁₀ tft,d) × log₁₀(N / dft)
▪ Best known weighting scheme in information retrieval
▪ Increases with the number of occurrences within a
document
▪ Increases with the rarity of the term in the collection
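Putting the pieces of this lecture together, here is a minimal end-to-end sketch: tf-idf weights with the (1 + log₁₀ tf) × log₁₀(N/df) scheme above, cosine similarity for ranking. The documents and query are toy examples:

```python
import math
from collections import Counter

# Toy collection (contents are illustrative)
docs = {
    "d1": "caesar was ambitious",
    "d2": "brutus killed caesar caesar",
    "d3": "the noble brutus",
}
N = len(docs)
tfs = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for tf in tfs.values() for t in tf)  # document frequencies

def weight(tf_count, term):
    """tf-idf weight: (1 + log10 tf) * log10(N / df), 0 if tf is 0."""
    if tf_count == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log10(tf_count)) * math.log10(N / df[term])

def score(query, d):
    """Cosine of the query and document tf-idf vectors."""
    q = Counter(query.split())
    vocab = set(q) | set(tfs[d])
    qv = [weight(q[t], t) for t in vocab]
    dv = [weight(tfs[d][t], t) for t in vocab]
    dot = sum(a * b for a, b in zip(qv, dv))
    qn = math.sqrt(sum(a * a for a in qv))
    dn = math.sqrt(sum(b * b for b in dv))
    return dot / (qn * dn) if qn and dn else 0.0

# d2 contains both query terms, so it should rank first
ranked = sorted(docs, key=lambda d: score("brutus caesar", d), reverse=True)
print(ranked[0])  # d2
```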
Summary: cosine(query, document)
cos(q, d) = (q · d) / (‖q‖ ‖d‖)           (dot product divided by vector lengths)
          = (q/‖q‖) · (d/‖d‖)             (dot product of unit vectors)
          = Σᵢ₌₁^|V| qᵢdᵢ / ( √(Σᵢ₌₁^|V| qᵢ²) · √(Σᵢ₌₁^|V| dᵢ²) )