Chapter 4
IR Models

Outline
• Introduction to IR Models

Word evidence: Bag of Words
• IR systems usually adopt index terms to index and retrieve documents
• Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
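To make the bag-of-words representation concrete, here is a minimal Python sketch (the function name and whitespace tokenization are illustrative assumptions, not part of the chapter):

from collections import Counter

def bag_of_words(text):
    """Reduce a document to its index terms and their raw counts,
    discarding word order (an illustrative sketch)."""
    return Counter(text.lower().split())

print(bag_of_words("Shipment of gold damaged in a fire"))
# Counter({'shipment': 1, 'of': 1, 'gold': 1, 'damaged': 1, 'in': 1, 'a': 1, 'fire': 1})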
Mapping Documents & Queries
• Represent both documents and queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection or query
• An entry in the matrix corresponds to the "weight" of a term in the document:
  d_j = (t_{1,j}, t_{2,j}, ..., t_{N,j});  q_k = (t_{1,k}, t_{2,k}, ..., t_{N,k})
  – The document collection is mapped to a term-by-document matrix
  – The documents are viewed as vectors in multidimensional space
• "Nearby" vectors are related
  – Normalize the weights for vector length, as usual, to avoid the effect of document length

       T1   T2   ...  TN
  D1   w11  w12  ...  w1N
  D2   w21  w22  ...  w2N
   :    :    :         :
  DM   wM1  wM2  ...  wMN

Weighting Terms in Vector Space
• The importance of the index terms is represented by weights associated with them
• Problem: what weight can we assign to show the importance of an index term for describing the document/query contents?
• Solution 1: Binary weights: t = 1 if the term is present, 0 otherwise
  – Similarity: number of terms in common
• Problem: not all terms are equally interesting
  – E.g. "the" vs. "dog" vs. "cat"
• Solution: replace binary weights with non-binary weights:
  d_j = (w_{1,j}, w_{2,j}, ..., w_{N,j});  q_k = (w_{1,k}, w_{2,k}, ..., w_{N,k})
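As a rough sketch of the mapping above, the following Python builds a binary term-by-document matrix for a small collection (function and variable names are assumptions for illustration):

def term_document_matrix(docs):
    """Map a document collection to a binary term-by-document matrix."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    matrix = [[1 if term in tokens else 0 for term in vocab] for tokens in tokenized]
    return vocab, matrix

docs = ["Shipment of gold damaged in a fire",
        "Delivery of silver arrived in a silver truck",
        "Shipment of gold arrived in a truck"]
vocab, M = term_document_matrix(docs)
print(vocab)          # term order of the columns
for row in M:         # one row vector per document
    print(row)

Replacing the 1s with the non-binary weights introduced later in this chapter turns the same structure into the weighted matrix used by the vector-space model.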
The Boolean Model: Example
• Given the following, determine the documents retrieved by a Boolean model based IR system
• Index terms: K1, ..., K8
• Documents:
  1. D1 = {K1, K2, K3, K4, K5}
  2. D2 = {K1, K2, K3, K4}
  3. D3 = {K2, K4, K6, K8}
  4. D4 = {K1, K3, K5, K7}
  5. D5 = {K4, K5, K6, K7, K8}
  6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}

The Boolean Model: Further Example
• Given the following three documents, construct the term-document matrix and find the relevant documents retrieved by the Boolean model for the given query
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
• Query: gold ∧ (silver ∨ truck)

The table below shows the document-term (t_i) incidence matrix:

         arrive  damage  deliver  fire  gold  silver  ship  truck
  D1       0       1       0       1     1      0      1     0
  D2       1       0       1       0     0      1      0     1
  D3       1       0       0       0     1      0      1     1
  query    0       0       0       0     1      1      0     1

• Also find the relevant documents for the queries (see the sketch after this slide):
  (a) gold ∧ delivery
  (b) ship ∧ ¬gold
  (c) silver ∧ truck
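A minimal Python sketch of Boolean retrieval over the first example, treating documents as sets of index terms (the helper name `having` is an assumption for illustration):

docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def having(term):
    """Set of document ids whose index terms include the given term."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)
# Query: K1 AND (K2 OR NOT K3) -> AND is intersection, OR is union, NOT is complement
answer = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6'], matching the answer above

The same three set operations answer queries (a)-(c) of the further example.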
Exercise
Given the following three documents, each listed with four terms (Doc No, Term 1, Term 2, Term 3, Term 4), what are the relevant documents retrieved for the query:
((chaucer OR milton) AND (swift OR shakespeare))
Drawbacks of the Boolean Model
• Retrieval is based on a binary decision criterion, with no notion of partial matching
• No ranking of the documents is provided (absence of a grading scale)
• The information need has to be translated into a Boolean expression, which most users find awkward
• The Boolean queries formulated by users are most often too simplistic
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
  – Just changing a Boolean operator from "AND" to "OR" changes the result from an intersection to a union

Vector-Space Model (VSM)
• This is the most commonly used strategy for measuring the relevance of documents to a given query. This is because:
  – Use of binary weights is too limiting, while non-binary weights provide consideration for partial matches
  – These term weights are used to compute a degree of similarity between a query and each document
  – A ranked set of documents provides better matching
• The idea behind VSM is that the meaning of a document is conveyed by the words used in that document
Computing Weights
• How do we compute the weights w_ij and w_iq for term i in document j and in query q?
• A good weight must take into account two effects:
  – Quantification of intra-document content (similarity)
    • The tf factor: the term frequency within a document
  – Quantification of inter-document separation (dissimilarity)
    • The idf factor: the inverse document frequency
• As a result, most IR systems use the tf*idf weighting technique:
  w_ij = tf(i,j) * idf(i)

Computing Weights (cont.)
Let
  N be the total number of documents in the collection,
  n_i be the number of documents which contain k_i,
  freq(i,j) be the raw frequency of k_i within d_j.
A normalized tf factor is given by
  f(i,j) = freq(i,j) / max_l freq(l,j)
where the maximum is computed over all terms which occur within the document d_j.
The idf factor is computed as
  idf(i) = log(N / n_i)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.
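A short Python sketch of these two factors (the base-10 logarithm is an assumption here; it matches the worked example later in this chapter):

import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    """w_ij = f(i,j) * idf(i): normalized tf times inverse document frequency."""
    f = freq_ij / max_freq_j        # normalized tf factor
    idf = math.log10(N / n_i)       # idf factor; 0 for a term in every document
    return f * idf

# A term occurring twice in a document whose most frequent term occurs 3 times,
# and appearing in 1 of 3 documents:
print(tf_idf(freq_ij=2, max_freq_j=3, N=3, n_i=1))  # ≈ 0.3181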
Similarity Measure
[Figure: a document vector d_j and a query vector q in term space, separated by angle θ]
• Sim(q, d_j) = cos(θ):

  sim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{n} w_{i,j} \, q_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} q_{i,k}^2}}

• Since w_ij ≥ 0 and w_iq ≥ 0, 0 ≤ sim(q, d_j) ≤ 1
• A document is retrieved even if it matches the query terms only partially

Vector-Space Model: Example
Suppose we issue the query Q: "gold silver truck". The collection consists of the following three documents:
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
Assume that all terms are used, including common terms and stop words, and that no terms are reduced to root terms (no stemming).
Show the retrieval results in ranked order.
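A direct Python transcription of the cosine formula above (a sketch; both vectors are assumed to share the same term order):

import math

def cosine_sim(d, q):
    """cos(theta) between a document weight vector and a query weight vector."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

print(cosine_sim([1, 1, 0], [1, 0, 1]))  # 0.5: a partial match still scores > 0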
Vector-Space Model: Example (cont.)
• Compute similarity using cosine Sim(q, d)
• First, for each document and the query, compute all vector lengths (zero terms ignored). The weights are tf * idf with idf = log10(N/n_i); "silver" occurs twice in D2, hence its weight 0.954 = 2 * 0.477:

  |d1| = √(0.477² + 0.477² + 0.176² + 0.176²) = √0.5170 = 0.719
  |d2| = √(0.176² + 0.477² + 0.954² + 0.176²) = √1.2001 = 1.095
  |d3| = √(0.176² + 0.176² + 0.176² + 0.176²) = √0.1240 = 0.352
  |q|  = √(0.176² + 0.477² + 0.176²) = √0.2896 = 0.538

• Next, compute the dot products (zero products ignored):
  Q · d1 = 0.176*0.176 = 0.0310
  Q · d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
  Q · d3 = 0.176*0.176 + 0.176*0.176 = 0.0620

• Now, compute the similarity scores:
  Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
  Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
  Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271

• Finally, we sort and rank the documents in descending order of similarity score:
  Rank 1: Doc 2 = 0.8246
  Rank 2: Doc 3 = 0.3271
  Rank 3: Doc 1 = 0.0801

• Exercise: using normalized tf, rank the documents with the cosine similarity measure. Hint: normalize the tf of term i in document j by the maximum frequency of any term k in document j.
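The whole computation can be reproduced with a short Python script (a sketch under the same assumptions as above: raw tf times base-10 idf, no stop-word removal, no stemming; all names are illustrative):

import math
from collections import Counter

docs = ["Shipment of gold damaged in a fire",
        "Delivery of silver arrived in a silver truck",
        "Shipment of gold arrived in a truck"]
query = "gold silver truck"

tokenized = [d.lower().split() for d in docs]
vocab = sorted({t for tokens in tokenized for t in tokens})
N = len(docs)

def idf(term):
    n_i = sum(1 for tokens in tokenized if term in tokens)  # document frequency
    return math.log10(N / n_i)  # 0 for terms like "a", "in", "of" (in every doc)

def vectorize(tokens):
    tf = Counter(tokens)
    return [tf[t] * idf(t) for t in vocab]  # raw tf * idf, as in the slides

def cosine(d, q):
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

q_vec = vectorize(query.lower().split())
for name, tokens in zip(["D1", "D2", "D3"], tokenized):
    print(name, round(cosine(vectorize(tokens), q_vec), 4))
# D1 0.0801, D2 0.8247, D3 0.3272 -- matching the hand computation up to rounding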
Exercises
• Given the following documents, rank them according to their relevance to the query using the cosine similarity, Euclidean distance, and inner product measures (a sketch of the three measures follows the table):

  docID   words in document
  1       Taipei Taiwan
  2       Macao Taiwan Shanghai
  3       Japan Sapporo
  4       Sapporo Osaka Taiwan

  Query: Taiwan Sapporo
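A minimal sketch of the three measures, using raw term frequencies as weights (the exercise does not fix a weighting scheme, so that choice is an assumption):

import math

docs = {1: "Taipei Taiwan", 2: "Macao Taiwan Shanghai",
        3: "Japan Sapporo", 4: "Sapporo Osaka Taiwan"}
query = "Taiwan Sapporo"

vocab = sorted({t for text in [*docs.values(), query] for t in text.lower().split()})

def tf_vector(text):
    tokens = text.lower().split()
    return [tokens.count(t) for t in vocab]  # raw term frequency as weight

def cosine(d, q):
    dot = sum(x * y for x, y in zip(d, q))
    return dot / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q)))

def euclidean(d, q):  # a distance: smaller means more relevant
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(d, q)))

def inner_product(d, q):
    return sum(x * y for x, y in zip(d, q))

q_vec = tf_vector(query)
for doc_id, text in docs.items():
    d = tf_vector(text)
    print(doc_id, round(cosine(d, q_vec), 3),
          round(euclidean(d, q_vec), 3), inner_product(d, q_vec))

Under this raw-tf weighting, document 4 scores best on all three measures, since it is the only document containing both query terms.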
Thank you