Week 3 - Probabilistic Retrieval and Relevance Feedback
Query Likelihood Model
Given a query q, determine the probability P(d|q) that document d is relevant to the query.

Bayes' Rule: P(d|q) = P(q|d) · P(d) / P(q)

Assumptions
• P(d), the probability of a document occurring, is uniform across the collection
• P(q) is the same for all queries
Under these assumptions, documents can therefore be ranked by P(q|d), the probability that the query was generated from the document.
Language Modeling
What is a Language Model?
A deterministic language model is an automaton (equivalently, a grammar) that defines which word sequences belong to the language.
[Figure: a finite automaton generating the phrase "information retrieval"]
Probabilistic Language Model
Unigram model: assign to each term a probability of occurring
• More complex models can be used, e.g., bigram models that condition on the previous term
Term    Model M1   Model M2
STOP    0.2        0.2
the     0.2        0.15
a       0.1        0.12
frog    0.03       0.0002
toad    0.03       0.0001
said    0.02       0.01
likes   0.015      0.01
dog     0.01       0.04
Probability to Create a Query
What is the probability P(q|M) that a query q has been generated by model M?

Under the unigram assumption the query terms are generated independently:
P(q|M) = ∏_{t ∈ q} P(t|M)
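A minimal Python sketch of this computation, using the unigram probabilities from the table above; the example query "frog said toad likes" is an assumed illustration, not taken from the slides.

# Sketch: query likelihood under two unigram models (probabilities from the table above).
M1 = {"STOP": 0.2, "the": 0.2, "a": 0.1, "frog": 0.03, "toad": 0.03,
      "said": 0.02, "likes": 0.015, "dog": 0.01}
M2 = {"STOP": 0.2, "the": 0.15, "a": 0.12, "frog": 0.0002, "toad": 0.0001,
      "said": 0.01, "likes": 0.01, "dog": 0.04}

def query_likelihood(query_terms, model):
    """P(q|M) = product of P(t|M) over the query terms (unigram assumption)."""
    p = 1.0
    for t in query_terms:
        p *= model.get(t, 0.0)      # terms unseen by the model get probability 0
    return p

q = ["frog", "said", "toad", "likes"]   # assumed example query
print(query_likelihood(q, M1))          # ≈ 2.7e-07 -> M1 is the more likely generator
print(query_likelihood(q, M2))          # ≈ 2.0e-12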
Learning the Model
Learning the model means we have to estimate the probability of each query term occurring in a document:

P(t|M_d) = tf(t,d) / L_d   (maximum-likelihood estimate, MLE)

where
• tf(t,d) is the number of occurrences of term t in document d (term frequency)
• L_d is the number of terms in the document (document length)
Using the Model
Documents are ranked by the probability that their language model generates the query:
P(q|M_d) = ∏_{t ∈ q} P(t|M_d)
Consider the document:
“Information retrieval is the task of finding the documents satisfying
the information needs of the user”
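A minimal Python sketch of the maximum-likelihood estimate for this example document; the query "information retrieval" is an assumed illustration.

# Sketch: maximum-likelihood unigram model for the example document above.
from collections import Counter

doc = ("information retrieval is the task of finding the documents "
       "satisfying the information needs of the user")
terms = doc.split()                      # 16 terms in total
tf = Counter(terms)                      # term frequencies tf(t, d)
L_d = len(terms)                         # document length L_d

def p_mle(t):
    """P(t|M_d) = tf(t, d) / L_d"""
    return tf[t] / L_d

query = ["information", "retrieval"]     # assumed example query
p_q = 1.0
for t in query:
    p_q *= p_mle(t)                      # unigram product
print(p_q)                               # (2/16) * (1/16) ≈ 0.0078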
Issues with MLE
Problem 1: if the query contains a term not occurring in the document, then P(t|M_d) = 0 and therefore P(q|M_d) = 0!
Smoothing
Idea: add a small weight for terms not occurring in a document
• the weight should be smaller than the normalized collection frequency cf(t) / L_c

where
• cf(t) = number of times term t occurs in the collection
• L_c = total number of terms in the collection

Smoothed estimate, with smoothing parameter 0 < λ < 1:
P(t|d) = λ · tf(t,d) / L_d + (1 − λ) · cf(t) / L_c
Example
Query q = "Albert Einstein"

Collection of two documents: d1 with 7 terms and d2 with 6 terms (13 terms in the collection overall); the first query term occurs only in d2, the second occurs once in each document.

Using λ = 1/2:
P(q|d1) = [½ · 0/7 + ½ · 1/13] · [½ · 1/7 + ½ · 2/13] ≈ 0.0057
P(q|d2) = [½ · 1/6 + ½ · 1/13] · [½ · 1/6 + ½ · 2/13] ≈ 0.0195

d2 is ranked higher than d1.
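A minimal Python sketch that reproduces the example above with the smoothed estimate; only the term and collection counts from the example are used.

# Sketch: smoothed query likelihood as in the formula above, reproducing the example.
LAMBDA = 0.5

def p_smoothed(tf_td, L_d, cf_t, L_c, lam=LAMBDA):
    """P(t|d) = lam * tf(t,d)/L_d + (1-lam) * cf(t)/L_c"""
    return lam * tf_td / L_d + (1 - lam) * cf_t / L_c

def query_likelihood(term_stats, L_d, L_c):
    """Product of smoothed term probabilities over the query terms."""
    p = 1.0
    for tf_td, cf_t in term_stats:
        p *= p_smoothed(tf_td, L_d, cf_t, L_c)
    return p

L_c = 13                                  # total number of terms in the collection
d1 = [(0, 1), (1, 2)]                     # (tf in d1, cf) for the two query terms; d1 has 7 terms
d2 = [(1, 1), (1, 2)]                     # (tf in d2, cf) for the two query terms; d2 has 6 terms

print(round(query_likelihood(d1, 7, L_c), 4))   # 0.0057
print(round(query_likelihood(d2, 6, L_c), 4))   # 0.0195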
Example: Comparing the Vector Space Model (VS) and Probabilistic Retrieval (PR)
Properties of Retrieval Models
1.2.6 Query Expansion
Two Methods for Extending Queries
1. Local approach: use information from the current query results (user relevance feedback)
2. Global approach: use information from the document collection (query expansion)
1.2.6.1 User Relevance Feedback
[Figure: the retrieval loop with relevance feedback: the user's information need is formulated as a query, features are extracted from the content of the information items, the user browses the results and identifies the relevant ones, and the system derives a modified query (e.g., by query term reweighting).]
Feedback from Users
The user inspects the result list and marks a subset D_r of the returned documents as relevant; the remaining inspected documents are treated as the non-relevant set D_n.
Rocchio Algorithm
Rocchio algorithm: find a query vector q_opt that optimally separates the relevant from the non-relevant documents, i.e., that maximizes
sim(q, μ(D_r)) − sim(q, μ(D_n))
where μ(D) denotes the centroid of a document set D.
Illustration of Rocchio Algorithm
[Figures: (1) using only the centroid μ(D_r) of the relevant documents as the query gives lower precision and lower recall; (2) the difference vector μ(D_r) − μ(D_n) points from the non-relevant towards the relevant documents; (3) the optimal query μ(D_r) + [μ(D_r) − μ(D_n)] shifts the relevant centroid further away from the non-relevant documents.]
Identifying Relevant Documents
Following the previous reasoning, the optimal query is

q_opt = μ(D_r) + [μ(D_r) − μ(D_n)]
      = (1/|D_r|) Σ_{d_j ∈ D_r} d_j + [ (1/|D_r|) Σ_{d_j ∈ D_r} d_j − (1/|D_n|) Σ_{d_j ∈ D_n} d_j ]

Practical issues
• User relevance feedback is not complete
• Users do not necessarily identify non-relevant documents
• The original query should continue to be considered
SMART: Practical Relevance Feedback
Approximation scheme for the theoretically optimal query vector:

q' = α·q + β·μ(D_r) − γ·μ(D_n)
   = α·q + (β/|D_r|) Σ_{d_j ∈ D_r} d_j − (γ/|D_n|) Σ_{d_j ∈ D_n} d_j

where q is the original query and α, β, γ ≥ 0 control the relative weight of the original query, the relevant documents D_r, and the non-relevant documents D_n; negative term weights in q' are set to 0.
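A minimal Python sketch of this reformulation; document and query vectors are plain term-to-weight dicts, and the α=1.0, β=0.75, γ=0.15 defaults and the toy vectors are assumptions for illustration, not prescribed by the slides.

# Sketch of the SMART/Rocchio query reformulation above.
from collections import defaultdict

def centroid(doc_vectors):
    """Component-wise mean of a list of document vectors."""
    acc = defaultdict(float)
    for d in doc_vectors:
        for term, w in d.items():
            acc[term] += w
    return {t: w / len(doc_vectors) for t, w in acc.items()} if doc_vectors else {}

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(D_r) - gamma*centroid(D_n), negative weights dropped."""
    mu_r, mu_n = centroid(relevant), centroid(non_relevant)
    terms = set(query) | set(mu_r) | set(mu_n)
    q_new = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * mu_r.get(t, 0.0)
             - gamma * mu_n.get(t, 0.0))
        if w > 0:
            q_new[t] = w          # keep only positive term weights
    return q_new

# Hypothetical usage with toy term-weight vectors
q = {"application": 1.0, "theory": 1.0}
D_r = [{"application": 0.5, "convolution": 0.8}, {"theory": 0.6, "algorithm": 0.7}]
D_n = [{"oscillation": 0.9, "delay": 0.8}]
print(rocchio(q, D_r, D_n))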
Example
Query q = "application theory"
Results (similarity, document ID, title):
0.77: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
0.68: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
0.23: B11 Oscillation Theory for Neutral Differential Equations with Delay
0.23: B12 Oscillation Theory of Delay Differential Equations
Query reformulation
Discussion
Underlying assumptions of the SMART algorithm
1. The original query contains a sufficient number of relevant terms
2. The results contain new relevant terms that co-occur with the original query terms
3. The relevant documents form a single cluster
4. Users are willing to provide feedback (!)
All of these assumptions can be violated in practice

Practical considerations
• Modified queries are long and complex → expensive query processing
• Explicit relevance feedback consumes user time that could be spent in other ways
Can a document that does not contain any keyword of the original query receive a positive similarity coefficient after relevance feedback?
1. No
2. Yes, independently of the values of β and γ
3. Yes, but only if β > 0
4. Yes, but only if γ > 0
In which year did Rocchio publish his work on relevance feedback?
A. 1965
B. 1975
C. 1985
D. 1995
Pseudo-Relevance Feedback
If users do not give explicit feedback, automate the process (see the sketch below):
– Assume the top-k documents of the result are the relevant ones
– Extend the query by selecting from these top-k documents the highest-weighted terms, according to some weighting scheme
– Alternatively: apply the SMART algorithm, treating the top-k documents as D_r
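A minimal Python sketch of pseudo-relevance feedback along these steps; the tf-idf-style term weighting and the parameters k and n_terms are assumptions, not from the slides.

# Sketch: take the top-k results as (pseudo-)relevant and add their
# highest-weighted terms to the query.
import math
from collections import Counter

def top_terms(docs, collection_size, df, n_terms=5):
    """Score terms of the pseudo-relevant docs by summed tf-idf, return the best ones."""
    scores = Counter()
    for doc in docs:
        tf = Counter(doc.split())
        for t, f in tf.items():
            idf = math.log(collection_size / (1 + df.get(t, 0)))
            scores[t] += f * idf
    return [t for t, _ in scores.most_common(n_terms)]

def pseudo_relevance_feedback(query_terms, ranked_docs, collection_size, df, k=3):
    pseudo_relevant = ranked_docs[:k]              # assume the top-k are relevant
    expansion = top_terms(pseudo_relevant, collection_size, df)
    return list(dict.fromkeys(query_terms + expansion))   # original query first, no duplicates

# Hypothetical usage: ranked_docs is the result list of the initial query,
# df maps each term to its document frequency in the collection.
ranked_docs = [
    "pseudo relevance feedback improves retrieval effectiveness",
    "query expansion using relevance feedback",
    "language models applied to information retrieval",
    "pasta cooking recipes",
]
df = {"relevance": 2, "feedback": 2, "retrieval": 2, "pseudo": 1}
print(pseudo_relevance_feedback(["relevance", "feedback"], ranked_docs,
                                collection_size=1000, df=df))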
Weighting Schemes
1.2.6.2 Global Query Expansion
Query is expanded using a global, query-independent resource
• Manually edited thesaurus
• Automatically extracted thesaurus
• Query logs
Manually Created Thesaurus
Expensive to create and maintain
• Used mainly in science and engineering
Example: PubMed
https://www.ncbi.nlm.nih.gov/pubmed/?term=cancer
Automatic Thesaurus Generation
Generate a thesaurus automatically by analyzing the distribution of words in documents
– Problem: find words with similar meaning (synonyms)

Approach 1: Two words are similar if they co-occur with similar words (see the sketch below)
"switzerland" ≈ "austria" because both occur with words such as "national", "election", "soccer", etc.

Approach 2: Two words are similar if they occur in the same text patterns
"live in *", "travel to *", "size of *" are all patterns in which both "switzerland" and "austria" can occur
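A minimal Python sketch of Approach 1, representing each word by its co-occurrence counts and comparing words with cosine similarity; the window size and the toy corpus are assumptions for illustration.

# Sketch: co-occurrence vectors within a context window + cosine similarity.
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(documents, window=2):
    """Map each word to a Counter of words appearing within +/- `window` positions."""
    vectors = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[w][tokens[j]] += 1
    return vectors

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "the national election in switzerland",
    "the national election in austria",
    "soccer fans in switzerland",
    "soccer fans in austria",
]
v = cooccurrence_vectors(docs)
print(cosine(v["switzerland"], v["austria"]))   # high value: the two words share contexts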
Expansion using Query Logs
Query logs are an important resource for query expansion in web search engines
• Exploit correlations between queries issued within the same user session (see the sketch below)
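A minimal Python sketch of this idea; the log format (session id, query string) and the example entries are assumptions for illustration.

# Sketch: mine a query log for expansion candidates by counting which queries
# co-occur in the same user session.
from collections import Counter, defaultdict

def session_cooccurrence(log):
    """For each query, count the other queries issued in the same sessions."""
    by_session = defaultdict(list)
    for session_id, query in log:
        by_session[session_id].append(query)
    related = defaultdict(Counter)
    for queries in by_session.values():
        for q in queries:
            for other in queries:
                if other != q:
                    related[q][other] += 1
    return related

log = [
    (1, "cheap flights"), (1, "low cost airlines"),
    (2, "cheap flights"), (2, "budget airlines"),
    (3, "cheap flights"), (3, "low cost airlines"),
]
related = session_cooccurrence(log)
print(related["cheap flights"].most_common(1))   # [('low cost airlines', 2)]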
References
Papers
– Rocchio, J. J. (1971). Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing (pp. 313–323).
– Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of SIGIR '98 (pp. 275–281).
– Yoo, S., & Choi, J. (2011). Evaluation of term ranking algorithms for pseudo-relevance feedback in MEDLINE retrieval. Healthcare Informatics Research, 17(2), 120–130.