Week 3 - Probabilistic Retrieval and Relevance Feedback

The document discusses probabilistic information retrieval models, focusing on the Query Likelihood Model and Language Modeling to estimate the relevance of documents to user queries. It highlights the importance of smoothing techniques to handle unseen terms and the role of user relevance feedback in improving query results. Additionally, it covers methods for query expansion, including local and global approaches, and the use of thesauri and query logs for enhancing search effectiveness.

1.2.5 Probabilistic Information Retrieval

The notion of similarity in the vector space model has no explanation of how it relates to relevance
• The similarity values are only used to rank documents

An information retrieval model deals with uncertainty about the user's information needs
• Probability theory provides a principled approach to reasoning about this uncertainty

Probabilistic IR models attempt to provide an explainable model of relevance
Query Likelihood Model

Given query q, determine the probability P(d|q) that document d is relevant to query q

Bayes Rule
P(d|q) = P(q|d) · P(d) / P(q)

Assumptions
• P(d), the probability of a document occurring, is uniform across the collection
• P(q) is the same for all queries

Thus: P(d|q) can be derived from P(q|d), the query likelihood
Language Modeling

Query likelihood: determine P(q|d)

Assume each document d is generated by a language model Md
• a language model is a mechanism that generates the words of the language

Then P(q|Md) can be interpreted as the probability that the query was generated by the language model of document d
What is a Language Model?

Deterministic language model = automaton = grammar

[Figure: an automaton that emits "information" followed by "retrieval", with a loop back to the start]

This model can produce:
"information retrieval"
"information retrieval information retrieval"
It cannot produce:
"retrieval information"
Probabilistic Language Model

Unigram model: assign a probability to each term to appear
• More complex models can be used, e.g., bigrams

Model M1           Model M2
STOP    0.2        STOP    0.2
the     0.2        the     0.15
a       0.1        a       0.12
frog    0.03       frog    0.0002
toad    0.03       toad    0.0001
said    0.02       said    0.01
likes   0.015      likes   0.01
dog     0.01       dog     0.04

Two different language models derived from 2 documents
Probability to Create a Query

What is the probability that a query q has been generated by model Mi?

Example: q = "the frog said dog STOP"

P(q|M1) = 0.2 * 0.03 * 0.02 * 0.01 * 0.2 = 0.000 000 24
P(q|M2) = 0.15 * 0.0002 * 0.01 * 0.04 * 0.2 = 0.000 000 0024

Retrieval becomes the problem of computing, for a query q, the probability P(q|Md) for all documents d
Learning the Model

Learning the model means we have to estimate the probability of a query to occur

First step: estimate how likely a single term occurs

Maximum Likelihood Estimation (MLE) of probabilities under the Unigram Model

P(t|Md) = tf(t,d) / |d|

where
• tf(t,d) is the number of occurrences of t in d (term frequency)
• |d| is the number of terms in the document (document length)
Using the Model

Independence assumption: different terms in a query are assumed to occur independently

P(q|Md) = ∏ t∈q P(t|Md)
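To make this concrete, here is a minimal Python sketch (the function names and toy document are my own, not from the slides) of the MLE unigram model combined with the independence assumption:

```python
from collections import Counter

def mle_model(doc_tokens):
    """MLE unigram model: P(t|Md) = tf(t,d) / |d|."""
    tf = Counter(doc_tokens)
    return {t: c / len(doc_tokens) for t, c in tf.items()}

def query_likelihood(query_tokens, model):
    """P(q|Md) = product of P(t|Md) over the query terms (independence)."""
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)  # unseen term -> factor 0
    return p

md = mle_model("information retrieval and search".split())
print(query_likelihood(["information", "search"], md))    # 1/4 * 1/4 = 0.0625
print(query_likelihood(["information", "database"], md))  # 0.0: unseen term
```

The second call already shows the zero-probability problem discussed below: a single unseen query term drives the whole product to zero.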
Consider the document:
"Information retrieval is the task of finding the documents satisfying the information needs of the user"

Using MLE to estimate the unigram probability model, what is P(the|Md) and P(information|Md)?

1. 1/16 and 1/16
2. 1/12 and 1/12
3. 1/4 and 1/8
4. 1/3 and 1/6
Consider the following document

d = "information retrieval and search"

1. P(information search | Md) > P(information | Md)
2. P(information search | Md) = P(information | Md)
3. P(information search | Md) < P(information | Md)
Issues with MLE

Problem 1: if the query contains a term not occurring in the document, then P(q|Md) = 0!

Problem 2: this is an estimation! A term that occurs once might have been "lucky", whereas another term with the same probability to occur is not contained in the document

→ need to give non-zero probability to unseen terms!
Smoothing

Idea: add a small weight for terms not occurring in a document
• the weight should be smaller than the normalized collection frequency

P(t|Mc) = cf(t) / |c|

where
• cf(t) = number of times term t occurs in the collection
• |c| = total number of terms in the collection

Smoothed estimate
P̂(t|d) = λ · tf(t,d)/|d| + (1−λ) · cf(t)/|c|

Mc = language model of the whole collection
λ = tuning parameter
Probabilistic Retrieval

With smoothing the relevance is computed as

P(q|d) = ∏ t∈q ( λ · tf(t,d)/|d| + (1−λ) · cf(t)/|c| )

From a technical perspective the probabilities are computed using term and document frequencies
– the same data is used as in vector space retrieval

Probabilistically motivated models generally show better performance
– But parameter tuning (λ) is critical
– λ can be query-dependent, e.g., depend on the query size
Example

Collection consisting of d1 and d2
d1: Einstein was one of the greatest scientists
d2: Albert Einstein received the Nobel prize

Query q: Albert Einstein

Using λ = 1/2:
P(q|d1) = ½ * (0/7 + 1/13) * ½ * (1/7 + 2/13) ≈ 0.0057
P(q|d2) = ½ * (1/6 + 1/13) * ½ * (1/6 + 2/13) ≈ 0.0195
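As a sanity check, a minimal Python sketch (tokenization and names are my own assumptions) that reproduces the two numbers above with the linear smoothing of the previous slides:

```python
from collections import Counter

def smoothed_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) = product over query terms t of
    lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|c|."""
    tf = Counter(doc)
    cf = Counter(collection)
    p = 1.0
    for t in query:
        p *= lam * tf[t] / len(doc) + (1 - lam) * cf[t] / len(collection)
    return p

d1 = "einstein was one of the greatest scientists".split()
d2 = "albert einstein received the nobel prize".split()
collection = d1 + d2                     # |c| = 13 terms
q = "albert einstein".split()

print(smoothed_likelihood(q, d1, collection))  # ~0.0057
print(smoothed_likelihood(q, d2, collection))  # ~0.0195
```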
Example: Comparing VS and PR

[Figure: precision-recall comparison of vector space and probabilistic retrieval, from Ponte & Croft, 1998]
Properties of Retrieval Models

                        Vector Space Model      Language Model          BM25 (another prob. model)
Model                   geometric               probabilistic           probabilistic
Length normalization    Requires extensions     Inherent to model       Tuning parameters
                        (pivot normalization)
Inverse document        Used directly           Smoothing and           Used directly
frequency                                       collection frequency
                                                has similar effect
Multiple term           Taken into account      Taken into account      Ignored
occurrences
Simplicity              No tuning required      Tuning essential        Tuning essential
1.2.6 Query Expansion

If the user query does not contain any relevant term, a corresponding relevant document will not show up in the result

Example: query "car" will not return "automobile"

How to add such documents (increase recall)?

Idea: the system adds query terms to the user query!
Two Methods for Extending Queries

1. Local Approach:
• Use information from current query results:
user relevance feedback

2. Global Approach:
• Use information from a document collection:
query expansion

1.2.6.1 User Relevance Feedback

[Diagram: the user's information need is formulated as a query; the ranking system matches it against content features extracted from the information items and returns a ranked/binary result for browsing; through relevance feedback the user identifies relevant results, from which the system derives a modified query, e.g., by query term reweighting]
Feedback from Users

[Venn diagram: the set Cr of all relevant documents overlaps with a retrieval result R]
Dr = documents in the result identified by the user as being relevant
Dn = documents in the result identified by the user as being non-relevant
Rocchio Algorithm

Rocchio algorithm: find a query that optimally separates relevant from non-relevant documents

Centroid of a document set D
μ(D) = (1/|D|) · Σ d∈D d
Illustration of Rocchio Algorithm

[Figure: querying at the centroid μ(Dr) alone; the annotations indicate lower precision and lower recall]

Illustration of Rocchio Algorithm

[Figure: the difference vector μ(Dr) − μ(Dn) points away from the non-relevant documents towards the relevant ones]

Illustration of Rocchio Algorithm

[Figure: optimal recall and precision are obtained at μ(Dr) + [μ(Dr) − μ(Dn)]]
Identifying Relevant Documents

Following the previous reasoning, the optimal query under cosine similarity is

qopt = μ(Dr) + [μ(Dr) − μ(Dn)]

Practical issues
• User relevance feedback is not complete
• Users do not necessarily identify non-relevant documents
• The original query should continue to be considered
SMART: Practical Relevance Feedback

Approximation scheme for the theoretically optimal query vector

If users identify some relevant documents Dr from the result set R of a retrieval query q
– Assume all elements in R \ Dr are not relevant, i.e., Dn = R \ Dr
– Modify the query to approximate the theoretically optimal query
  q' = α·q + β·μ(Dr) − γ·μ(Dn)
– α, β, γ are tuning parameters, α, β, γ ≥ 0
– Example: α = 1, β = 0.75, γ = 0.25
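A minimal sketch of this modified-query computation over term-weight vectors (the toy vocabulary is invented, and clipping negative weights to zero is a common convention, not something the slides prescribe):

```python
import numpy as np

def smart_rocchio(q, D_r, D_n, alpha=1.0, beta=0.75, gamma=0.25):
    """q' = alpha*q + beta*centroid(D_r) - gamma*centroid(D_n).
    q is a term-weight vector; D_r, D_n are lists of document vectors."""
    q_new = alpha * q
    if D_r:
        q_new = q_new + beta * np.mean(D_r, axis=0)
    if D_n:
        q_new = q_new - gamma * np.mean(D_n, axis=0)
    # negative term weights are usually clipped to zero
    return np.maximum(q_new, 0.0)

# toy vocabulary: [application, theory, algorithm, delay]
q   = np.array([1.0, 1.0, 0.0, 0.0])
d_r = [np.array([1.0, 1.0, 1.0, 0.0])]   # marked relevant
d_n = [np.array([0.0, 1.0, 0.0, 1.0])]   # marked non-relevant
print(smart_rocchio(q, d_r, d_n))        # [1.75, 1.5, 0.75, 0.0]
```

Note how the term "algorithm", absent from the original query, receives a positive weight via β, which is what reorders the results in the example on the next slide.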
Example

Query q = "application theory"

Result
0.77: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
0.68: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
0.23: B11 Oscillation Theory for Neutral Differential Equations with Delay
0.23: B12 Oscillation Theory of Delay Differential Equations

Query reformulation

Result for reformulated query
0.87: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
0.61: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
0.29: B7 Knapsack Problems: Algorithms and Computer Implementations
0.23: B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
Discussion
Underlying assumptions of SMART algorithm
1. Original query contains sufficient number of relevant terms
2. Results contain new relevant terms that co-occur with original query
terms
3. Relevant documents form a single cluster
4. Users are willing to provide feedback (!)
All assumptions can be violated in practice
Practical considerations
• Modified queries are complex → expensive processing
• Explicit relevance feedback consumes user time → could be used in other
ways

Can documents which do not contain any keywords of the
original query receive a positive similarity coefficient after
relevance feedback?
1. No
2. Yes, independent of the values β and γ
3. Yes, but only if β > 0
4. Yes, but only if γ > 0

In which year did Rocchio publish his work on relevance feedback?
A. 1965
B. 1975
C. 1985
D. 1995
Pseudo-Relevance Feedback

If users do not give feedback, automate the process
– Choose the top-k documents as the relevant ones
– Extend the query by selecting from the top-k documents the most relevant terms, according to some weighting scheme (see the sketch below)
– Alternatively: apply the SMART algorithm

Often works well
– But can fail horribly: query drift
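A minimal sketch of the first variant (the names are my own, and tf-idf merely stands in for "some weighting scheme"):

```python
from collections import Counter
import math

def pseudo_feedback_expand(query_terms, ranked_docs, k=3, n_new=2):
    """Treat the top-k ranked documents as relevant and append their
    highest-weighted terms (tf-idf here) to the query."""
    n = len(ranked_docs)
    df = Counter(t for doc in ranked_docs for t in set(doc))
    weights = Counter()
    for doc in ranked_docs[:k]:
        for t, tf in Counter(doc).items():
            weights[t] += tf * math.log(n / df[t])
    new_terms = [t for t, _ in weights.most_common()
                 if t not in query_terms][:n_new]
    return query_terms + new_terms

docs = [d.split() for d in [
    "automatic differentiation of algorithms",
    "applications to convolution theory",
    "oscillation theory of delay equations",
]]
print(pseudo_feedback_expand(["application", "theory"], docs, k=2))
```

With a larger k, terms such as "oscillation" would also enter the query, which is exactly how query drift happens.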
Weighting Schemes

[Table of term-weighting schemes for selecting expansion terms; see Yoo & Choi (2011) in the references for an evaluation of such term ranking algorithms]
1.2.6.2 Global Query Expansion
Query is expanded using a global, query-independent resource
• Manually edited thesaurus
• Automatically extracted thesaurus
• Query logs

Manually Created Thesaurus

Expensive to create and maintain
• Used mainly in science and engineering

Example: PubMed
https://www.ncbi.nlm.nih.gov/pubmed/?term=cancer
Automatic Thesaurus Generation

Generate a thesaurus automatically by analyzing the distribution of words in documents
– Problem: find words with similar meaning (synonyms)

Approach 1: Two words are similar if they co-occur with similar words
"switzerland" ≈ "austria" because both occur with words such as "national", "election", "soccer" etc., so they must be similar

Approach 2: Two words are similar if they occur in the same text patterns
"live in *", "travel to *", "size of *" are all patterns in which both "switzerland" and "austria" can occur
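A minimal sketch of Approach 1 (context vectors over a co-occurrence window, compared with cosine similarity; the window size and toy corpus are my own assumptions):

```python
from collections import Counter, defaultdict
import math

def context_vectors(docs, window=2):
    """Represent each word by the words it co-occurs with in a window."""
    vecs = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if i != j:
                    vecs[w][doc[j]] += 1
    return vecs

def cosine(u, v):
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

docs = [
    "national election in switzerland".split(),
    "national election in austria".split(),
    "soccer match in austria".split(),
]
vecs = context_vectors(docs)
print(cosine(vecs["switzerland"], vecs["austria"]))  # high: similar contexts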
Expansion using Query Logs

Query logs are an important resource for query expansion with search engines
• Exploit correlations in user sessions

Example 1: users extend the query
• After searching "Obama", users search "Obama president"
• Therefore, "president" might be a good expansion
Example 2: users refer to the same result
• User A accesses URL epfl.ch after searching "Aebischer"
• User B accesses URL epfl.ch after searching "Vetterli"
• "Vetterli" might be a potential expansion for the query "Aebischer"
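As an illustration of Example 1 only, a minimal sketch (the session format and all names are invented) that extracts expansion candidates when users extend a previous query within a session:

```python
from collections import defaultdict

def expansions_from_sessions(sessions):
    """If users frequently follow query q with a longer query that
    starts with q, the extra words are expansion candidates for q."""
    candidates = defaultdict(list)
    for queries in sessions:
        for q1, q2 in zip(queries, queries[1:]):
            if q2.startswith(q1 + " "):
                candidates[q1].append(q2[len(q1) + 1:])
    return candidates

sessions = [
    ["obama", "obama president"],
    ["obama", "obama president"],
    ["aebischer", "epfl staff"],
]
print(dict(expansions_from_sessions(sessions)))
# {'obama': ['president', 'president']}
```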
References

Papers
– Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing (pp. 313-323). Prentice-Hall.
– Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 275-281).
– Yoo, S., & Choi, J. (2011). Evaluation of term ranking algorithms for pseudo-relevance feedback in MEDLINE retrieval. Healthcare Informatics Research, 17(2), 120-130.
