Week 3 - Probabilistic Retrieval and Relevance Feedback

The document discusses probabilistic information retrieval models, focusing on the Query Likelihood Model and Language Modeling to estimate the relevance of documents to user queries. It highlights the importance of smoothing techniques to handle unseen terms and the role of user relevance feedback in improving query results. Additionally, it covers methods for query expansion, including local and global approaches, and the use of thesauri and query logs for enhancing search effectiveness.

1.2.5 Probabilistic Information Retrieval

The notion of similarity in the vector space model has no explanation of how it relates to relevance
• The similarity values are only used to rank documents

An information retrieval model deals with uncertainty about the user's information needs
• Probability theory provides a principled approach to reasoning about this uncertainty

Probabilistic IR models attempt to provide an explainable model of relevance
Query Likelihood Model

Given query q, determine the probability P(d|q) that document d is relevant to query q

Bayes Rule
P(d|q) = P(q|d) · P(d) / P(q)

Assumptions
• P(d), the probability of a document occurring, is uniform across the collection
• P(q) is the same for all queries

Thus: P(d|q) can be derived from P(q|d), the query likelihood
Language Modeling

Query likelihood: determine P(q|d)

Assume each document d is generated by a language model Md
• a language model is a mechanism that generates the words of the language

Then P(q|Md) can be interpreted as the probability that the query was generated by the language model of document d
What is a Language Model?

Deterministic language model = automaton = grammar

[Figure: an automaton that emits "information" followed by "retrieval", with a loop back to the start]

This model can produce:
"information retrieval"
"information retrieval information retrieval"
It cannot produce:
"retrieval information"
Probabilistic Language Model

Unigram model: assign a probability to each term to appear
• More complex models can be used, e.g., bigrams

Model M1           Model M2
STOP    0.2        STOP    0.2
the     0.2        the     0.15
a       0.1        a       0.12
frog    0.03       frog    0.0002
toad    0.03       toad    0.0001
said    0.02       said    0.01
likes   0.015      likes   0.01
dog     0.01       dog     0.04

Two different language models derived from 2 documents
Probability to Create a Query

What is the probability that a query q has been generated by model Mi?

Example: q = "the frog said dog STOP"

P(q|M1) = 0.2 * 0.03 * 0.02 * 0.01 * 0.2 = 0.000 000 24
P(q|M2) = 0.15 * 0.0002 * 0.01 * 0.04 * 0.2 = 0.000 000 0024

Retrieval becomes the problem of computing, for a query q, the probability P(q|Md) for all documents d
Learning the Model

Learning the model means we have to estimate the probability of a query to occur

First step: estimate how likely a single term occurs

Maximum Likelihood Estimation (MLE) of probabilities under the Unigram Model

P(t|Md) = tf(t,d) / |d|

where
• tf(t,d) is the number of occurrences of t in d (term frequency)
• |d| is the number of terms in the document (document length)
Using the Model

Independence assumption: different terms in a query are assumed to occur independently

P(q|Md) = ∏ t∈q P(t|Md)
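To make this concrete, here is a minimal Python sketch (the function names and toy document are my own, not from the slides) of the MLE unigram model combined with the independence assumption:

```python
from collections import Counter

def mle_model(doc_tokens):
    """MLE unigram model: P(t|Md) = tf(t,d) / |d|."""
    tf = Counter(doc_tokens)
    return {t: c / len(doc_tokens) for t, c in tf.items()}

def query_likelihood(query_tokens, model):
    """P(q|Md) = product of P(t|Md) over the query terms (independence)."""
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)  # unseen term -> factor 0
    return p

md = mle_model("information retrieval and search".split())
print(query_likelihood(["information", "search"], md))    # 1/4 * 1/4 = 0.0625
print(query_likelihood(["information", "database"], md))  # 0.0: unseen term
```

The second call already shows the zero-probability problem discussed below: a single unseen query term drives the whole product to zero.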
Consider the document:
"Information retrieval is the task of finding the documents satisfying the information needs of the user"

Using MLE to estimate the unigram probability model, what is P(the|Md) and P(information|Md)?

1. 1/16 and 1/16
2. 1/12 and 1/12
3. 1/4 and 1/8
4. 1/3 and 1/6
Consider the following document

d = "information retrieval and search"

1. P(information search | Md) > P(information | Md)
2. P(information search | Md) = P(information | Md)
3. P(information search | Md) < P(information | Md)
Issues with MLE

Problem 1: if the query contains a term not occurring in the document, then P(q|Md) = 0!

Problem 2: this is an estimation! A term that occurs once might have been "lucky", whereas another term with the same probability to occur is not contained in the document

→ need to give non-zero probability to unseen terms!
Smoothing

Idea: add a small weight for terms not occurring in a document
• the weight should be smaller than the normalized collection frequency

P(t|Mc) = cf(t) / |c|

where
• cf(t) = number of times term t occurs in the collection
• |c| = total number of terms in the collection

Smoothed estimate
P̂(t|d) = λ · tf(t,d)/|d| + (1−λ) · cf(t)/|c|

Mc = language model of the whole collection
λ = tuning parameter
Probabilistic Retrieval

With smoothing the relevance is computed as

P(q|d) = ∏ t∈q ( λ · tf(t,d)/|d| + (1−λ) · cf(t)/|c| )

From a technical perspective the probabilities are computed using term and document frequencies
– the same data is used as in vector space retrieval

Probabilistically motivated models generally show better performance
– But parameter tuning (λ) is critical
– λ can be query-dependent, e.g., depend on the query size
Example

Collection consisting of d1 and d2
d1: Einstein was one of the greatest scientists
d2: Albert Einstein received the Nobel prize

Query q: Albert Einstein

Using λ = 1/2:
P(q|d1) = ½ * (0/7 + 1/13) * ½ * (1/7 + 2/13) ≈ 0.0057
P(q|d2) = ½ * (1/6 + 1/13) * ½ * (1/6 + 2/13) ≈ 0.0195
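As a sanity check, a minimal Python sketch (tokenization and names are my own assumptions) that reproduces the two numbers above with the linear smoothing of the previous slides:

```python
from collections import Counter

def smoothed_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) = product over query terms t of
    lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|c|."""
    tf = Counter(doc)
    cf = Counter(collection)
    p = 1.0
    for t in query:
        p *= lam * tf[t] / len(doc) + (1 - lam) * cf[t] / len(collection)
    return p

d1 = "einstein was one of the greatest scientists".split()
d2 = "albert einstein received the nobel prize".split()
collection = d1 + d2                     # |c| = 13 terms
q = "albert einstein".split()

print(smoothed_likelihood(q, d1, collection))  # ~0.0057
print(smoothed_likelihood(q, d2, collection))  # ~0.0195
```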
Example: Comparing VS and PR

[Figure: precision-recall comparison of vector space and probabilistic retrieval, from Ponte & Croft, 1998]
Properties of Retrieval Models

                        Vector Space Model      Language Model          BM25 (another prob. model)
Model                   geometric               probabilistic           probabilistic
Length normalization    Requires extensions     Inherent to model       Tuning parameters
                        (pivot normalization)
Inverse document        Used directly           Smoothing and           Used directly
frequency                                       collection frequency
                                                has similar effect
Multiple term           Taken into account      Taken into account      Ignored
occurrences
Simplicity              No tuning required      Tuning essential        Tuning essential
1.2.6 Query Expansion

If the user query does not contain any relevant term, a corresponding relevant document will not show up in the result

Example: query "car" will not return "automobile"

How to add such documents (increase recall)?

Idea: the system adds query terms to the user query!
Two Methods for Extending Queries

1. Local Approach:
• Use information from current query results:
user relevance feedback

2. Global Approach:
• Use information from a document collection:
query expansion

1.2.6.1 User Relevance Feedback

[Diagram: the user's information need is formulated as a query; the ranking system matches it against content features extracted from the information items and returns a ranked/binary result for browsing; through relevance feedback the user identifies relevant results, from which the system derives a modified query, e.g., by query term reweighting]
Feedback from Users

[Venn diagram: the set Cr of all relevant documents overlaps with a retrieval result R]
Dr = documents in the result identified by the user as being relevant
Dn = documents in the result identified by the user as being non-relevant
Rocchio Algorithm

Rocchio algorithm: find a query that optimally separates relevant from non-relevant documents

Centroid of a document set D
μ(D) = (1/|D|) · Σ d∈D d
Illustration of Rocchio Algorithm

[Figure: querying at the centroid μ(Dr) alone; the annotations indicate lower precision and lower recall]

Illustration of Rocchio Algorithm

[Figure: the difference vector μ(Dr) − μ(Dn) points away from the non-relevant documents towards the relevant ones]

Illustration of Rocchio Algorithm

[Figure: optimal recall and precision are obtained at μ(Dr) + [μ(Dr) − μ(Dn)]]
Identifying Relevant Documents

Following the previous reasoning, the optimal query under cosine similarity is

qopt = μ(Dr) + [μ(Dr) − μ(Dn)]

Practical issues
• User relevance feedback is not complete
• Users do not necessarily identify non-relevant documents
• The original query should continue to be considered
SMART: Practical Relevance Feedback

Approximation scheme for the theoretically optimal query vector

If users identify some relevant documents Dr from the result set R of a retrieval query q
– Assume all elements in R \ Dr are not relevant, i.e., Dn = R \ Dr
– Modify the query to approximate the theoretically optimal query
  q' = α·q + β·μ(Dr) − γ·μ(Dn)
– α, β, γ are tuning parameters, α, β, γ ≥ 0
– Example: α = 1, β = 0.75, γ = 0.25
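A minimal sketch of this modified-query computation over term-weight vectors (the toy vocabulary is invented, and clipping negative weights to zero is a common convention, not something the slides prescribe):

```python
import numpy as np

def smart_rocchio(q, D_r, D_n, alpha=1.0, beta=0.75, gamma=0.25):
    """q' = alpha*q + beta*centroid(D_r) - gamma*centroid(D_n).
    q is a term-weight vector; D_r, D_n are lists of document vectors."""
    q_new = alpha * q
    if D_r:
        q_new = q_new + beta * np.mean(D_r, axis=0)
    if D_n:
        q_new = q_new - gamma * np.mean(D_n, axis=0)
    # negative term weights are usually clipped to zero
    return np.maximum(q_new, 0.0)

# toy vocabulary: [application, theory, algorithm, delay]
q   = np.array([1.0, 1.0, 0.0, 0.0])
d_r = [np.array([1.0, 1.0, 1.0, 0.0])]   # marked relevant
d_n = [np.array([0.0, 1.0, 0.0, 1.0])]   # marked non-relevant
print(smart_rocchio(q, d_r, d_n))        # [1.75, 1.5, 0.75, 0.0]
```

Note how the term "algorithm", absent from the original query, receives a positive weight via β, which is what reorders the results in the example on the next slide.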
Example

Query q = "application theory"

Result
0.77: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
0.68: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
0.23: B11 Oscillation Theory for Neutral Differential Equations with Delay
0.23: B12 Oscillation Theory of Delay Differential Equations

Query reformulation

Result for reformulated query
0.87: B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
0.61: B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
0.29: B7 Knapsack Problems: Algorithms and Computer Implementations
0.23: B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra
Discussion
Underlying assumptions of SMART algorithm
1. Original query contains sufficient number of relevant terms
2. Results contain new relevant terms that co-occur with original query
terms
3. Relevant documents form a single cluster
4. Users are willing to provide feedback (!)
All assumptions can be violated in practice
Practical considerations
• Modified queries are complex → expensive processing
• Explicit relevance feedback consumes user time → could be used in other
ways

Can documents which do not contain any keywords of the
original query receive a positive similarity coefficient after
relevance feedback?
1. No
2. Yes, independent of the values β and γ
3. Yes, but only if β > 0
4. Yes, but only if γ > 0

In which year did Rocchio publish his work on relevance feedback?
A. 1965
B. 1975
C. 1985
D. 1995
Pseudo-Relevance Feedback

If users do not give feedback, automate the process
– Choose the top-k documents as the relevant ones
– Extend the query by selecting from the top-k documents the most relevant terms, according to some weighting scheme (see the sketch below)
– Alternatively: apply the SMART algorithm

Often works well
– But can fail horribly: query drift
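A minimal sketch of the first variant (the names are my own, and tf-idf merely stands in for "some weighting scheme"):

```python
from collections import Counter
import math

def pseudo_feedback_expand(query_terms, ranked_docs, k=3, n_new=2):
    """Treat the top-k ranked documents as relevant and append their
    highest-weighted terms (tf-idf here) to the query."""
    n = len(ranked_docs)
    df = Counter(t for doc in ranked_docs for t in set(doc))
    weights = Counter()
    for doc in ranked_docs[:k]:
        for t, tf in Counter(doc).items():
            weights[t] += tf * math.log(n / df[t])
    new_terms = [t for t, _ in weights.most_common()
                 if t not in query_terms][:n_new]
    return query_terms + new_terms

docs = [d.split() for d in [
    "automatic differentiation of algorithms",
    "applications to convolution theory",
    "oscillation theory of delay equations",
]]
print(pseudo_feedback_expand(["application", "theory"], docs, k=2))
```

With a larger k, terms such as "oscillation" would also enter the query, which is exactly how query drift happens.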
Weighting Schemes

[Table of term-weighting schemes for selecting expansion terms; see Yoo & Choi (2011) in the references for an evaluation of such term ranking algorithms]
1.2.6.2 Global Query Expansion
Query is expanded using a global, query-independent resource
• Manually edited thesaurus
• Automatically extracted thesaurus
• Query logs

Manually Created Thesaurus

Expensive to create and maintain
• Used mainly in science and engineering

Example: PubMed
https://www.ncbi.nlm.nih.gov/pubmed/?term=cancer
Automatic Thesaurus Generation

Generate a thesaurus automatically by analyzing the distribution of words in documents
– Problem: find words with similar meaning (synonyms)

Approach 1: Two words are similar if they co-occur with similar words
"switzerland" ≈ "austria" because both occur with words such as "national", "election", "soccer" etc., so they must be similar

Approach 2: Two words are similar if they occur in the same text patterns
"live in *", "travel to *", "size of *" are all patterns in which both "switzerland" and "austria" can occur
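A minimal sketch of Approach 1 (context vectors over a co-occurrence window, compared with cosine similarity; the window size and toy corpus are my own assumptions):

```python
from collections import Counter, defaultdict
import math

def context_vectors(docs, window=2):
    """Represent each word by the words it co-occurs with in a window."""
    vecs = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if i != j:
                    vecs[w][doc[j]] += 1
    return vecs

def cosine(u, v):
    num = sum(u[t] * v[t] for t in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

docs = [
    "national election in switzerland".split(),
    "national election in austria".split(),
    "soccer match in austria".split(),
]
vecs = context_vectors(docs)
print(cosine(vecs["switzerland"], vecs["austria"]))  # high: similar contexts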
Expansion using Query Logs

Query logs are an important resource for query expansion with search engines
• Exploit correlations in user sessions

Example 1: users extend the query
• After searching "Obama", users search "Obama president"
• Therefore, "president" might be a good expansion
Example 2: users refer to the same result
• User A accesses URL epfl.ch after searching "Aebischer"
• User B accesses URL epfl.ch after searching "Vetterli"
• "Vetterli" might be a potential expansion for the query "Aebischer"
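As an illustration of Example 1 only, a minimal sketch (the session format and all names are invented) that extracts expansion candidates when users extend a previous query within a session:

```python
from collections import defaultdict

def expansions_from_sessions(sessions):
    """If users frequently follow query q with a longer query that
    starts with q, the extra words are expansion candidates for q."""
    candidates = defaultdict(list)
    for queries in sessions:
        for q1, q2 in zip(queries, queries[1:]):
            if q2.startswith(q1 + " "):
                candidates[q1].append(q2[len(q1) + 1:])
    return candidates

sessions = [
    ["obama", "obama president"],
    ["obama", "obama president"],
    ["aebischer", "epfl staff"],
]
print(dict(expansions_from_sessions(sessions)))
# {'obama': ['president', 'president']}
```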
References

Papers
– Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART Retrieval System: Experiments in Automatic Document Processing (pp. 313-323). Prentice-Hall.
– Ponte, J. M., & Croft, W. B. (1998). A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 275-281).
– Yoo, S., & Choi, J. (2011). Evaluation of term ranking algorithms for pseudo-relevance feedback in MEDLINE retrieval. Healthcare Informatics Research, 17(2), 120-130.
