0809 Query Expansion and Probabilistic Retrieval Model
Introduction to Information Retrieval
Block B
Lecture 8: Query expansion and Probabilistic Retrieval
Relevance Feedback

Relevance feedback:
The user issues a (short, simple) query.
The system offers results.
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on the feedback.

Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.

We will use "ad hoc retrieval" to refer to regular retrieval without relevance feedback.
Similar pages

(Example: a search result with a snippet and a "Similar pages" link.)

Each document $D_i$ is characterized by a description vector $D_i = (A_{i,1}, \ldots, A_{i,m})^T$. Collecting these as the rows of a matrix gives

$$A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,m} \\ \vdots & & \vdots \\ A_{n,1} & \cdots & A_{n,m} \end{pmatrix} = \begin{pmatrix} D_1^T \\ \vdots \\ D_n^T \end{pmatrix}$$

Note that $A^T = (D_1, D_2, \ldots, D_n)$, so

$$A\,A^T = \begin{pmatrix} D_1^T \\ \vdots \\ D_n^T \end{pmatrix} (D_1, \ldots, D_n) = \begin{pmatrix} D_1^T D_1 & \cdots & D_1^T D_n \\ \vdots & & \vdots \\ D_n^T D_1 & \cdots & D_n^T D_n \end{pmatrix} = \begin{pmatrix} \mathrm{Sim}(D_1, D_1) & \cdots & \mathrm{Sim}(D_1, D_n) \\ \vdots & & \vdots \\ \mathrm{Sim}(D_n, D_1) & \cdots & \mathrm{Sim}(D_n, D_n) \end{pmatrix}$$

The entries of $A\,A^T$ are exactly the pairwise document similarities needed for a "similar pages" feature.
Initial query/results (Sec. 9.1.1)

Initial query: New space satellite applications

New information from relevant documents:
1. 0.539, 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
8. 0.509, 12/02/87, Telecommunications Tale of Two Companies

Assumptions:
- words in relevant documents are indicators of relevance
- words in nonrelevant documents are indicators of irrelevance
How do we do it?
Relevancy computation (Sec. 9.1.1)

Assume the query vector is normalized to length 1.
Feedback algorithm

The algorithm:
  Evaluate query q
  repeat
    Offer the k most relevant (yet unseen) documents: T
    Ask for feedback, splitting T into
      a set Cr of relevant documents and
      a set Cnr of nonrelevant documents
    Compute the modified query qm
    Evaluate the modified query qm
  until satisfied
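A minimal sketch of this loop in Python, assuming a vector-space backend; search, ask_user_feedback, and modify_query are hypothetical helpers standing in for the system's components, not part of any particular library:

def relevance_feedback(q0, search, ask_user_feedback, modify_query, k=10):
    # Evaluate the initial query q
    q = q0
    seen = set()
    while True:
        # Offer the k most relevant, not yet seen documents: T
        T = [d for d in search(q) if d not in seen][:k]
        seen.update(T)
        # Ask feedback: T is split into relevant (Cr) and nonrelevant (Cnr)
        Cr, Cnr, satisfied = ask_user_feedback(T)
        if satisfied or not T:
            return q
        # Compute and re-evaluate the modified query qm
        q = modify_query(q, Cr, Cnr)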
Optimal query

Problem: given
  a set Cr of relevant documents and
  a set Cnr of irrelevant documents,
find a query q that best generalizes Cr and Cnr.

(Figure: document collection D containing the relevant documents R, with judged subsets Cr and Cnr.)
bonus for q: the average similarity with the (known) relevant documents:

$$\mathrm{bonus}(q) = \frac{1}{|C_r|} \sum_{d \in C_r} \mathrm{Sim}(d, q) = \frac{1}{|C_r|} \sum_{d \in C_r} d \cdot q$$

Analogously, malus(q) is the average similarity with the (known) nonrelevant documents Cnr.
Optimization problem

Find the query q with length 1 that maximizes the bonus minus malus score:

$$\mathrm{score}(q) = \frac{1}{|C_r|} \sum_{d \in C_r} d \cdot q \;-\; \frac{1}{|C_{nr}|} \sum_{d \in C_{nr}} d \cdot q = \big(\mu(C_r) - \mu(C_{nr})\big) \cdot q$$

where $\mu(C) = \frac{1}{|C|} \sum_{d \in C} d$ is the centroid of C. The score is maximized by the normalized vector $\mu(C_r) - \mu(C_{nr})$. In practice the original query $q_0$ is kept in the mix (Rocchio):

$$q_m = \alpha\, q_0 + \beta\, \mu(C_r) - \gamma\, \mu(C_{nr})$$
Probabilistic Retrieval
• Example:
  – most students are young
  – reversal: a young person is likely to be a student
  – being young is a positive indication of being a student
The algorithm

• In the probabilistic approach we use more or less the same approach as for relevance feedback (a runnable sketch follows after the estimation formulas below):

  The term weights pt and qt get some initial value.
  repeat
    Evaluate the retrieval status value for each document
    Offer the k most relevant (unseen) documents: T
    Ask for feedback, splitting T into
      a set Cr of relevant documents and
      a set Cnr of nonrelevant documents
    Compute new weights for the terms (pt and qt)
  until satisfied
(Figure: document collection D with relevant documents R and nonrelevant documents S.)
Example

terms: interface, user, system, human, computer, response, time, EPS, trees, graph, minors, survey

a: System and human system engineering testing of EPS
b: A survey of user opinion of computer system response time
c: The EPS user interface management system
d: Human machine interface for ABC computer applications
e: Relation of user perceived response time to error measurement
f: The generation of random, binary, ordered trees
g: The intersection graph of paths in trees
h: Graph minors IV: Widths of trees and well-quasi-ordering
i: Graph minors: A survey

Relevant documents: R = {f, g}
Nonrelevant documents: S = {a, i}

p(trees) = 1      q(system) = 0.5
p(graph) = 0.5    q(human) = 0.5
                  q(EPS) = 0.5
                  q(graph) = 0.5
                  q(minors) = 0.5
                  q(survey) = 0.5
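These values are consistent with the simple estimates p_t = r_t/|R| and q_t = s_t/|S|, where r_t and s_t count the relevant and nonrelevant documents containing term t; a quick check in Python, with each document reduced to its terms from the list above:

terms = ["interface", "user", "system", "human", "computer", "response",
         "time", "EPS", "trees", "graph", "minors", "survey"]
R = [{"trees"}, {"graph", "trees"}]                              # f, g
S = [{"system", "human", "EPS"}, {"graph", "minors", "survey"}]  # a, i

for t in terms:
    p = sum(t in d for d in R) / len(R)
    q = sum(t in d for d in S) / len(S)
    if p or q:
        print(f"p({t}) = {p}, q({t}) = {q}")
# reproduces p(trees) = 1.0, p(graph) = 0.5, and q = 0.5 for
# system, human, EPS, graph, minors, survey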
Example

terms: interface, user, system, human, computer, response, time, EPS, trees, graph, minors, survey

Consider document h:
  Graph minors IV: Widths of trees and well-quasi-ordering
Of these terms, h contains trees, graph, and minors.
Incorporating knowledge
• Given a document d, what are its odds of being relevant?

$$\mathrm{Odds}(rel \mid d) = \frac{P(rel \mid d)}{P(nrel \mid d)}$$

By Bayes' rule, $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, so

$$\mathrm{Odds}(rel \mid d) = \frac{P(d \mid rel)\,P(rel)}{P(d \mid nrel)\,P(nrel)}$$
Simplification: independence assumption

$$P(d \mid rel) = \prod_t P(d_t \mid rel), \qquad P(d \mid nrel) = \prod_t P(d_t \mid nrel), \qquad d_t = \begin{cases} 1 & \text{if } t \in d \\ 0 & \text{otherwise} \end{cases}$$
With $p_t = P(d_t = 1 \mid rel)$ and $q_t = P(d_t = 1 \mid nrel)$, the likelihood ratio becomes

$$\frac{P(d \mid rel)}{P(d \mid nrel)} = \frac{\prod_{t \in d} p_t \,\cdot\, \prod_{t \notin d} (1 - p_t)}{\prod_{t \in d} q_t \,\cdot\, \prod_{t \notin d} (1 - q_t)}$$
Using the indicator $d_t$ ($d_t = 1$ if $t \in d$, $0$ otherwise), this can be written as a single product over all terms:

$$\frac{\prod_{t \in d} p_t \,\cdot\, \prod_{t \notin d} (1 - p_t)}{\prod_{t \in d} q_t \,\cdot\, \prod_{t \notin d} (1 - q_t)} = \prod_t \frac{p_t^{d_t}\,(1 - p_t)^{1 - d_t}}{q_t^{d_t}\,(1 - q_t)^{1 - d_t}}$$

To be simplified even further!
Simplification

• Simplify this expression by taking the logarithm:

$$\log \prod_t \frac{p_t^{d_t}\,(1 - p_t)^{1 - d_t}}{q_t^{d_t}\,(1 - q_t)^{1 - d_t}} = \sum_t \left( d_t \log \frac{p_t}{q_t} + (1 - d_t) \log \frac{1 - p_t}{1 - q_t} \right) = \sum_t d_t \log \frac{p_t\,(1 - q_t)}{(1 - p_t)\,q_t} \;+\; \sum_t \log \frac{1 - p_t}{1 - q_t}$$

The first sum is document dependent, the second document independent, so only the first matters for ranking.
This yields the retrieval status value, with $\mathrm{Odds}(p) = p/(1-p)$:

$$RSV(d) = \sum_{t \in d} \log \frac{\mathrm{Odds}(p_t)}{\mathrm{Odds}(q_t)}$$

Equivalently, documents can be ranked by the product $\prod_{t \in d} \mathrm{Odds}(p_t)/\mathrm{Odds}(q_t)$.
Estimating pt and qt (Robertson & Sparck Jones)

$$p_t = \frac{r_t + 0.5}{|R| + 1}, \qquad q_t = \frac{s_t + 0.5}{|S| + 1}$$

where $r_t$ ($s_t$) is the number of relevant (nonrelevant) documents containing term t. The 0.5 works as a mixture with a uniform prior.

• Instead of 0.5, alternative adjustments have been proposed, for example the overall term probability $n_t/N$ ($n_t$ = number of documents containing t, N = collection size):

$$p_t = \frac{r_t + n_t/N}{|R| + 1}, \qquad q_t = \frac{s_t + n_t/N}{|S| + 1}$$
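A sketch of the estimation and scoring steps combined, assuming documents are represented as term sets; rsj_weights and rsv are our names, not the lecture's:

import math

def rsj_weights(terms, R, S):
    # p_t = (r_t + 0.5) / (|R| + 1),  q_t = (s_t + 0.5) / (|S| + 1)
    weights = {}
    for t in terms:
        r_t = sum(t in d for d in R)
        s_t = sum(t in d for d in S)
        p = (r_t + 0.5) / (len(R) + 1)
        q = (s_t + 0.5) / (len(S) + 1)
        # per-term RSV contribution: log(Odds(p_t) / Odds(q_t))
        weights[t] = math.log(p * (1 - q) / ((1 - p) * q))
    return weights

def rsv(doc, weights):
    # retrieval status value: sum the weights of the terms occurring in doc
    return sum(w for t, w in weights.items() if t in doc)

On the earlier example (R = {f, g}, S = {a, i}), the +0.5 adjustment keeps the weight of 'trees' finite even though the raw estimate was p(trees) = 1.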
No Relevance Info (Croft & Harper '79)

• We assume pt to be a constant (typically 0.5).
• Estimate qt by assuming all documents to be nonrelevant, i.e. $q_t = r_t/N$, with $r_t$ now counting all documents containing t. Since $\mathrm{Odds}(p_t) = 1$ for $p_t = 0.5$, the per-term weight reduces to

$$\log \frac{1 - q_t}{q_t} = \log \frac{N - r_t}{r_t}$$

an idf-like weight, with the point-5 formula as extension:

$$\log \frac{N - r_t + 0.5}{r_t + 1}$$
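As a small sketch, the point-5 variant above as a scoring weight; the function name is ours, and r_t is the document frequency of the term:

import math

def croft_harper_weight(r_t, N):
    # no-relevance-information weight: log((N - r_t + 0.5) / (r_t + 1));
    # rare terms get high weights, much like idf
    return math.log((N - r_t + 0.5) / (r_t + 1))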
EXERCISES
Introduction to Information Retrieval
The exercise
• Consider an excerpt of the cooperators of the Information
Sciences Department some years ago:
– Vaandrager, Jacobs, Mader, Dael, Janssen, Capretta, Sarbo,
Barendregt, Barendsen, Parijs, Basten, Paulussen, Tax, Schering,
Schouten, Achten, Peek, Weert, Fehnker
• Cooperators are characterized by letters in their family name,
using the following letters:
– a, e, n, r, t
• Assume some searcher is interested in those persons having
the letters 'r' and 'e' in their family name. The searcher starts
with the query ‘e'.
• Perform 3 cycles of the feedback algorithm with k=3, using
standard Rocchio (α = β = γ = 1).
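A starting point in Python, assuming binary letter vectors and dot-product similarity; vec and dot are illustrative helpers:

NAMES = ["Vaandrager", "Jacobs", "Mader", "Dael", "Janssen", "Capretta",
         "Sarbo", "Barendregt", "Barendsen", "Parijs", "Basten",
         "Paulussen", "Tax", "Schering", "Schouten", "Achten", "Peek",
         "Weert", "Fehnker"]
LETTERS = "aenrt"

def vec(word):
    # binary vector over the letters a, e, n, r, t
    w = word.lower()
    return [1.0 if c in w else 0.0 for c in LETTERS]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

q = vec("e")  # the initial query 'e'
ranked = sorted(NAMES, key=lambda n: dot(vec(n), q), reverse=True)
print(ranked[:3])  # the k = 3 documents offered first (ties broken by list order)

The rocchio function from earlier can then compute the modified query from the user's relevant/nonrelevant split of these three.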
Exercises

2. Comment on the following statement:
   Terms that occur neither in the positive (R) nor in the negative (S) documents are equally likely to be found in relevant and nonrelevant documents.
   Compare this to the Robertson & Sparck Jones estimation for pt and qt.
Exercises

4. We do a simulation of probabilistic retrieval.
• Our collection consists of staff members of the Information Science department (April 21, 1998):
  – The characterization of a staff member is the set of letters in their family name.
  – Our (hidden) information need is the set of staff members with a specific letter in their family name.

Analysis
• Assume query: m
• Random set: Parijs, Giommi, Serrarens, Jones, Stefanova
  – Relevant (R): Giommi
  – Nonrelevant (S): Parijs, Serrarens, Jones, Stefanova
The estimation step is sketched below.
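A sketch of this analysis step, using the Robertson & Sparck Jones point-5 estimates from earlier:

R = [set("giommi")]                                  # relevant: Giommi
S = [set(n.lower()) for n in
     ["Parijs", "Serrarens", "Jones", "Stefanova"]]  # nonrelevant

t = "m"
r_t = sum(t in d for d in R)       # 1: 'm' occurs in Giommi
s_t = sum(t in d for d in S)       # 0: 'm' occurs in no nonrelevant name
p_t = (r_t + 0.5) / (len(R) + 1)   # (1 + 0.5) / (1 + 1) = 0.75
q_t = (s_t + 0.5) / (len(S) + 1)   # (0 + 0.5) / (4 + 1) = 0.1
print(p_t, q_t)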
END OF PRESENTATION