Introduction to Information Retrieval
CS276
Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 7: Scoring and results assembly
Recap: cosine(query,document)
Dot product of unit vectors:

\[
\cos(\vec{q},\vec{d})
= \frac{\vec{q}\cdot\vec{d}}{\lvert\vec{q}\rvert\,\lvert\vec{d}\rvert}
= \frac{\vec{q}}{\lvert\vec{q}\rvert}\cdot\frac{\vec{d}}{\lvert\vec{d}\rvert}
= \frac{\sum_{i=1}^{\lvert V\rvert} q_i\, d_i}
       {\sqrt{\sum_{i=1}^{\lvert V\rvert} q_i^2}\;\sqrt{\sum_{i=1}^{\lvert V\rvert} d_i^2}}
\]
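To make the recap concrete, here is a minimal Python sketch of the formula, assuming (purely for illustration, not part of the slides) that the query and document are held as sparse term-to-tf-idf-weight dictionaries, so that q_i and d_i are the weights of term i.

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse tf-idf vectors (term -> weight dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())    # sum_i q_i * d_i
    q_norm = math.sqrt(sum(w * w for w in q.values()))    # |q|
    d_norm = math.sqrt(sum(w * w for w in d.values()))    # |d|
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

# Tiny illustrative example (the weights are made up):
print(cosine({"rising": 0.5, "interest": 0.8}, {"interest": 0.3, "rates": 0.9}))
```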
This lecture
Speeding up vector space ranking
Putting together a complete search system
Will require learning about a number of miscellaneous topics and heuristics
Sec. 7.1.1
Bottlenecks
Primary computational bottleneck in scoring: cosine computation
Can we avoid all this computation?
Yes, but may sometimes get it wrong
a doc not in the top K may creep into the list of K output docs
Is this such a bad thing?
Sec. 7.1.1
Generic approach
Find a set A of contenders, with K < |A| << N
A does not necessarily contain the top K, but has many docs from among the top K
Return the top K docs in A
Think of A as pruning non-contenders
The same approach is also used for other (non-cosine) scoring functions
Will look at several schemes following this approach
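A minimal sketch of this generic approach, assuming a contender set A and an arbitrary score(query, doc) function (both placeholders for illustration):

```python
import heapq

def top_k_from_contenders(query, A, score, K):
    """Score only the docs in the contender set A (K < |A| << N) and return
    the K highest-scoring ones; docs outside A are never scored."""
    return heapq.nlargest(K, A, key=lambda d: score(query, d))
```

Whether the true top K survive depends entirely on how A is built; the schemes that follow are different ways of constructing A.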
Sec. 7.1.2
Index elimination
The basic cosine computation algorithm only considers docs containing at least one query term
Take this further:
Only consider high-idf query terms
Only consider docs containing many query terms
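A sketch of both pruning ideas, assuming hypothetical in-memory structures postings[term] -> set of docIDs and idf[term]; the cutoff values are illustrative.

```python
def contender_docs(query_terms, postings, idf, idf_cutoff=2.0, min_terms=3):
    # Keep only high-idf query terms: low-idf terms have long postings lists
    # but contribute little to the final score.
    strong_terms = [t for t in query_terms if idf.get(t, 0.0) >= idf_cutoff]

    # Keep only docs containing "many" (here: at least min_terms) of those terms.
    counts = {}
    for t in strong_terms:
        for doc in postings.get(t, ()):
            counts[doc] = counts.get(doc, 0) + 1
    return {doc for doc, c in counts.items() if c >= min_terms}
```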
Sec. 7.1.2
3 of 4 query terms
Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32
Scores only computed for docs 8, 16 and 32: the docs containing at least 3 of the 4 query terms.
Champion lists
Precompute for each dictionary term t, the r docs of highest weight in t’s postings
Call this the champion list for t
(aka fancy list or top docs for t)
Note that r has to be chosen at index build time
Thus, it’s possible that r < K
At query time, only compute scores for docs in the champion list of some query term
Pick the K top-scoring docs from amongst these
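A sketch of champion lists, assuming a hypothetical weighted index weights[term] -> {docID: w_{t,d}} and an assumed query-dependent score(doc) function; note that r is fixed at build time and may well be smaller than K.

```python
import heapq

def build_champion_lists(weights, r):
    """At index build time: for each term, keep the r docs of highest weight."""
    return {t: {doc for doc, w in heapq.nlargest(r, docs.items(), key=lambda item: item[1])}
            for t, docs in weights.items()}

def champion_query(query_terms, champions, score, K):
    """At query time: score only docs on the champion list of some query term."""
    candidates = set()
    for t in query_terms:
        candidates |= champions.get(t, set())
    return heapq.nlargest(K, candidates, key=score)
```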
Sec. 7.1.3
Exercises
How do Champion Lists relate to Index Elimination?
Can they be used together?
How can Champion Lists be implemented in an inverted index?
Note that the champion list has nothing to do with small docIDs
Sec. 7.1.4
Modeling authority
Assign to each document d a query-independent quality score in [0,1]
Denote this by g(d)
Thus, a quantity like the number of citations is scaled into [0,1]
Exercise: suggest a formula for this.
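One illustrative answer to the exercise (an assumption on my part, not given on the slide): scale the citation count c(d) into [0, 1) with a saturating function, e.g.

\[
g(d) = \frac{c(d)}{c(d) + \lambda}
\qquad\text{or}\qquad
g(d) = 1 - e^{-c(d)/\lambda}, \quad \lambda > 0,
\]

where c(d) is the citation count; both map zero citations to g(d) = 0 and approach 1 as c(d) grows.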
Sec. 7.1.4
Net score
Consider a simple total score combining cosine relevance and authority
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination
Indeed, any function of the two “signals” of user happiness – more later
Now we seek the top K docs by net score
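A one-line sketch of the net score; the optional weight alpha covers the “some other linear combination” case, and alpha = 1 gives the unweighted sum on the slide (g and cosine are assumed functions, as above).

```python
def net_score(q, d, g, cosine, alpha=1.0):
    # g(d): query-independent quality in [0,1]; cosine(q, d): query-dependent relevance.
    return g(d) + alpha * cosine(q, d)
```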
Sec. 7.1.5
Impact-ordered postings
We only want to compute scores for docs for which wf_{t,d} is high enough
We sort each postings list by wf_{t,d}
Now: not all postings in a common order!
How do we compute scores in order to pick off top K?
Two ideas follow
Sec. 7.1.5
1. Early termination
When traversing t’s postings, stop early once either
a fixed number r of docs has been seen, or
wf_{t,d} drops below some threshold
Take the union of the resulting sets of docs
One from the postings of each query term
Compute only the scores for docs in this union
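A sketch of early termination, assuming each impact-ordered postings list is a list of (docID, wf_{t,d}) pairs already sorted by decreasing wf_{t,d}; the layout, the value of r, and the threshold are illustrative assumptions.

```python
def early_termination_union(query_terms, impact_postings, r=100, wf_threshold=0.1):
    union = set()
    for t in query_terms:
        for i, (doc, wf) in enumerate(impact_postings.get(t, [])):
            if i >= r or wf < wf_threshold:
                break                  # stop early in t's postings
            union.add(doc)
    return union                       # compute full scores only for these docs
```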
Sec. 7.1.5
2. idf-ordered terms
When considering the postings of query terms
Look at them in order of decreasing idf
High idf terms likely to contribute most to score
As we update score contribution from each query term
Stop if doc scores relatively unchanged
Can apply to cosine or some other net scores
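A sketch of idf-ordered accumulation, with a made-up stopping rule (“the current term’s largest contribution is tiny”) standing in for “doc scores relatively unchanged”; the data layout matches the assumed impact-ordered postings above.

```python
def idf_ordered_scores(query_terms, impact_postings, idf, eps=1e-3):
    scores = {}
    # Consider query terms in order of decreasing idf: high-idf terms are
    # likely to contribute most to the score.
    for t in sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True):
        largest_contrib = 0.0
        for doc, wf in impact_postings.get(t, []):
            contrib = idf.get(t, 0.0) * wf
            scores[doc] = scores.get(doc, 0.0) + contrib
            largest_contrib = max(largest_contrib, contrib)
        if largest_contrib < eps:   # remaining lower-idf terms barely change the ranking
            break
    return scores
```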
Sec. 7.1.6
Visualization
(Figure: a query point, the leaders, and their followers)
Sec. 7.1.6
General variants
Have each follower attached to b1=3 (say) nearest leaders.
From query, find b2=4 (say) nearest leaders and their followers.
Can recurse on leader/follower construction.
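A sketch of the leader/follower (cluster pruning) scheme behind the picture, assuming documents are vectors indexed by position, √N leaders are sampled at random, and a cosine(u, v) function as above; b1 and b2 play the roles described on this slide.

```python
import math
import random

def build_leaders(docs, cos, b1=1):
    """Preprocessing: sample sqrt(N) leaders; attach every doc to its b1 nearest leaders."""
    leaders = random.sample(range(len(docs)), max(1, int(math.sqrt(len(docs)))))
    followers = {L: [] for L in leaders}
    for d in range(len(docs)):
        for L in sorted(leaders, key=lambda L: cos(docs[d], docs[L]), reverse=True)[:b1]:
            followers[L].append(d)
    return leaders, followers

def cluster_query(q, docs, leaders, followers, cos, K, b2=1):
    """Query time: find the b2 nearest leaders, then rank only their followers."""
    nearest = sorted(leaders, key=lambda L: cos(q, docs[L]), reverse=True)[:b2]
    candidates = {d for L in nearest for d in followers[L]}
    return sorted(candidates, key=lambda d: cos(q, docs[d]), reverse=True)[:K]
```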
Sec. 7.1.6
Exercises
To find the nearest leader in step 1, how many cosine computations do we do?
Why did we have √N in the first place?
What is the effect of the constants b1, b2 on the previous slide?
Devise an example where this is likely to fail – i.e., we miss one of the K nearest docs.
Likely under random sampling.
Sec. 6.1
Fields
We sometimes wish to search by document metadata
E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
Year = 1601 is an example of a field
Also, author last name = shakespeare, etc.
Field or parametric index: postings for each field value
Sometimes build range trees (e.g., for dates)
Field query typically treated as conjunction
(doc must be authored by shakespeare)
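A toy sketch of a parametric (field) index and its conjunctive use with the text index; the docIDs and data layout are invented for illustration.

```python
# Hypothetical toy data: postings per (field, value) pair and per text term.
field_index = {
    ("author", "shakespeare"): {1, 7, 42},
    ("year", "1601"): {7, 42, 99},
}
text_index = {"alas": {7, 13, 42}, "poor": {7, 42}, "yorick": {7, 42, 80}}

def field_query(field_constraints, terms):
    """Treat the field constraints as a conjunction with the text query."""
    sets = [field_index.get(fv, set()) for fv in field_constraints]
    sets += [text_index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(field_query([("author", "shakespeare"), ("year", "1601")],
                  ["alas", "poor", "yorick"]))   # -> docs 7 and 42
```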
Sec. 6.1
Zone
A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
Title
Abstract
References …
Build inverted indexes on zones as well to permit querying
E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Sec. 7.2.1
Tiered indexes
Break postings up into a hierarchy of lists
Most important
…
Least important
Can be done by g(d) or another measure
Inverted index thus broken up into tiers of decreasing importance
At query time use top tier unless it fails to yield K docs
If so drop to lower tiers
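A sketch of tiered query processing, assuming tiers is a list of indexes ordered from most to least important (e.g., split by g(d)) and run_query(query, tier) returns the matching docs in that tier; both are placeholders, not a prescribed interface.

```python
def tiered_query(query, tiers, run_query, K):
    results, seen = [], set()
    for tier in tiers:                  # most important tier first
        for doc in run_query(query, tier):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
        if len(results) >= K:           # enough docs: don't descend to lower tiers
            break
    return results[:K]
```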
Sec. 7.2.3
Query parsers
Free text query from user may in fact spawn one or more queries to the indexes, e.g., query rising interest rates
Run the query as a phrase query
If <K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates
If we still have <K docs, run the vector space query rising interest rates
Rank matching docs by vector space scoring
This sequence is issued by a query parser
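A sketch of that parsing sequence, with phrase_query and vector_space_query as assumed interfaces onto the underlying indexes (not a real API), each returning a set of matching docIDs.

```python
def parse_and_run(terms, phrase_query, vector_space_query, K):
    # 1. Run the whole query as a phrase query, e.g. "rising interest rates".
    results = set(phrase_query(terms))
    # 2. If < K docs match, try the shorter phrases, e.g. "rising interest", "interest rates".
    if len(results) < K:
        for i in range(len(terms) - 1):
            results |= set(phrase_query(terms[i:i + 2]))
    # 3. If still < K docs, fall back to the plain vector space query.
    if len(results) < K:
        results |= set(vector_space_query(terms))
    return results   # the matching docs are then ranked by vector space scoring
```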
Sec. 7.2.3
Aggregate scores
We’ve seen that score functions can combine cosine, static quality, proximity, etc.
How do we know the best combination?
Some applications – expert-tuned
Increasingly common: machine-learned
See May 19th lecture
Sec. 7.2.4
Resources
IIR 7, 6.1