Lecture 15: Learning to Rank
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Pandu Nayak
Simple example:
Using classification for ad hoc IR
Collect a training corpus of (q, d, r) triples
Relevance r is here binary (but may be multiclass, with 3–7 values)
Document is represented by a feature vector x = (α, ω): α is cosine similarity, ω is minimum query window size
ω is the shortest text span that includes all query words
Query term proximity is a very important new weighting factor
Train a machine learning model to predict the class r of a document-query pair
Simple example:
Using classification for ad hoc IR (IIR Sec. 15.4.1)
A linear score function is then
Score(d, q) = Score(α, ω) = aα + bω + c
And the linear classifier is
Decide relevant if Score(d, q) > θ
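A minimal sketch of this setup, assuming scikit-learn and a handful of made-up (α, ω, r) training triples (none of this is from the lecture itself); the learned coefficients play the roles of a, b, and c above, with θ absorbed into the intercept:

```python
# A minimal sketch (not the lecture's own code): fit the linear score
# Score(d, q) = a*alpha + b*omega + c with scikit-learn on made-up triples.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature vectors x = (alpha = cosine similarity, omega = min query window size)
X = np.array([
    [0.040, 2], [0.035, 3], [0.030, 2],   # judged relevant (r = 1)
    [0.020, 4], [0.015, 5], [0.010, 4],   # judged nonrelevant (r = 0)
])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression().fit(X, y)
a, b = clf.coef_[0]
c = clf.intercept_[0]

# Decide relevant if Score(d, q) > 0 (the threshold theta absorbed into c)
alpha, omega = 0.03, 3                    # features of a new document-query pair
score = a * alpha + b * omega + c
print("score %.3f ->" % score, "relevant" if score > 0 else "nonrelevant")
```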
Simple example:
Using classification for ad hoc IR
[Figure: training documents plotted by term proximity ω (x-axis, roughly 2–5) and cosine score α (y-axis, 0–0.05), each labeled R (relevant) or N (nonrelevant), with a linear decision surface separating the two classes.]
“Learning to rank”
Classification probably isn’t the right way to think about approaching ad hoc IR:
Classification problems: Map to an unordered set of classes
Regression problems: Map to a real value
Ordinal regression problems: Map to an ordered set of classes
A fairly obscure sub-branch of statistics, but what we want here
This formulation gives extra power:
Relations between relevance levels are modeled
Documents are judged as better or worse than other documents for a given query, not on an absolute scale of goodness
“Learning to rank”
Assume a number of categories C of relevance exist
These are totally ordered: c1 < c2 < … < cJ
This is the ordinal regression setup
Assume training data is available consisting of document-query pairs represented as feature vectors ψi and relevance ranking ci
Point-wise learning
Goal is to learn a threshold to separate each rank
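A rough point-wise sketch under these assumptions (illustrative data and helper names, and scikit-learn, none of which appear in the lecture): score documents with a regression on the ordered ranks, treating ranks as numbers, which is a simplification, then place one threshold between each pair of adjacent ranks:

```python
# A rough point-wise sketch (illustrative data and helper names): score documents
# with a regression on the ordered ranks, then separate ranks with thresholds.
import numpy as np
from sklearn.linear_model import LinearRegression

psi = np.array([[0.01, 5], [0.02, 4], [0.03, 3], [0.04, 2], [0.05, 1]])  # feature vectors psi_i
c = np.array([0, 0, 1, 2, 2])                  # ordinal relevance ranks c_i (0 < 1 < 2)

reg = LinearRegression().fit(psi, c)           # treat ranks as numbers (a simplification)
scores = reg.predict(psi)

# One threshold between each pair of adjacent ranks: midpoint of their mean scores
means = [scores[c == r].mean() for r in sorted(set(c))]
thresholds = [(means[k] + means[k + 1]) / 2 for k in range(len(means) - 1)]

def predict_rank(x):
    """Rank = number of thresholds the document's score exceeds."""
    s = reg.predict(np.array([x]))[0]
    return int(sum(s > t for t in thresholds))

print(thresholds, predict_rank([0.035, 2]))
```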
Pairwise learning: The Ranking SVM (IIR Sec. 15.4.2)
[Herbrich et al. 1999, 2000; Joachims et al. 2002]
For each training pair u where di is more relevant than dj, form the difference vector Φu = ψi − ψj and require wΦu > 0; training minimizes the hinge loss
Loss(u) = max(0, 1 − wΦu)
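A minimal sketch of this pairwise reduction, with made-up feature vectors and scikit-learn's LinearSVC standing in for the SVM solver (not the lecture's own code):

```python
# A minimal sketch of the Ranking SVM reduction (made-up data; scikit-learn's
# LinearSVC stands in for the SVM solver): classify difference vectors
# Phi_u = psi_i - psi_j as correctly vs. incorrectly ordered.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

psi = np.array([[0.05, 1], [0.03, 3], [0.01, 5]])   # feature vectors psi_i
c = np.array([2, 1, 0])                              # relevance ranks, higher = more relevant

X_pairs, y_pairs = [], []
for i, j in combinations(range(len(psi)), 2):
    if c[i] == c[j]:
        continue                        # ties carry no ordering information
    diff = psi[i] - psi[j]
    sign = 1 if c[i] > c[j] else -1
    X_pairs += [diff, -diff]            # both orientations, for a balanced problem
    y_pairs += [sign, -sign]

svm = LinearSVC(loss="hinge", fit_intercept=False).fit(np.array(X_pairs), np.array(y_pairs))
w = svm.coef_[0]

# Rank documents for a query by w . psi, highest score first
print("ranking:", np.argsort(-(psi @ w)))
```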
Example: judged documents per query (d = definitely relevant, p = partially relevant, n = nonrelevant)
q1: d p p n n n n
q2: d d p p p n n n n n
q1 pairs: 2*(d, p) + 4*(d, n) + 8*(p, n) = 14
q2 pairs: 6*(d, p) + 10*(d, n) + 15*(p, n) = 31
Queries with more judged documents generate far more pairs, so they dominate the pairwise training set.
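The counts above can be checked mechanically; this small helper (hypothetical, not from the lecture) counts document pairs with different relevance levels per query:

```python
# Checking the pair counts above: count document pairs with different relevance
# levels per query (a small hypothetical helper, not from the lecture).
from collections import Counter
from itertools import combinations

def num_pairs(judgments):
    counts = Counter(judgments)
    return sum(counts[a] * counts[b] for a, b in combinations(counts, 2))

print(num_pairs("dppnnnn"))      # q1: 2 + 4 + 8  = 14
print(num_pairs("ddpppnnnnn"))   # q2: 6 + 10 + 15 = 31
```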
Experiments
OHSUMED (from LETOR)
Features:
6 that represent versions of tf, idf, and tf.idf factors
BM25 score (IIR sec. 11.4.3)
A scoring function derived from a probabilistic approach to IR, which has traditionally done well in TREC evaluations, etc.
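For concreteness, here is a sketch of one common BM25 variant with a toy corpus and the usual k1 and b parameters; implementations differ in details, and this is not necessarily the exact LETOR feature:

```python
# A sketch of one common BM25 variant (toy corpus; k1 and b are the usual
# parameters). Real implementations, including the LETOR feature, differ in details.
import math

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # +1 keeps idf positive
        tf = doc.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["learning", "to", "rank"], ["ranking", "svm"], ["information", "retrieval"]]
print(bm25(["learning", "rank"], docs[0], docs))
```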
Discontinuity Example
NDCG is discontinuous w.r.t. model parameters!

Rank:      1  2  3
Relevance: 0  1  0
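A small illustration of the discontinuity, assuming the three documents above and a made-up score parameterization: NDCG depends only on the induced ordering, so as doc 2's score crosses doc 1's it jumps rather than changing smoothly.

```python
# Illustrating the discontinuity: NDCG depends only on the induced ordering, so
# it jumps when one document's score crosses another's. Uses the three documents
# above (relevances 0, 1, 0); the score parameterization is made up.
import numpy as np

def dcg(rels):
    return sum(r / np.log2(i + 2) for i, r in enumerate(rels))

def ndcg(scores, rels):
    order = np.argsort(-np.asarray(scores))               # rank docs by model score
    return dcg([rels[i] for i in order]) / dcg(sorted(rels, reverse=True))

rel = [0, 1, 0]
for w in [0.99, 0.999, 1.001, 1.01]:                       # a model parameter being varied
    scores = [1.0, w, 0.5]                                 # doc 2's score crosses doc 1's at w = 1
    print(w, round(ndcg(scores, rel), 3))                  # NDCG jumps from ~0.631 to 1.0
```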
Summary
The idea of learning ranking functions has been around for about 20 years
But only recently have ML knowledge, availability of training datasets, a rich space of features, and massive computation come together to make this a hot research area
It’s too early to give a definitive statement on what methods are best in this area … it’s still advancing rapidly
But machine learned ranking over many features now easily beats traditional hand-designed ranking functions in comparative evaluations [in part by using the hand-designed functions as features!]
Resources
IIR secs 6.1.2–3 and 15.4
LETOR benchmark datasets
Website with data, links to papers, benchmarks, etc.
http://research.microsoft.com/users/LETOR/
Everything you need to start research in this area!
Nallapati, R. Discriminative models for information retrieval. SIGIR 2004.
Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y. and Hon, H.-W. Adapting Ranking SVM to Document Retrieval. SIGIR 2006.
Yue, Y., Finley, T., Radlinski, F. and Joachims, T. A Support Vector Method for Optimizing Average Precision. SIGIR 2007.