IR Lecture 5b
Evaluation
The major goal of IR is to retrieve
documents relevant to a user query.
The evaluation of the performance of IR
systems relies on the notion of
relevance.
What constitutes relevance?
Relevance
Relevance is subjective in nature, i.e. it
depends on a specific user’s judgment.
Given a query, the same document may be
judged relevant by one user and non-relevant
by another. Only the user can tell the true
relevance.
However, it is not possible to measure this
“true relevance”.
Most of the evaluation of IR systems so far has
been done on document test collections with
known relevance judgments.
Another issue with relevance is the
degree of relevance.
Traditionally, relevance has been
treated as a binary concept, i.e. a
document is judged either as relevant or
not relevant, whereas relevance is really
a continuous function (a document may be
exactly what the user wants, or it may be
only closely related).
Why System Evaluation?
Precision and Recall
[Figure: Venn diagram of the whole collection (All docs), the set of
Retrieved documents, the set of Relevant documents, and their
intersection RelRetrieved.]

Precision = \frac{|RelRetrieved|}{|Retrieved|}

Recall = \frac{|RelRetrieved|}{|Relevant|}
These definitions of precision and recall are
based on binary relevance judgments, which
means that every retrievable item is
recognizably “relevant” or recognizably “not
relevant”.
Hence, for every search result, each retrievable
document is
(i) either relevant or non-relevant and
(ii) either retrieved or not retrieved.
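To make these definitions concrete, here is a minimal Python sketch of the set-based computation (not part of the original slides; the document ids in the example are made up):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall under binary relevance judgments.

    retrieved: set of document ids returned for a query
    relevant:  set of document ids judged relevant for that query
    """
    rel_retrieved = retrieved & relevant  # RelRetrieved = Retrieved ∩ Relevant
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


# Example: 3 of the 4 retrieved documents are relevant,
# out of 6 relevant documents in the collection.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d7", "d8", "d9"})
print(p, r)  # 0.75 0.5
```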
[Figure: trade-off between precision and recall. A system tuned for
high recall returns most of the relevant documents, but includes
lots of junk.]
Test collection approach
The total number of relevant documents in a
collection must be known in order for recall to
be calculated.
To provide a framework for the evaluation of IR
systems, a number of test collections have
been developed (Cranfield, TREC, etc.).
These document collections are accompanied
by a set of queries and relevance judgments.
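A sketch of how a test collection might be represented and used in code, assuming a simple in-memory mapping for the relevance judgments (qrels) and for a system's retrieved sets; all query and document ids below are hypothetical:

```python
# Hypothetical relevance judgments: query id -> set of documents judged relevant.
qrels = {
    "q1": {"d3", "d7", "d12"},
    "q2": {"d1", "d5"},
}

# Hypothetical retrieval results for the same queries.
run = {
    "q1": {"d3", "d7", "d9"},
    "q2": {"d2", "d5"},
}

# Because the full set of relevant documents is known for each query,
# recall can be computed.
for qid, relevant in qrels.items():
    retrieved = run.get(qid, set())
    recall = len(retrieved & relevant) / len(relevant)
    print(qid, "recall =", round(recall, 2))
```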
IR test collections
Collection   Number of documents   Number of queries
Cranfield              1,400                   225
CACM                   3,204                    64
CISI                   1,460                   112
LISA                   6,004                    35
TIME                     423                    83
ADI                       82                    35
MEDLINE                1,033                    30
TREC-1               742,611                   100
Fixed Recall Levels
One way to evaluate is to look at average
precision at fixed recall levels
• Provides the information needed for
precision/recall graphs
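One common way to realize this (a sketch, not prescribed by the slides) is the 11-point interpolated precision, where the precision at each standard recall level 0.0, 0.1, ..., 1.0 is the maximum precision observed at any recall greater than or equal to that level:

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0).

    ranking:  list of document ids in the order the system returned them
    relevant: set of document ids judged relevant for the query
    """
    # (recall, precision) measured after each relevant document in the ranking
    points, hits = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))

    # Interpolated precision at recall level r = max precision at any recall >= r.
    interpolated = []
    for level in (i / 10 for i in range(11)):
        candidates = [p for rec, p in points if rec >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated


# Relevant documents at ranks 1 and 4 of a 5-document ranking:
print(eleven_point_precision(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}))
# [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5]
```

Averaging these 11 values over a set of queries gives the data points for a precision/recall graph.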
Document Cutoff Levels
Another way to evaluate:
• Fix the number of documents retrieved at several levels:
• top 5
• top 10
• top 20
• top 50
• top 100
• Measure precision at each of these levels
• Take (weighted) average over results
This approach focuses on how well the system
ranks the first k documents.
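A minimal sketch of precision at fixed document cutoffs, using the cutoff levels listed above (the weighted averaging over queries is left out):

```python
def precision_at_cutoffs(ranking, relevant, cutoffs=(5, 10, 20, 50, 100)):
    """Precision within the top-k retrieved documents, for several cutoffs k.

    ranking:  list of document ids in ranked order
    relevant: set of document ids judged relevant for the query
    """
    return {k: sum(1 for doc in ranking[:k] if doc in relevant) / k
            for k in cutoffs}


# Example: relevant documents appear at ranks 1, 3 and 8 of a 100-document ranking.
ranking = ["d%d" % i for i in range(1, 101)]
print(precision_at_cutoffs(ranking, {"d1", "d3", "d8"}))
# {5: 0.4, 10: 0.3, 20: 0.15, 50: 0.06, 100: 0.03}
```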
Normalized Recall
Let N_{Rrel} be the number of relevant documents
retrieved by the system, and let the average rank
(AR) over the set of relevant documents retrieved
by the system be:

AR = \frac{1}{N_{Rrel}} \sum_{r=1}^{N_{Rrel}} Rank_r

where Rank_r is the rank of the r-th relevant
document.
Let IR be the ideal average rank, i.e. the value of
AR when the relevant documents occupy the top
N_{Rrel} ranks, so that IR = \frac{N_{Rrel}+1}{2}.
The difference AR - IR is a measure of the
effectiveness of the system.
This difference ranges from 0 (for perfect
retrieval) to (N - N_{Rrel}) for worst-case
retrieval, where N is the total number of
documents in the collection.
The expression AR - IR can be normalized by
dividing it by (N - N_{Rrel}) and subtracting the
result from 1, giving the normalized recall (NR):

NR = 1 - \frac{AR - IR}{N - N_{Rrel}}
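A small Python sketch of this normalized recall computation (the ranks in the example are made up):

```python
def normalized_recall(ranks_of_relevant, n_docs):
    """Normalized recall from the ranks of the retrieved relevant documents.

    ranks_of_relevant: 1-based ranks at which the relevant documents appear
    n_docs:            total number of documents N in the collection
    """
    n_rel = len(ranks_of_relevant)
    ar = sum(ranks_of_relevant) / n_rel  # average rank AR
    ir = (n_rel + 1) / 2                 # ideal average rank IR
    return 1 - (ar - ir) / (n_docs - n_rel)


# Example: 3 relevant documents retrieved at ranks 1, 3 and 10
# in a collection of 100 documents.
print(normalized_recall([1, 3, 10], 100))  # ≈ 0.97 (1.0 would be perfect retrieval)
```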