Computing Reviews

TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)
Voorhees E., Harman D., The MIT Press, 2005. 368 pp. Type: Book

Date Reviewed: Dec 7 2006

Categories: Information Search And Retrieval (H.3.3); Digital Libraries (H.3.7); Large Text Archives (H.3.6); Library Automation (H.3.6); System Management (K.6.4); Systems And Software (H.3.4)

The Text REtrieval Conference (TREC), coordinated by the US National Institute of Standards and Technology (NIST), is the largest information retrieval (IR) experimentation effort in existence. Starting with TREC-1 in 1992, and continuing yearly, TREC gives participating groups the opportunity to have their IR systems compete in several IR experiments, called tracks. TREC has had a big influence on research in particular approaches to IR: tracks have often initiated small research communities around a problem, and TREC has occupied a large segment of the IR community as a whole. Thus, whatever one may think about the TREC approach to IR testing, a book detailing the methods used and results achieved (through 2003) is important. This book is a useful overview for researchers in the field, a must-read for prospective TREC participants, and a glimpse into a world of research for graduate students.

The book has three parts and an epilogue. Part 1 presents the essentials. TREC is based on the Cranfield paradigm [1]. Chapter 1 quotes the oft-repeated Cranfield “conclusion” that “using words in the texts themselves was very effective” (page 3). What Cranfield actually showed, however, is that systems that compute expected relevance scores by matching query words with title words agree, to a great extent, with human judges, whose relevance judgments are heavily influenced by the match of query words with title words (hardly a surprising result). This can be seen in side studies on the nature of the relevance judgments [1]. Chapter 2 describes the test collection corpus, the creation of topics (descriptions of information needs), and the relevance judgments. I would have liked to see more about the instructions given to relevance judges, and thus the nature of the relevance judgments, which are a crucial element. Chapter 3 discusses retrieval performance measures, with a focus on the monolingual English ad hoc track in TREC-1 and TREC-2.
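
To make the flavor of these measures concrete, here is a minimal sketch in Python of two standard ranked-retrieval measures of the kind chapter 3 covers: precision at a cutoff and average precision for a single topic. The sketch is mine, not the book's, and the document IDs and relevance judgments are invented.

    def precision_at_k(ranking, relevant, k):
        """Fraction of the top-k retrieved documents that are relevant."""
        return sum(1 for d in ranking[:k] if d in relevant) / k

    def average_precision(ranking, relevant):
        """Mean of the precision values at the ranks where relevant documents appear."""
        hits, total = 0, 0.0
        for rank, d in enumerate(ranking, start=1):
            if d in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0

    # Hypothetical ranking for one topic; the IDs are made up.
    ranking = ["d3", "d7", "d1", "d9", "d2"]
    relevant = {"d3", "d2", "d8"}
    print(precision_at_k(ranking, relevant, 5))   # 0.4
    print(average_precision(ranking, relevant))   # (1/1 + 2/5) / 3, about 0.467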

Part 2 (chapters 4 through 10) reports on the various TREC tracks. A track is a specific
experiment, defined by, first, the type of task (ad hoc retrieval, filtering,
question-answering, and so on); second, the type of material (printed text, spoken text,
images, music, and so on); third, the presence of errors in the text (from optical character
recognition (OCR) or automatic speech recognition); fourth, whether the data is
monolingual or cross-lingual; and, fifth, the language(s) involved (most tracks are
monolingual (English)). Each track report describes, over the life of the track, the specific
task, the assembly and size of the test collection, the participants, the methods and
evaluation measures used, and the results achieved (unfortunately, not including the
overlap in retrieval by the different systems).
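
As a way of keeping the five defining dimensions straight, the following small record type is my own structuring of the paragraph above, not anything defined in the book; the example instance is the classic monolingual English ad hoc setting.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Track:
        task: str                   # e.g., "ad hoc", "filtering", "question answering"
        material: str               # e.g., "printed text", "spoken text", "images"
        noisy_text: bool            # errors from OCR or speech recognition present?
        cross_lingual: bool         # queries and documents in different languages?
        languages: Tuple[str, ...]  # the language(s) involved

    # Hypothetical example instance.
    ad_hoc = Track(task="ad hoc", material="printed text",
                   noisy_text=False, cross_lingual=False, languages=("English",))
    print(ad_hoc)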

In Part 3 (chapters 11 through 17), selected participants report on their work at TREC. Parts
2 and 3 give complementary views of the work at TREC. Think of a table, with a column for
each track and a row for each participating group. A chapter in Part 2 reports on the total
work in a column (a track), both globally and by participant (a cell in the table); a chapter
in Part 3 reports on the total work in a row (a research group), both globally and by track (a
cell in the table). There should be more cross-references between Parts 2 and 3 to connect
information on the same table cell given in different places.

The epilogue, by Karen Sparck-Jones, is a “metareflection” on what was learned from TREC,
and on the future of TREC. It is not, nor can it be, a summary and inventory of IR
techniques, evaluation methods, and results emanating from TREC; it is, rather, a
high-level summary, commentary, and development of vision, particularly with regard to
the Web, intranets, and digital libraries. The “Reference Summary” and “TREC Messages”
sections should have been integrated into chapter 1, to provide better structure and a
high-level perspective from the outset. These sections make some bold claims about the
success of fully automated methods in IR, but these claims are not supported (see below).
Sections 7 and 8, short but most important, paint a vision of an integrated information

management system that lets the user execute and combine several tasks, such as
document retrieval, information extraction, topic detection and tracking, (multi-document)
summarization, and translation, and suggests that TREC establish a “common,
multi-purpose evaluation framework.” I could not agree more.

The book is more of a collection of independent chapters than an integrated whole, resulting
in redundancies and inconsistencies. The track reports lack a common format. There are
many inconsistencies in the notation used for the formulas for term weighting and
document scoring. The same quantities, such as term frequency within a document, are
designated with different symbols, making understanding and comparing these formulas
unnecessarily difficult. Results are often given as a family of recall-precision curves labeled
by research group rather than by IR technique used, which is what really matters; the
reader must make this connection by laboriously checking in the text. Throughout the book,
a better layout (for example, of bulleted lists) would support faster reading and better
comprehension.
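
To illustrate what a single, consistent notation would buy the reader, here is one common convention for a basic term weight, sketched in Python; this is an assumption for illustration, not the book's notation or any particular participant's formula. Here tf is the term's frequency in the document, df its document frequency, and N the number of documents in the collection.

    import math

    def tf_idf(tf, df, N):
        """One standard variant: logarithmic term frequency times inverse document frequency."""
        if tf == 0 or df == 0:
            return 0.0
        return (1 + math.log(tf)) * math.log(N / df)

    # Hypothetical numbers: a term occurring 3 times in the document and in
    # 100 of 500,000 collection documents.
    print(tf_idf(tf=3, df=100, N=500_000))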

The book provides a great deal of detail about the work in TREC, and its historical evolution,
but a much more systematic and formalized presentation would be needed to let the reader
see the large picture. For example, an overview table showing the different retrieval
methods used across tracks in different years would be very useful, as would be a table of
evaluation measures used by track and year. (For a very broad-brush overview of basic
techniques and results through TREC-6, see Sparck-Jones’ papers [2,3]).

There are some issues with TREC and the claims made in the book; TREC has fundamental
limitations that, while sometimes acknowledged, are often forgotten when stating results.
The test collection corpora have been assembled primarily based on availability; they are
dominated by news, and are not representative of much else (even though several other
document types are included). This is critical: TREC does not support claims for text
retrieval in general, but only sharply limited claims on retrieving news items from
newspaper text.

Test topics are also problematic; it is unknown whether they are representative of all
possible user topics. Topics induce more variance in retrieval performance than systems
(page 94 and elsewhere). TREC’s comparison of systems on average performance over a set
of topics hides the real story; what is needed is research into the reasons for the
topic-to-topic differences in performance, and methods for adapting IR systems to the
topics at hand (as suggested by the SMART team, page 313), which includes finding out
which system does well with which kind of topic. A similar problem of adaptation is ignored
in TREC’s use of a single measure of performance for a given topic, when, in reality,
different users have different requirements with respect to recall, precision, and other
performance characteristics, and systems should be evaluated on their ability to adapt to
specific user requirements. TREC relevance judgments are problematic. For example, IBM
cites inconsistencies in judging as “undercutting” their approach of using hypernyms from
WordNet for answering “What is” questions (page 412). Finally, TREC takes the query
statement (topic statement) for granted, and has each system work from the same
statement. There is wide intuitive agreement (but no empirical proof) that formulating the
right query is half the battle in IR. So, a system could improve users’ success by helping
them to understand and state their information need, and then formulate it properly. This
very important system function is ignored in TREC.
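
The point about averages can be made concrete with a small, entirely hypothetical sketch: two invented systems with identical mean average precision over a topic set can behave very differently topic by topic, which is precisely what a per-topic analysis would surface.

    from statistics import mean

    per_topic_ap = {
        "system_A": [0.80, 0.10, 0.75, 0.15],  # strong on some topics, weak on others
        "system_B": [0.45, 0.45, 0.45, 0.45],  # uniformly mediocre
    }

    for system, scores in per_topic_ap.items():
        print(system, "MAP =", mean(scores))   # both about 0.45

    # The per-topic breakdown, not the mean, shows which system suits which topic.
    best = [max(per_topic_ap, key=lambda s: per_topic_ap[s][t]) for t in range(4)]
    print(best)   # ['system_A', 'system_B', 'system_A', 'system_B']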

TREC does not measure absolute retrieval performance, but merely compares the
performance of participating systems. This is the justification for limiting relevance
judgments to documents found by the participating systems (the pool). Since almost all
systems use approaches based on words in the text, what is really being measured is the
overlap of relevant documents found by essentially similar systems, which might leave out
whole classes of relevant documents. This methodology does not support claims of absolute
retrieval performance, which is what users are interested in.
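
For readers unfamiliar with pooling, a minimal sketch of the idea follows; the system names, document IDs, and pool depth are hypothetical. The judgment pool is the union of the top-k documents returned by each participating system, and only pooled documents are judged for relevance.

    def build_pool(runs, depth=100):
        """runs maps a system name to its ranked list of document IDs."""
        pool = set()
        for ranking in runs.values():
            pool.update(ranking[:depth])
        return pool

    runs = {
        "sys1": ["d1", "d2", "d3"],
        "sys2": ["d2", "d4", "d5"],
    }
    print(sorted(build_pool(runs, depth=2)))   # ['d1', 'd2', 'd4']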

TREC collections, topics, and protocols change from year to year in order to address
new problems in IR research. This makes longitudinal studies of changes in retrieval
effectiveness difficult, putting further into question the claim that “retrieval effectiveness
approximately doubled during those eight years” (1992-1999) (chapter 4, page 79, and
elsewhere).

Real understanding of how IR systems work often gets buried in the quest to make yet
another ad hoc refinement to a weighting formula. Real understanding requires examining
topic variables (mentioned in the book) and document variables (barely mentioned), in
conjunction with system variables, to explain retrieval results, and then conducting a careful
analysis of successes and failures, looking for patterns, as done in Lancaster’s paper [4].

There is no question that TREC has had a considerable influence on research in information
retrieval: all of the chapters in Part 3 attest to that. While entering a competition may have
enticed research groups to participate, the system rankings (which are based on not very
meaningful averages) may have been less important than what each group learned from its
own experiments, and from discussions with other groups. For example, IBM states, “Of
more ultimate value to IBM was qualitative evidence for how people interpreted the Web
search syntax” (page 407). The testing environment described above would emphasize
learning and the interchange of ideas based on detailed study of experimental results,
rather than competition between systems. Combined with Karen Sparck-Jones’ vision, this
would chart a course for TREC toward a new level, and a broader scope of research, in
information retrieval.

Reviewer: D. Soergel
Review #: CR133676

1) Cleverdon, C.; Mills, J.; Keen, M. Factors determining the performance of indexing systems. Volume 1: Design; Volume 2: Test results. Aslib Cranfield Research Project, Cranfield, UK, 1966.

2) Sparck-Jones, K. Reflections on TREC. Information Processing and Management 31, 3 (1995), 291–314.

3) Sparck-Jones, K. Further reflections on TREC. Information Processing and Management 36, 1 (2000), 37–85.

4) Lancaster, F.W. MEDLARS: report on the evaluation of its operating efficiency. American Documentation 20, 2 (1969), 119–142.

Other reviews under "Information Search And Retrieval":

Google's Pagerank and beyond: the science of search engine rankings. Langville A., Meyer C., Princeton University Press, Princeton, NJ, 2006. 234 pp. Type: Book. (Dec 6 2006)

Temporal pre-fetching of dynamic Web pages. Lam K., Ngan C. Information Systems 31(3): 149-169, 2006. Type: Article. (Sep 6 2006)

Similarity search: the metric space approach (Advances in Database Systems). Zezula P., Amato G., Dohnal V., Batko M., Springer-Verlag New York, Inc., Secaucus, NJ, 2005. 220 pp. Type: Book. (Aug 31 2006)
