

Proceedings of the SIGIR 2012 Workshop on
Open Source Information Retrieval
Held in Portland, Oregon, USA,
16th August 2012

Edited by
Andrew Trotman,
Charles L. A. Clarke,
Iadh Ounis,
J. Shane Culpepper,
Marc-Allen Cartright,
and
Shlomo Geva.
Proceedings of the
SIGIR 2012 Workshop on
Open Source Information Retrieval.

Held in Portland, Oregon, USA,


16th August 2012.

Published by:
Department of Computer Science,
University of Otago,
PO Box 56,
Dunedin,
New Zealand.

Editors:
Andrew Trotman,
Charles L. A. Clarke,
Iadh Ounis,
J. Shane Culpepper,
Marc-Allen Cartright,
and
Shlomo Geva.

Paper Edition: ISBN 978-0-473-22025-9


Electronic Edition: ISBN 978-0-473-22026-6

http://opensearchlab.otago.ac.nz/
Copyright of the works contained in this volume remains with the authors

Preface
These proceedings contain the papers of the SIGIR 2012 Workshop on Open Source
Information Retrieval held in Portland, Oregon, USA, on the 16th of August 2012.
Six full papers and six short papers were selected by the program committee from thirteen
submissions (92% acceptance rate). Each paper was reviewed by three members of the
international program committee. In addition to these selected papers, invited talks were
given by Grant Ingersoll “OpenSearchLab and the Lucene Ecosystem” and Jamie Callan
“The Lemur Project and its ClueWeb12 Dataset”. We thank them for their special
contributions.
When reading this volume it is necessary to keep in mind that these papers represent the
opinions of the authors (who are trying to stimulate debate). It is the combination of these
papers and the debate that will make the workshop a success. We would like to thank the
ACM and SIGIR for hosting us. Thanks also go to the program committee, the paper authors,
and the participants, for without these people there would be no workshop.

Andrew Trotman,
Charles L. A. Clarke,
Iadh Ounis,
J. Shane Culpepper,
Marc-Allen Cartright,
and
Shlomo Geva

Workshop Organisation
Program Chairs
Andrew Trotman
Charles L. A. Clarke
Iadh Ounis
J. Shane Culpepper
Marc-Allen Cartright
Shlomo Geva

Program Committee
Andrew Trotman
C. Lee Giles
Charles L. A. Clarke
Claudia Hauff
Craig Macdonald
David Hawking
Djoerd Hiemstra
Giambattista Amati
Iadh Ounis
J. Shane Culpepper
Jamie Callan
Marc Cartright
Michael Stack
Michel Beigbeder
Richard Boulton
Samuel Huston
Shlomo Geva

Table of Contents

Full Papers

WikiQuery – An Interactive Collaboration Interface for Creating, Storing and Sharing Effective CNF Queries …………… 1
L. Zhao, X. Liu, J. Callan

ezDL: An Interactive Search and Evaluation System …………… 9
T. Beckers, S. Dungs, N. Fuhr, M. Jordan, S. Kriewel

Apache Lucene 4 …………… 17
A. Białecki, R. Muir, G. Ingersoll

Galago: A Modular Distributed Processing and Retrieval System …………… 25
M.-A. Cartright, S. Huston, H. Feild

A Framework for Bridging the Gap Between Open Source Search Tools …………… 32
M. Khabsa, S. Carman, R. Choudhury, C. L. Giles

Towards an Efficient and Effective Search Engine …………… 40
A. Trotman, X.-F. Jia, M. Crane

Short Papers

SMART: An Open Source Framework for Searching the Physical World …………… 48
M.-D. Albakour, C. Macdonald, I. Ounis, A. Pnevmatikakis, J. Soldatos

First Experiences with TIRA for Reproducible Evaluation in Information Retrieval …………… 52
T. Gollub, S. Burrows, B. Stein

Design, Implementation and Experiment of a YeSQL Web Crawler …………… 56
P. Jourlin, R. Deveaud, E. Sanjuan-Ibekwe, J.-M. Francony, F. Papa

From Puppy to Maturity: Experiences in Developing Terrier …………… 60
C. Macdonald, R. McCreadie, R. L. T. Santos, I. Ounis

Yet Another Comparison of Lucene and Indri Performance …………… 64
H. Turtle, Y. Hegde, S. A. Rowe

Phonetic Matching in Japanese …………… 68
M. Yasukawa, J. S. Culpepper, F. Scholer

WikiQuery -- An Interactive Collaboration Interface for Creating, Storing and Sharing Effective CNF Queries

Le Zhao (Carnegie Mellon University), Xiaozhong Liu (Indiana University, Bloomington), Jamie Callan (Carnegie Mellon University)
ABSTRACT
Boolean Conjunctive Normal Form (CNF) expansion can effectively address the vocabulary mismatch problem, a problem that current retrieval techniques have very limited ability to solve. Meanwhile, expert searchers are found to spend large amounts of time carefully creating manual CNF queries. These CNF queries are highly effective, and can outperform bag of word queries by a large margin. However, not many effective tools exist that can facilitate the efficient manual creation of effective CNF queries. We describe such a publicly available search tool, WikiQuery, which can efficiently assist the users to create CNF queries through easy query editing and immediate access to search results. Experiments show that ordinary search users, with limited prior knowledge of Boolean queries, can use this intuitive tool to create effective CNF queries. We argue that tools like WikiQuery can attract and retain certain users from the commercial Web search engines, and may be a good starting point to build a research Web search engine.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]

General Terms
Theory, Experimentation, Measurement

Keywords
Wiki for queries, conjunctive normal form (CNF) queries, query refinement, user interactions

The copyright of this article remains with the authors.
SIGIR 2012 Workshop on Open Source Information Retrieval.
August 16, 2012, Portland, Oregon, USA.

1. INTRODUCTION
One particular goal of the Open Source Information Retrieval workshop is to build "an open source, live and functioning, online web search engine for research purposes". A key factor necessary for the success of such an effort is to attract and retain users.
In order to attract users, the search engine needs to have a distinct and useful feature that is not offered by the current search engines. As a somewhat negative example, the Lemur community query log project did not collect enough query log data, perhaps due to the lack of any additional benefit provided by the query log toolbar (http://lemurstudy.cs.umass.edu/). Compared to the toolbar, a full scale open source search engine is even more likely to fail, as the quality of the results from such an academic search engine is likely to be much worse than that from the commercial Web search engines.
In order to retain users, it is perhaps necessary that the distinct feature is unlikely to be copied by the competitors (the commercial Web search engines).
This paper describes one such publicly available open source search tool, WikiQuery (http://www.wikiquery.org), which both engages ordinary searchers in effective search interactions, and is unlikely to be adopted by the commercial Web search engines. WikiQuery can provide more effective search interactions than what the current search engines can offer, and is flexible enough to be applied on top of virtually any Web search engine.
Prior research showed that the current retrieval techniques are still very limited in their ability to solve the vocabulary mismatch problem [13]. Users are still frequently frustrated by the current search engines when performing informational searches [5]. Prior research also indicated that high quality manually created Conjunctive Normal Form (CNF) queries offer the opportunity to address this limitation and significantly improve retrieval beyond the traditional bag of word queries [14]. A huge potential for improvement, on the scale of 50-300%, is possible with carefully manually created CNF queries [14].
The WikiQuery interface is designed to guide and facilitate users to create highly effective CNF queries efficiently through 1) a simple CNF input interface, 2) immediate inspection and interaction with search results from multiple commercial search engines, and 3) collaboration with other users who share related information needs. The created queries are stored, and readily available for future re-finding or refining. The queries are also shared online so that other users may benefit from the queries or query parts. Being a Wiki website, different users can collaborate and improve queries together. This interface is implemented based on the MediaWiki source code, which allows the users to search for pages or information stored on the website, so that it is easy to look up, share or collaborate on the website.
User studies in this work show that ordinary search users with limited knowledge of Boolean queries have the potential to use the WikiQuery interface to create effective CNF queries. WikiQuery has the potential to attract and retain users for two reasons. Firstly, the CNF query interface is effective and intuitive, and can appeal to the ordinary search users -- at least the early adopters who are willing to learn a new and effective way to formulate search queries, or the more serious users who care about their searches. Secondly, the commercial Web search engines are very unlikely to adopt the CNF interface, because the change in user experience is large enough to scare away the change-averse users, making it very risky to use for a large Web search engine.
In addition to the added benefit of facilitating search interactions, the resulting crowdsourced CNF queries stored on the WikiQuery website also constitute a detailed context dependent thesaurus for retrieval and other vocabulary tasks.
The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 describes the WikiQuery website together with its CNF query interface. Section 4 reports the studies showing that ordinary users can create effective CNF queries with the proper tool and guidance. Section 5 concludes the paper.
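As used throughout the paper, a CNF query is a Boolean AND of "conjuncts", each conjunct being an OR of alternative terms for one concept. As a compact restatement of that definition (the notation is only illustrative; the paper itself uses the operator form shown in Section 2.1), a CNF query Q over m conjuncts can be written as

    Q = \bigwedge_{i=1}^{m} \Big( \bigvee_{j=1}^{n_i} t_{i,j} \Big)

where t_{i,j} is the j-th alternative term (or phrase) of the i-th conjunct and n_i is the number of alternatives in that conjunct.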
2. RELATED WORK
This section reviews prior work related to three aspects of this work: the uses of Boolean CNF queries, Boolean user interfaces, and the use of user generated Boolean queries as a resource for thesaurus building. We also discuss how this research is different from prior efforts.

2.1 Uses of CNF Queries
Prior research on effective uses and formulations of Boolean CNF queries motivates this research. The use of Conjunctive Normal Form (CNF) queries is widespread among librarians [9,6], lawyers [3,2], and other expert searchers [7,4,11].
For example, the query below from TREC 2006 Legal Track [2]

    "sales of tobacco to children"

is expanded manually into the Boolean CNF query

    (sales OR sell OR sold) AND
    (tobacco OR cigar OR cigarettes) AND
    (children OR child OR teen OR juvenile OR kid OR adolescent)

In the above case, each query term is expanded into one conjunct of the Conjunctive Normal Form query.
Earlier research on Boolean queries examined unranked Boolean retrieval, and showed that ranked keyword retrieval is more effective, mainly because presenting retrieval results as a set is both difficult to control and inefficient to examine. Later research compared ranked Boolean with keyword retrieval, showing that user created CNF queries can significantly improve over keyword retrieval by simply grouping the query terms of the verbose keyword queries into Conjunctive Normal Form [7,11].
More recent research showed that lawyers and search experts can create highly effective CNF queries that extensively expand the original keyword queries, solving mismatch and improving retrieval 50-300% [14]. These CNF queries with high quality expansion terms were shown to outperform bag of word expansion with the same set of high quality expansion terms.

2.2 Boolean Search Interfaces
Even though carefully created CNF queries are effective, recent research has focused on bag of word queries, and has not seen much development in interfaces that help users create effective CNF queries. Research on Boolean user interfaces happened mostly before the mid 1990s. Hearst [1, Chapter 10] cited several textual as well as graphical Boolean interfaces. Hearst referred to CNF queries as faceted queries, and described a possible textual input interface for CNF queries, though without a concrete example. In a newer book, Hearst [8] cited the advanced search interface of the Educational Resources Information Center (ERIC, http://www.eric.ed.gov/ERICWebPortal/search/extended.jsp, accessed on June 1st, 2012), which allows the entry of CNF queries in a one-conjunct-per-line format. This is similar to the CNF interface of WikiQuery except for two differences. Firstly, the ERIC interface is not specifically designed for CNF queries, and allows the user to enter a query in Disjunctive Normal Form. Secondly, the ERIC interface gives no guidance or useful examples to the user on how to create effective Boolean queries.
The lack of research on Boolean interfaces is coupled with a long list of negative results [8, Section 4.4] showing that ordinary users have a difficult time formulating effective Boolean queries. This work, on the contrary, shows that ordinary search users with limited knowledge of Boolean queries have the potential to create effective Boolean CNF queries using the WikiQuery interface. This apparent contradiction is likely because the prior studies did not focus on Boolean CNF queries, and gave novice users the full freedom of free form Boolean queries without proper guidance. This choice leaves the creation of effective Boolean queries to chance, and is likely to lead to ineffective Boolean queries. Our results point at a promising direction of designing search interfaces that guide and facilitate users to formulate effective Boolean queries in CNF form.

2.3 Online Thesaurus Building
The resulting CNF queries created by users and stored in WikiQuery can serve as a thesaurus for future users. In particular, each conjunct in the CNF queries contains synonyms or related terms that are dependent on the context of the query. Compared to existing thesauri like WordNet, the WikiQuery synonyms depend on the specific uses of a term in a query, while WordNet is still a static semantic resource without regard to word use.
The thesaurus building aspect of the WikiQuery website is similar to an earlier system that builds a growing thesaurus based on users' Boolean retrieval interactions [12]. The main difference is the emphasis on CNF queries by WikiQuery. WikiQuery also treats individual queries as valuable resources, and as units for storage and retrieval. This is a fairly lazy and ad hoc treatment for a thesaurus. Later more general treatments can build on top of the queries stored on WikiQuery, when it becomes clear what kinds of general treatments are most appropriate.

3. THE WIKIQUERY WEBSITE
The search tool described in this work is a public Wiki website based on the same source code that supports Wikipedia and similar sites. On the WikiQuery website, each Wiki page stores all the information about one particular user information need, including possibly a description of the information need, the corresponding CNF query (or several related CNF queries), possible relevant results (together with descriptions) identified through the search interactions, or other related information.
An example WikiQuery page is shown in Figure 1. The main CNF query of the page and the links to the search engine result pages from multiple search engines are circled out.
The open source MediaWiki code (http://www.mediawiki.org) offers the standard set of features used in popular Wiki websites. One useful function allows the users to search for pages or information stored on the website through entering a search query in the search box. In addition, being based on MediaWiki version 1.17, the Wiki website automatically suggests existing WikiQuery pages as the user types into the search box. Other features include history tracking of all the user edits of pages and users, subscribing to a page to monitor changes made to the page, and the opportunity of discussion among contributors of a Wiki page.
Several simple customizations were made to accommodate the special user needs for the WikiQuery website, including 1) a simple textual interface for CNF query editing, 2) an automatic client side script to display the query and store it in the Wiki page, and 3) an automatic script that translates the CNF query into the formats accepted by common Web search engines, allowing immediate inspection of the retrieval results produced by the CNF queries. The rest of this section covers these customizations in more detail.
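To make these customizations concrete, a CNF query can be represented as a list of conjuncts, each a list of alternative terms or phrases. The minimal Python sketch below is purely illustrative: the names and rendering rules are assumptions, not WikiQuery's actual client-side Javascript. It renders such a structure both as the fully parenthesized Boolean form used in Section 2.1 and as a query string for engines that, like Google, commonly treat whitespace as an implicit AND and support an OR keyword and quoted phrases.

# A CNF query as a list of conjuncts; each conjunct lists alternative
# terms, with multi-word entries treated as phrases.  The example is the
# TREC 2006 Legal Track query from Section 2.1.
cnf = [
    ["sales", "sell", "sold"],
    ["tobacco", "cigar", "cigarettes"],
    ["children", "child", "teen", "juvenile", "kid", "adolescent"],
]

def quote_if_phrase(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as phrases."""
    return f'"{term}"' if " " in term else term

def to_boolean_string(conjuncts) -> str:
    """Render the fully parenthesized CNF form, e.g. (a OR b) AND (c OR d)."""
    groups = ["(" + " OR ".join(quote_if_phrase(t) for t in c) + ")"
              for c in conjuncts]
    return " AND ".join(groups)

def to_web_query(conjuncts) -> str:
    """Render a query string for engines where whitespace means AND and
    OR is an explicit operator (the AND keyword is left implicit)."""
    groups = ["(" + " OR ".join(quote_if_phrase(t) for t in c) + ")"
              for c in conjuncts]
    return " ".join(groups)

if __name__ == "__main__":
    print(to_boolean_string(cnf))
    print(to_web_query(cnf))

Supporting an additional engine would then amount to adding one more rendering function for that engine's operator syntax.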
[Figure 1. An example Wiki page from the WikiQuery Website. Circled out are (1) the Conjunctive Normal Form (CNF) query of this Wiki page and (2) the links to the search engine result pages for the CNF query.]

3.1 Interface for CNF Query Editing
The CNF query interface is presented to the user when the user edits a Wiki page. It guides the user and allows the user to easily and efficiently create or edit CNF queries. As shown in Figure 2, this interface consists of several input bars, each corresponding to one conjunct in the CNF query. The user has to determine how many and what concepts (conjuncts) are necessary for the particular information need. Then the user has to enter in each input bar the search terms that can be used to describe the concept, and join them with the Boolean OR operator. Prior research [14] indicated that including more high quality expansion terms in each conjunct yields a higher likelihood for the conjunct to match the relevant documents of the query, and leads to a higher retrieval accuracy.
The CNF queries are stored on the wiki to allow users to revisit existing queries, and to further improve the queries. Because refinding tasks are fairly common in Web search, a user might frequently find the stored queries to be helpful at a future time.
Whenever the user edits an existing WikiQuery page that already contains a full CNF query, the CNF input bars are automatically populated with the content of the CNF query, so that the user does not have to enter the query into the input bars again.
The collaborative nature of the Wiki website also allows different users to collaborate and edit the same WikiQuery page of common interest to these users. For popular information needs, collaborations across multiple users are likely to improve the quality of the CNF queries beyond what a single user may achieve. Because high quality CNF queries can take lots of effort to create, collaboration offers the possibility to break down the difficulty through sharing it among a group of users.
This Boolean CNF interface is different from the typical interfaces used in advanced searches in libraries or by lawyers in legal discovery. The advanced search interfaces in libraries (e.g. Library of Congress) allow a restricted Boolean query of the form: Term1 Op1 Term2 Op2 ..., where a Term can be a word or a phrase, but cannot be a Boolean clause, and an Op is a Boolean operator (e.g. AND, OR, XOR, and NOT). The WikiQuery CNF interface is more powerful than the library Boolean interfaces because any Boolean query can be expressed as a CNF query.
The Boolean interface used by lawyers is usually just one large text box. These interfaces are flexible and allow free form Boolean queries to be entered into a, typically large, text box. However, the lawyers typically create CNF-like queries [2], and have to enter the whole query by themselves, having to make sure that the parentheses match and the form is correct. The WikiQuery CNF interface facilitates simpler and more efficient manual creation of CNF-like queries. It breaks down each query by allowing the user to enter each conjunct into one input box. This way, the user does not need to enter the CNF skeleton, nor the conjunct level parentheses. Although this CNF interface suggests the use of CNF-like queries, it does not require that. The user still has the freedom to
enter a free form Boolean query into one input bar in the WikiQuery interface, although this usage is generally discouraged.

[Figure 2. The editing interface of WikiQuery. It includes a Conjunctive Normal Form (CNF) query editing interface and buttons to access search engine result pages. The full CNF query and the HTTP links to the search engine result pages for the query are automatically generated and stored in the page. Labelled in the screenshot: 1. CNF query input interface; 2. Search engine result page; 3. Automatically updated CNF query stored in the Wiki; 4. Links to search engine results.]

3.2 Query Storage and Display
The query that the user enters into the CNF editing interface (Figure 2) is automatically translated into the full form CNF query that is stored and displayed on each Wiki page (as seen in Figure 1). Whenever the user makes a change in one of the input bars in the CNF edit interface, a short Javascript is automatically triggered to translate the contents of the input bars into the full query as part of the content of the Wiki page to be stored.
Because the editing interface uses the document content of the Wiki page to store and display CNF queries, all the benefits from MediaWiki for maintaining the Wiki contents, such as change tracking and searching, automatically apply to the stored CNF queries.

3.3 Querying Search Engines
The WikiQuery website allows the user to access search engine result pages for the CNF queries very easily, during page viewing and editing. Access to the result pages during page view allows viewers of the WikiQuery website to easily check the search engine results for a CNF query. Access to the result pages in the editing interface allows the user to monitor the quality of the search engine results after making changes to the CNF queries, ensuring the quality of the resulting CNF queries.
Figure 1 shows the links displayed on the stored WikiQuery pages during page viewing. Figure 2 shows the buttons that would open a new window to allow the user to navigate to the search engine result pages for the query that the user is editing.
Multiple search engines are supported. These CNF queries can work on any search engine that supports Boolean query operators. Currently WikiQuery employs Google, Google Scholar, Google Patents and Yahoo (Altavista), but it can be easily extended to use other search engines like Bing, which uses a slightly different query language.

4. USER STUDY
Prior research already confirmed the effectiveness of the manual CNF queries, and that expert searchers can create effective CNF queries [14]. The study in this section aims to verify the hypothesis that ordinary users with no or limited prior knowledge of Boolean queries can create effective Boolean CNF queries using the WikiQuery CNF interface.
6 users participated in this preliminary user study, each responsible for 2 information needs. Users typically proposed a series of Boolean queries. We report the retrieval performance of the Boolean queries against the baseline keyword queries.

4.1 Experiment Setup
This subsection describes the details of this experiment, including user selection, information needs, evaluation details, relevance judgments and evaluation methodology.

4.1.1 Classroom Users
Users for this study come from an IR class in an information school. A total of 6 students participated in the study. We purposely launched the study at the beginning of the semester when students were not fully exposed to the professional CNF queries. As participants had little knowledge of Boolean queries or Wiki page editing, a 10-minute session was given before the study,
which included an example information need with a walk through the CNF query creation and editing process on WikiQuery. Because this study counts as one homework assignment for the students, participants were highly motivated to spend time and do well in creating effective CNF queries. Users were also asked to document the detailed query formulation and retrieval experience.

4.1.2 Information Needs
12 topics from TREC Ad hoc and Terabyte tracks were selected as candidate topics. Topics were selected to be somewhat interesting for the students, and reasonably difficult so that the keyword queries were unlikely to return perfect results. Each student was randomly assigned two topics, for which the student would assume the role of the searcher and create queries.
Each TREC topic contains a short title and a long description of the information need. The students used all the available information of the topic to grasp the intent behind the topic. They generated queries for each topic and were encouraged to interact with the search results to improve the proposed queries.
One reason for using standard TREC topics is to use the existing relevance judgments on these datasets to evaluate the user queries.

4.1.3 Baseline Keyword Queries
The keyword queries were directly taken from the TREC topic titles and descriptions. The topic titles are shorter, usually 2 to 4 terms long, and the descriptions are much longer, around 5 to 10 terms long. These two types of keyword queries correspond to the two baselines: keyword title and keyword desc.

4.1.4 User Created Boolean Queries
Users were asked to create Boolean CNF queries using the WikiQuery interface, which allows them to enter the query, to examine the results returned by the search engines for the query, and to improve the query. On average, a user spent around 40 minutes on each information need, based on the recorded history of changes of the WikiQuery pages. For each information need, the users created several queries or improved the CNF query many times. The users were asked to submit two versions of Boolean queries for each information need, an initial Boolean query and a final version of the Boolean query. The initial Boolean query represents a very first try by the user, and the final Boolean query is usually the result of fine tuning the CNF query based on interactions with the retrieval results of all the queries the user tested.

4.1.5 Evaluating on TREC Datasets
Since the information needs come from official TREC Ad hoc track and Terabyte track topics, existing relevance judgments from TREC can be used to evaluate the effectiveness of the user proposed queries. The advantage of using the TREC relevance judgments is that these judgments are fairly reusable, and more complete than just evaluating several top results returned by a search engine. Thus, one can evaluate the result lists returned on the TREC datasets to much deeper levels. TREC standard evaluation metrics report retrieval accuracy of the top 1000 results. One may question why such a deep level of assessment is necessary for Web search where users typically only look at the top several results. We argue that the deeper level metrics are more sensitive to retrieval algorithm differences. Certain retrieval algorithm changes may not surface as top rank result changes on a small set of test topics, but may change the rank list more dramatically at deeper levels for these topics. These deeper level changes may show up at the top ranks on a small subset of topics of a much larger test topic set. At the very least, the deeper level metrics provide an additional perspective on the effectiveness of the rank lists.
The TREC relevance judgments only exist on the TREC document collections, so the queries need to be run against the smaller TREC collections, instead of a Web search engine.
We used the Indri search engine of the Lemur toolkit version 4.10 to execute the Boolean queries on TREC document collections. Indri supports a fairly comprehensive query language. The backend model is a language model with Dirichlet smoothing. The Boolean OR operator is implemented as the Indri #syn operator. #syn counts term frequency and document frequency of the whole group of synonyms by treating all the synonyms in the group as the same term. The Boolean AND operator is implemented as the Indri #combine operator, which is the probabilistic AND operator. This Indri implementation of a Boolean query automatically returns a rank list of documents, instead of an unranked set. This is a more effective form of result presentation than an unranked set of documents, and is widely adopted by modern retrieval systems.
Equations (1, 2) show how Indri scores document d with query (a OR b) AND (c OR e). tf(a, d) is the number of times term a appears in document d. μ is the parameter for Dirichlet smoothing, which is set at 900 for the Ad hoc track datasets and 1500 for the Terabyte track datasets.

    Score( (a OR b) AND (c OR e), d )                                      (1)
      = P( (a OR b) AND (c OR e) | d )
      = P( (a OR b) | d ) * P( (c OR e) | d )

    P( (a OR b) | d )                                                      (2)
      = ( tf(a, d) + tf(b, d) + μ * (P(a | C) + P(b | C)) ) / (length(d) + μ)
      = P(a | d) + P(b | d)    (under Dirichlet smoothing)

4.1.6 Evaluating on Commercial Search Engines
The WikiQuery website allows the users to run the CNF queries they created on commercial Web search engines, which typically have a much larger collection of documents, and the retrieval algorithms are typically more effective than the experimental systems used by researchers. This section tries to evaluate the effectiveness of the CNF queries by running them on commercial search engines.
Given a Boolean CNF query as input, the Web search engines return ranked lists of documents as results. The exact ranking formulae of these search engines are difficult to know. Standard ways of producing ranked retrieval results from Boolean queries include probabilistic Boolean retrieval models, quorum-level ranking [8, Section 4.4.2], or simply using keyword retrieval to rank the documents that match the Boolean query. An empirical comparison of some of them can be found in [14].
Relevance judgments are needed to evaluate the effectiveness of the results returned by the search engines. Users who participated in the study did relevance judgments for each other for the top 5 results returned by the Web search engines (Google and Yahoo). The assessors did not know what query, search engine or rank a particular result page comes from. These user-provided judgments were obtained at the time of the homework assignment (February to March 2011). One of the authors of this paper verified these user provided relevance judgments for accuracy.

4.1.7 Evaluation Metrics
For TREC judgments, we report Mean Average Precision (MAP) for the top 1000 results. It is a standard measure because it is sensitive to rank list changes and also fairly stable when comparing between systems [10]. We also report Precision at top 5 and 10 as measures of retrieval accuracy at top ranks. For evaluation on the search engines, we report Precision at top 5, and MAP at 5, as deeper relevance judgments are not available.
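To make the operator mapping of Section 4.1.5 concrete, the short sketch below (an illustration only; the function names are ours and this is not the tooling used in the study) turns a list of conjuncts into an Indri-style query string of the kind shown in Table 1: Boolean OR becomes #syn, Boolean AND becomes #combine, and multi-word alternatives become #1 ordered-phrase groups.

def term_to_indri(term: str) -> str:
    """Multi-word terms are wrapped in Indri's #1() ordered-phrase operator."""
    words = term.split()
    return f"#1({' '.join(words)})" if len(words) > 1 else term

def conjunct_to_indri(conjunct) -> str:
    """A conjunct (Boolean OR of alternatives) maps onto Indri's #syn;
    a single-alternative conjunct needs no #syn wrapper."""
    rendered = [term_to_indri(t) for t in conjunct]
    return rendered[0] if len(rendered) == 1 else "#syn( " + " ".join(rendered) + " )"

def cnf_to_indri(conjuncts) -> str:
    """Boolean AND of the conjuncts maps onto Indri's probabilistic #combine."""
    return "#combine( " + " ".join(conjunct_to_indri(c) for c in conjuncts) + " )"

# Example: the expanded TREC Legal query from Section 2.1.
cnf = [
    ["sales", "sell", "sold"],
    ["tobacco", "cigar", "cigarettes"],
    ["children", "child", "teen", "juvenile", "kid", "adolescent"],
]
print(cnf_to_indri(cnf))
# Prints a query in the same style as the Table 1 entries, e.g.
# "#combine( #syn( sales sell sold ) #syn( tobacco cigar cigarettes ) ... )"

Scoring such a query then follows Equations (1)-(2): #combine multiplies the Dirichlet-smoothed conjunct probabilities, and #syn pools the term counts of a group before smoothing.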
Table 1. The differences between short keyword and the final Boolean CNF query for each TREC topic. The types of changes include restricting the query (by including more conjuncts or using the phrase operator to require query terms to occur close together) and expanding the query terms by including more synonyms (highlighted by shading the topic number).

Topic No. / Type of change | Query with changes from short keyword to final Boolean (bolded are insertions and slashed are removals)
352 / both      #combine( #syn( #1(British Chunnel) #1(Channel Tunnel) ) #syn( #1(effect on) changes ) #syn( #1(#syn(British UK) economy) #1(economic #syn(implications changes evaluation)) ) impact)
354 / expand    #combine( #syn(reporter newswriter journalist correspondent) #syn(arrested hostage #1(physical attack) killed threatened kidnapped murdered attack shot) risks )
704 / restrict  #combine( #1(green party) #syn(US #1(united states)) #syn( #1(political views) politics) )
751 / expand    #combine( scrabble #syn(players group) #syn(social events) )
752 / restrict  #combine( #syn(location places countries) dam removal Environmental impact reason )
753 / restrict  #combine( bullying prevention programs in schools #syn(classes assemblies discipline mediation projects) #syn(Students staff) )
758 / expand    #combine( embryonic stem cells #syn(restrictions law policy) )
760 / both      #combine( statistics #1(in America) Muslims #syn( population demographics ) #syn(mosque #band(Islamic center)) school )
764 / restrict  #combine( measures improve public transportation increase mass transit use )
769 / restrict  #combine( #1(Kroll Associates) employee names )
799 / restrict  #combine( type of animals Alzheimer research) )
805 / restrict  #combine( identity theft passport help victims identify establish credit worthiness show #syn(creditors #band(law enforcement)) )
Note: #combine is Indri's probabilistic AND operator, #syn is Boolean OR, #1 is the phrase operator, and #band is the Boolean AND.

4.2 Experiment Results
This section reports the characteristics of the user generated CNF queries, and the effectiveness of these CNF queries compared against the keyword query baselines.

4.2.1 Characteristics of the User CNF Queries
Query characteristics varied a lot across different users: 2 to 6 conjuncts for the initial Boolean queries, each conjunct containing 1 to 5 synonyms. The final Boolean queries were expanded a bit more, with 2 to 6 conjuncts, each containing 1 to 9 synonyms. Table 1 shows how the users modified the original short keyword queries into the final Boolean CNF queries.
Users did not always follow the instruction to include expansion terms when formulating CNF queries. Only 5 of the 12 queries included some synonym expansion, while 9 out of the 12 queries were modified to be more restrictive than the keyword query. These queries are less well expanded than the queries created by expert searchers [2,14]. The sections below show how that affects retrieval performance.

4.2.2 Effectiveness – TREC Evaluation
Retrieving on the TREC datasets, the user created CNF queries are fairly effective overall (Table 2). On average, the final Boolean CNF queries perform the best on all three evaluation metrics. These final CNF queries perform significantly better than the long keyword queries both at top ranks (P@5, 10) and at overall accuracy (MAP@1000), and outperform on average the short keyword queries. This result is consistent with prior research, where shorter keyword queries perform better than long queries [13].
Even though CNF queries are better than short keyword queries, the difference is not statistically significant. We look at the individual queries to understand which CNF queries are better and which are worse than the corresponding short keyword query.
For expansion queries in Table 3 (topics 354, 751 and 758), topic 354 is the only one that decreases performance. The reason is that "reporter" is stemmed and matches the word "report", which is a common word in the TREC newswire collection, causing the expanded query to match many false positives.
For the restrictive Boolean queries, the performance gain is less stable. 5 (topics 752, 760, 764, 799, 805) out of the 9 restrictive Boolean queries perform worse than the short keyword query in MAP. Even in top precision, which is usually the users' goal for using more restrictive queries, 4 out of the 9 restrictive Boolean queries perform worse than short keyword queries.
This result on TREC datasets shows that many of the CNF expansion queries and a few of the restrictive Boolean queries created through interacting with a larger dataset (the Web) and very different retrieval algorithms can still effectively retrieve relevant documents on a smaller dataset. For expansion queries, this may be because on the one hand, accurate CNF expansions on larger collections are less likely to match false positives on smaller collections, ensuring precision. On the other hand, the mismatch problem is likely to get worse on the smaller collections with fewer relevant documents, thus the CNF expansions are more likely to be useful in improving recall on the smaller collections. Overall, in both precision and recall, the CNF expansions created for larger collections may work well on the smaller collections. However, the more restrictive queries that perform well on large Web collections may not perform as well on smaller collections.

4.2.3 Effectiveness – Evaluation on Search Engines
The Boolean queries are clearly more effective than short keywords on the commercial Web search engines at top ranks, as shown in Table 4 (overall) and Table 5 (per topic).
Table 4 shows the evaluation on Google and Yahoo. On Google, final CNF queries significantly outperformed the short keyword queries in retrieval performance at top ranks. On Yahoo, CNF was on average better than short keyword, but the difference was not statistically significant, because 3 Boolean queries were worse than the short keyword baseline. (For topic 352 on Yahoo, the expansion phrase "channel tunnel" is too general and matches many false positives. For topic 354 on Yahoo, many false positives contain "Reporter", "Journalist" and "Correspondent" as titles of publications or newspapers, instead of as a person. For
topic 799 on Yahoo, the CNF query is only slightly worse, returning the only irrelevant result at rank 5.)
The user created Boolean queries outperform short keyword queries consistently, but different Boolean queries improve over the keyword queries for different reasons. The synonym expansion queries are better than keyword because they can solve the mismatch problems of the individual query terms in the keyword query. Topics 354, 751 and 758 are such examples. The restrictive type of Boolean queries outperforms keyword queries because the short keyword queries may match many false positives on the Web. A slightly more restrictive query can remove these false positives while still matching enough relevant documents to fill up the top ranks. Topics 704, 753 and 769 are examples.

Table 2. Retrieval performance on TREC datasets, averaged over the 12 topics. Bold-faced is the best run in each row.

              Keyword (short)   Keyword (long)   CNF (initial)   CNF (final)
  MAP@1000    0.1635 (l)        0.1038           0.1815 (l)      0.2017 (l)
  P@5         0.5833 (l)        0.3167           0.5333 (l)      0.6000 (l)
  P@10        0.5333 (l)        0.3000           0.5250 (l)      0.5500 (l,n)

  (l) means the run is significantly better than the long keyword baseline by a two tailed t-test at p < 0.05.
  (n) means significantly better than long keyword by sign test at p < 0.05.

Table 3. Per topic retrieval performance of final Boolean CNF query vs. short keyword query on TREC datasets. Bold faced is the better result of keyword and CNF queries in the row.

  TREC    Keyword (short)     CNF (final) (vs. short keyword)
  Topic   MAP      P@5        MAP      change     P@5    change
  352     0.0462   0.8        0.1175   154.3%     0.6    -25.00%
  354     0.0542   0.4        0.0328   -39.48%    0.0    -100.0%
  704     0.2167   0.4        0.3928   81.26%     0.8    100.0%
  751     0.1746   1.0        0.2170   24.28%     1.0    0.000%
  752     0.2237   1.0        0.1574   -29.64%    0.8    -20.00%
  753     0.3472   0.6        0.4736   36.41%     1.0    66.67%
  758     0.3144   1.0        0.3187   1.368%     1.0    0.000%
  760     0.1609   0.8        0.1279   -20.51%    1.0    25.00%
  764     0.1999   0.6        0.0180   -91.00%    0.4    -33.33%
  769     0.0143   0.0        0.4588   3108%      0.6    +inf
  799     0.1850   0.4        0.0946   -48.87%    0.0    -100.0%
  805     0.0247   0.0        0.0108   -56.28%    0.0    0.000%

Table 4. Retrieval performance with search engine evaluation, averaged over the 12 topics. Bold-faced are the better run(s) in each row.

                   Keyword (short)        CNF (final)
  Search engine    MAP@5     P@5          MAP@5          P@5
  Google           0.1586    0.6333       0.2247 (s,n)   0.8500 (s,n)
  Yahoo            0.1701    0.7167       0.2041         0.7167

  (s) means significantly better than the short keyword baseline by a two tailed t-test at p < 0.004.
  (n) means also significant by sign test at p < 0.004.

Table 5. Per topic retrieval performance in MAP@5 for final Boolean query vs. short keyword query on Google and Yahoo. Bold faced is the better result of keyword and CNF queries.

  TREC    Keyword (short)     CNF (final)
  Topic   Google    Yahoo     Google   change     Yahoo    change
  352     0.1133    0.1300    0.2000   76.52%     0.0000   -100.0%
  354     0.1462    0.1538    0.1923   31.53%     0.0192   -87.52%
  704     0.3800    0.3800    0.5000   31.58%     0.4000   5.263%
  751     0.0467    0.1600    0.2000   328.3%     0.2000   25.00%
  752     0.1368    0.1693    0.1693   23.76%     0.2000   18.13%
  753     0.0883    0.0800    0.2500   183.1%     0.1900   137.5%
  758     0.2083    0.2083    0.2083   0.000%     0.2083   0.000%
  760     0.2923    0.2923    0.3846   31.58%     0.3077   5.269%
  764     0.0000    0.0200    0.0833   +inf       0.2750   1275%
  769     0.0000    0.0464    0.0179   +inf       0.1940   318.1%
  799     0.1786    0.1786    0.1786   0.000%     0.1429   -19.99%
  805     0.3125    0.2219    0.3125   0.000%     0.3125   40.83%

4.2.4 Discussion
When comparing CNF queries with short keyword queries, the difference is not very significant on the TREC datasets; however, on the search engines, CNF queries are consistently better.
This difference is likely because of two reasons. Firstly, when the users created the CNF queries, they tuned the queries by observing their retrieval results returned from the search engines. Thus, as long as the user makes a serious effort, the tuned Boolean queries will be better performing than the keyword queries on the search engines that the users tuned their queries on. Secondly, only top rank performance was observed and measured with the search engines. Since results deeper down the rank list were not available to the users, they would tend to create highly restrictive queries that improve top precision. This could explain why many of the Boolean queries were more restrictive versions of the short keyword queries. These restrictive queries would likely increase top precision on the search engines (which searched against very large corpora), but would likely decrease lower rank performance as suggested by the deeper evaluations on the TREC datasets. On the much smaller TREC corpora, these restrictive queries will match far fewer documents, and thus could even hurt top precision. Overall, the more restrictive Boolean queries perform unstably at both top rank and lower rank levels on the smaller TREC datasets. This suggests that even though it may seem to the user that a restrictive query would be better, more often than not, synonym expansion is the more robust strategy of query formulation.

5. CONCLUSIONS
Boolean CNF expansion queries have the potential to significantly outperform keyword queries, leading to much more effective retrieval. This paper investigates whether ordinary search users with limited knowledge of CNF queries can formulate effective CNF queries using the WikiQuery interface.
Evaluations on TREC datasets show that versus lengthening the short keyword queries by adding more keywords, creating a Boolean structured query can be significantly more effective at both top and deeper level retrieval accuracy. These Boolean queries are also better performing than the short keyword queries on average. However this difference is not statistically significant on the TREC datasets. Evaluations of the user created Boolean queries
on commercial Web search engines show that these highly precise Boolean queries can consistently and significantly outperform the original short keyword queries in top precision.
Both expansion and restriction query modifications were common when the users created the Boolean queries from the short keyword queries. Some of the expansion queries included many synonyms for each original query term, just like those created by experts [14]. These carefully expanded CNF queries have been shown to outperform keyword queries in precision at all recall levels, because CNF expansion can effectively solve term mismatch, a common problem in retrieval with a large potential [14]. However, even with instructions and the guidance from WikiQuery, users still tended to create less well expanded queries. Users also tended to restrict the original keyword query by introducing phrases or more conjuncts, causing more mismatches between the query and the relevant documents. These restrictive queries might improve top precision, but deeper level evaluation on TREC datasets showed that these restrictive queries do not result in stable improvements at lower rank levels. This tendency to create the less effective restrictive queries is perhaps one of the reasons why novice users have difficulty creating effective Boolean queries or structured queries.

Use in Text Retrieval
Our results suggest that to improve users' interactions with the search engine, and to facilitate them in creating effective queries, users need to be carefully guided to create CNF expansion queries, and to be explicitly warned against the risky restrictive queries.

Classroom Use
This work used WikiQuery as an educational tool for students with limited knowledge about Boolean queries to learn to create effective Boolean queries in a short time. We observe that most students spent about 40 minutes per topic, trying out new queries and interacting with the search results to find effective formulations. Trial and error using the interactive interface of WikiQuery helped the students quickly and effectively learn the subject.
Open source search tools like WikiQuery and IR education can be mutually beneficial. These tools may become the appropriate playground for educational uses, while classroom uses can also provide a steady stream of traffic for these search tools.

Future Work
The WikiQuery website is still in its early stage. This work, as a pilot study, can be used to guide and prioritize the development of many new and helpful features for WikiQuery.
Search result presentation needs to be improved to help users quickly grasp why a particular document is returned and what terms in each conjunct of the CNF query are present in the document. Such understandings will allow users to efficiently identify further refinements of the CNF query to improve the results. On a commercial search engine, such user interface changes would be deemed too risky. A research oriented search engine might be the best place to lead the effort.
To further facilitate users in their CNF query creation, synonyms or other related words of each conjunct could be automatically suggested to the user, so that the user only needs to select the highly precise expansion terms out of the suggestions.
To better facilitate users in query refinement, novel interfaces that can automatically extract or highlight candidate expansion terms in result snippets or documents can be useful.
To help users decide whether to include one particular expansion term into a conjunct or not, tools that can compare rank list changes before and after a query change will be useful. In particular, tools that can present deeper rank level changes will enable the user to more accurately gauge the overall retrieval accuracy.
The WikiQuery website may also allow users to subscribe to the result pages of each CNF query, so that whenever a new relevant page appears on the Web, the user will be notified. This is the equivalent of a traditional routing task, and can be easily implemented given that search engines like Google already support user subscription to search engine result pages.

6. REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
[2] J. R. Baron, D. D. Lewis and D. W. Oard. TREC 2006 Legal Track Overview. In Proceedings of the Fifteenth Text REtrieval Conference (TREC '06), 2007.
[3] D. C. Blair and M. E. Maron. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Communications of the ACM, 28(3): 289-299, 1985.
[4] C. L. A. Clarke, G. V. Cormack and F. J. Burkowski. Shortest Substring Ranking (MultiText Experiments for TREC-4). In Proceedings of the Fourth Text REtrieval Conference (TREC-4), 1996.
[5] H. A. Feild, J. Allan and R. Jones. Predicting searcher frustration. In Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10), 34-41, 2010.
[6] S. Harter. Online Information Retrieval: Concepts, Principles, and Techniques. Academic Press, San Diego, California, 1986.
[7] M. Hearst. Improving full-text precision on short queries using simple constraints. In Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval (SDAIR '96), 1996.
[8] M. Hearst. Search User Interfaces. Cambridge University Press, 2009.
[9] W. Lancaster. Information Retrieval Systems: Characteristics, Testing and Evaluation. Wiley, New York, 1968.
[10] C. D. Manning, P. Raghavan and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[11] M. Mitra, A. Singhal and C. Buckley. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), 206-214, 1998.
[12] P. Reisner. Construction of a growing thesaurus by conversational interaction in a man-machine system. In Proceedings of the American Documentation Institute, 26th Annual Meeting, Chicago, Illinois, 1963.
[13] L. Zhao and J. Callan. Term necessity prediction. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, 2010.
[14] L. Zhao and J. Callan. Automatic term mismatch diagnosis for selective query expansion. To appear in Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12), 2012.
ezDL: An Interactive Search and Evaluation System

Thomas Beckers, Sebastian Dungs, Norbert Fuhr, Matthias Jordan, Sascha Kriewel
Information Engineering, University of Duisburg-Essen, Duisburg, Germany
[email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT
The open-source system ezDL is presented. It is an interactive search tool, a development platform for interactive IR systems, and an evaluation system. ezDL can be used as a meta-search system for heterogeneous sources or digital libraries, allows organizing and filtering of merged results, and offers support for search sessions as well as a personal library for storing different document types. The ezDL framework is easy to extend and is based on a service-oriented architecture. In addition, support for performing user studies and eye tracking is provided. ezDL has been used as a system in several funded research projects.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software

General Terms
Human Factors, Experimentation

Keywords
interactive search system, framework

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR 2012 Workshop on Open Source Information Retrieval (OSIR 2012), Portland, Oregon, USA.

1. INTRODUCTION
In this paper we present ezDL, an open-source software for building highly interactive search user interfaces with strategic support. (ezDL is licensed under GPL v3; other licenses can be used on request. The main web site for developers and further information can be found here: http://ezdl.de/developers.) It builds on the ideas developed and implemented within the Daffodil project from 2000 to 2009 [11, 16, 13], but uses more modern software technologies and interface design methods.
The ezDL framework can be characterized by three main purposes. It is foremost i) a working interactive tool for searching a heterogeneous collection of digital libraries. In addition to that, it is ii) a flexible software platform providing a solid base for writing customized applications, as well as iii) a system that can be used for many different types of user evaluations.
Today many systems covering one or more aspects of ezDL exist, but to the best of our knowledge the concept of unifying them into one single framework is unique. In the following paragraphs similar systems related to the different aspects of ezDL are presented.

Interactive Search Tools
Querium [9] is an interactive search system featuring a concept that focuses on complex recall oriented searches. It aims at preserving the context of searches and allows relevance feedback to generate alternative result sets. At the moment, the system is limited to two data sources.
Numerous tools exist that focus on storing and managing a personal library or citations. A popular example is Mendeley (http://www.mendeley.com/), which also offers different front ends and collaborative features. Other citation tools include CiteULike (http://www.citeulike.org/) and Connotea (http://www.connotea.org/).
CoSearch [1] is a collaborative web search tool. It offers a user interface that can simultaneously take input from different users sharing a single machine. Mobile devices can be used to contribute to a collaborative search session. Data is acquired by using a popular web search engine.

Development Platforms
SpidersRUs Digital Library Toolkit [8] is a search engine development tool. The developers strove for a balance between easiness of use and customizability. The toolkit also features
a GUI for the process of search engine creation. Results presentation follows common standards of popular web search engines. Support for complex search sessions, e.g. a tray or citation management tool, is not included.

Evaluation Systems
The Lemur project includes a query log tool bar that can be used to capture usage data. It can collect queries as well as user interaction such as mouse activity and is available as open source.
Bierig et al. [7] presented an evaluation and logging framework for user-centered and task-based experiments in interactive information retrieval that focuses on "multidimensional logging to obtain rich behavioural data" of searchers.

2. CONCEPTS
As a re-implementation of the Daffodil project [11], ezDL builds on many of the same concepts and principles as Daffodil. Like Daffodil it is a "search system for digital libraries aiming at strategic support during the information search process" [16]. Its primary target group is not that of casual users using a search system for short ad-hoc queries. Instead the software aims to support searchers during complex information tasks by addressing all the steps in the Digital Library Life Cycle, as well as integrating search models originally proposed by Marcia Bates [2, 3, 4].
The Digital Library Life Cycle divides the information workflow into five phases [18], beginning with the discovery of information resources, which in ezDL is supported through the Library Choice view. This is followed by the retrieval phase of information search, the collating of found information using the personal library and tagging, interpreting the information, and finally the re-presenting phase where new information is generated. In all phases, different so-called tactics or stratagems can be employed by searchers or information workers, which we try to support through ezDL.
The notion of tactics and stratagems as higher-level search activities was introduced by Bates [2, 3, 4]. Based on search tactics used by librarians and expert searchers, Bates describes basic moves, as well as higher-level tactics, stratagems, and finally strategies that build on lower-level activities.
ezDL already offers direct support for some of those higher-level activities, e.g. through the use of sharing functionalities to support collaborative idea generation, through term suggestions of synonyms or spelling variants, extraction of common results terms, or through icons in the result items that allow easy monitoring of performed activities.
During query formulation, ezDL provides term suggestions to the user (e.g. synonyms and related terms). These are an example of the concept of proactive system support. Bates describes "five levels of system involvement (SI) in searching" [4]. The proactive support of ezDL belongs to the third level, where a search system (through monitoring of user activities) can react to the search situation without prompting by the user. Users are informed of improve-
Tran [21] implemented a prototype of a support tool for the pearl growing stratagem. The tool shows citation relationships between documents in a graph and allows the user to follow these relationships and keep track of the search progress using document annotations. Figure 1 shows a screenshot of a pearl growing session with some documents marked as relevant. It is planned to include this tool in ezDL in the near future.

[Figure 1: A close-up of the pearl growing tool]

Proactive support of higher-level activities, such as suggestion of tactics and stratagems for improvable search situations [14, 15] or suggestion of search strategies with scaffolding support, is currently planned and will likely be available for ezDL within the next nine months.

3. ARCHITECTURE
ezDL is a continuation of the Daffodil project and therefore shares its main ideas: meta-search in digital libraries and strategic support for users. Its overall architecture likewise has inherited many features from Daffodil. Figure 2 provides a high-level overview of the system.
The system architecture makes extensive use of separation of concerns to keep interdependencies to a minimum and make the system more stable. This is true on the system level, where a clear separation exists between clients and backend, but also within the backend itself, where individual "agent" processes handle specific parts of the functionality, and even within these agents. The desktop client, too, is separated into multiple independent components called "tools". ezDL is completely written in Java using common frameworks and libraries.

3.1 The Backend
The backend provides a large part of the core functionality of ezDL: the meta-search facility, user authorization, a knowledge base about collected documents, as well as wrappers and services that connect to external services. Func-
ment options for their current move. Jansen and Pooch [12] ing of documents and queries in a personal library is also
demonstrated that proactive software agents assisting users located here.
during their search can result in improved performance of The right part of Figure 2 shows the structure of the
users. The effectiveness of such suggestions has also been backend. The components of the backend are agents: in-
shown for the Daffodil system [20]. dependent processes that provide a specific functionality to
the system. Agents use a common communication bus for
5
http://www.lemurproject.org/ transferring messages between each other.
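To make the agent abstraction concrete, the following is a minimal sketch of how such a backend component could register its services and dispatch incoming requests to handlers. All identifiers (Agent, MessageBus, RequestHandler, and so on) are hypothetical stand-ins for the pattern described here, not the actual ezDL API.

```java
// Hypothetical sketch of the agent / request-handler pattern; the names are
// illustrative only and do not correspond to the real ezDL code base.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

interface Message { String service(); Object payload(); }

interface RequestHandler { void handle(Message request); }

interface MessageBus {
    void register(String agentName, Consumer<Message> receiver); // announce to the Directory
    void send(String agentName, Message message);
}

class Agent {
    private final String name;
    private final MessageBus bus;
    private final Map<String, RequestHandler> handlers = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    Agent(String name, MessageBus bus) {
        this.name = name;
        this.bus = bus;
    }

    // One handler per request type; handlers contain the actual business logic.
    void addHandler(String service, RequestHandler handler) {
        handlers.put(service, handler);
    }

    // Register on the bus and process incoming requests concurrently.
    void start() {
        bus.register(name, message -> pool.submit(() -> {
            RequestHandler handler = handlers.get(message.service());
            if (handler != null) {
                handler.handle(message);
            }
        }));
    }
}
```

Because each functionality lives in its own agent process behind the bus, a failure in one handler or agent leaves the remaining services untouched, which is the fault isolation discussed in the next subsection.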

Figure 2: Overall architecture of ezDL. The diagram shows multiple clients connecting through message transfer agents (MTAs) to a communication bus (e.g. CORBA or JMS), which links the backend agents (Directory, Search, Query History and Repository) with wrapper agents for remote sources such as ACM, IEEE, DBLP and Wiley.
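The following sketch traces a single search request through the components of Figure 2, anticipating the message flow that is described in prose in the next subsections. As above, the interfaces and class names are hypothetical illustrations, not the real ezDL types.

```java
// Hypothetical walk-through of a search request over the architecture in
// Figure 2; all identifiers are illustrative, not actual ezDL classes.
import java.util.ArrayList;
import java.util.List;

class SearchAgentSketch {
    interface Directory { List<String> wrappersFor(List<String> remoteServices); }
    interface Wrapper { List<Result> search(QueryTree ezdlQuery); }
    interface WrapperRegistry { Wrapper lookup(String wrapperName); }
    static class Result { String title; double score; }
    static class QueryTree { /* ezDL's internal query representation */ }

    private final Directory directory;
    private final WrapperRegistry wrappers;

    SearchAgentSketch(Directory directory, WrapperRegistry wrappers) {
        this.directory = directory;
        this.wrappers = wrappers;
    }

    // Invoked when an MTA forwards a client's search request.
    List<Result> runSearch(QueryTree query, List<String> remoteServices) {
        List<Result> merged = new ArrayList<Result>();
        // 1. Ask the Directory which wrapper agents serve the requested sources.
        for (String name : directory.wrappersFor(remoteServices)) {
            // 2. Each wrapper translates the query for its remote service and
            //    returns that service's answer.
            merged.addAll(wrappers.lookup(name).search(query));
        }
        // 3. Duplicates would be merged and the results reranked here before
        //    the answer set is sent back to the MTA and on to the client.
        return merged;
    }
}
```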

The Service-based Agent Infrastructure
Since every kind of functionality is taken care of by a different agent, the crash of one agent generally only disrupts that particular functionality. For example, if the search agent crashes, detail requests and the personal library are still working. Also, it is possible to run multiple agents of each kind for load balancing and as a fail-safe mechanism.
Agents are subdivided into the main agent behaviour (registering with the Directory, sending and receiving messages, managing resources) and components that deal with specific requests. These components, the request handlers, are independent and process requests concurrently.
Beginning on the left, an MTA (Message Transfer Agent) is an agent that provides clients with a connection point to the backend. MTAs are responsible for authenticating users and translating requests from clients into messages to certain agents. E.g., if a client requests a search for a given query, the query from the client is translated into a message to the Search agent. This mechanism creates a clear separation between the client view of the system and the internal workings: the client doesn't have to know how many agents are serving search queries, and new search agents could be instantiated as the system load demands. Currently there is only one MTA implementation, which uses a binary protocol over a TCP connection, but it is possible to provide other protocols (e.g. SOAP) by using separate MTA implementations.
The Directory is a special agent that keeps a list of agents and the services they provide. Upon start, each agent registers with the Directory and announces the services it provides.
The connection to remote (or local) search services (e.g., digital libraries or information retrieval systems) is managed by wrapper agents, shown in Figure 2 as the four agents on the right-hand side. They translate the internal query representation of ezDL into one that the remote service can parse, and translate the response of the remote service into an appropriate document representation to be handled by ezDL.

Example: Running Search Queries
If a client requests a search, it sends a request to the MTA with a query in ezDL notation and a list of remote services that the query should be run on. The request is handled by the MTA, which forwards it to the Search agent. The Search agent asks the Directory for the names of the agents that provide a connection to the remote services requested by the client. After receiving that list, the Search agent forwards the query to each of these agents. The agents then translate the query into something that the remote service understands and send the answer of the remote service back to the Search agent. The Search agent collects all answers from all the remote services, merges duplicates and reranks them. Reranking is performed either by using the original RSVs or by using standard Lucene (http://lucene.apache.org/core/) functionality. The answer set is then sent back to the MTA that requested the search. The MTA sends the answer to the client. The Search agent also forwards the collected documents to the Repository agent, which is responsible for serving requests for details on documents (e.g., if the user wants to see the full text).

3.2 The Frontend
There are multiple frontends for ezDL: among them the basic desktop client and a web client. Specialized frontends exist for various applications (see Use Cases). Clients for iOS and Android tablets are currently being developed. This subsection details the architecture of the desktop client, since this is the main client for ezDL.

Tools and Perspectives
A tool comprises a set of logically connected functionalities. Each tool has one or more tool views, interactive display components that can be placed somewhere on the desktop. A configuration of available tools and the specific layout of their tool views on the desktop is called a perspective. Users can modify existing predefined perspectives as well as create custom perspectives.

Figure 4: The Desktop client during a search session

The desktop client already has many built-in tools and functionalities and can be easily extended (see Figure 4):
• The Search Tool (A) offers a variety of query forms for different purposes and views to present the results in list or grid form, as well as a Library Choice view for selecting information sources. Results can be sorted or grouped by different criteria, filtered, and exported. An extraction function (F) can be used to extract frequent terms, authors, or other features from the result and visualize them in the form of a list, a bar chart or a term cloud. Grouping criteria or extraction strategies are encapsulated and new ones can easily be added, as can be new renderers for result surrogates.
• The Personal Library (B) allows authenticated users to store documents or queries persistently. Within their personal collection, users can filter, group and sort (e.g. by date of addition), organize the documents with personal tags, and share them with other users. Additional documents can be imported into the personal library as long as their metadata is available in BibTeX format.
• The Search History (C) lists past queries for re-use and allows grouping by date and filtering.
• The Detail View (D) shows additional details on individual documents, such as thumbnails or short summaries where available, or additional metadata not included in the surrogate that is shown in the result list. A detail link can be provided to retrieve the full text.
• A Tray (E) can be used to temporarily collect relevant documents within a search session.

Figure 3: Architecture of the desktop client

Communication with the Backend
Like the backend, the desktop client uses a messaging infrastructure for communication between otherwise independent components. In Figure 3 a diagram of the components is shown. On the left, four of the available tools can be seen with their connection to the internal communication infrastructure (search, personal library, details, and query
history). On the right-hand side, a few subsystems are presented, one of which is the external communication facility that connects the client to the backend.
As an example, if the user enters a query in the search tool and presses the "search" button, an internal message is sent to the communication facility, which transmits the query to the backend. When the answer is received, the communication facility routes the message back to the search tool.
Since, from the client's point of view, the backend is hidden behind the MTA, further details are omitted in the backend part of Figure 3.

Query Processing and Proactive Support
The queries that users enter are expressed in a grammar specific to ezDL that is quite flexible and allows simple queries like term1 term2 term3 as well as more complicated ones like Title=term1 AND (term2 NEAR/2 term3). Internally, the query is represented as a tree structure that can also keep images as comparison values, so ezDL can be used to specify image search queries (this mechanism is used in the Khresmoi project, see Section 5, to allow general physicians and radiologists to search for medical images).
During query formulation, the user's interaction is observed by the system. If the system notices a break in the user's typing, the query is processed by modules of the proactive support subsystem that can either ask the backend for suggestions or calculate them directly in the frontend. The suggestions can replace query terms, e.g. by spelling corrections, insert new terms (e.g., for synonym expansion), or tag terms with concepts from an ontology. The ontology items become part of the query so that a query can contain both plain text terms and ontology terms. When suggestions are found for a term, the term is marked by an underline in the query text field and a popup list is shown that presents the suggestions (see Figure 5).

Figure 5: The search form with suggestions

3.3 Extending and Customizing ezDL
Each agent and the desktop client are extensible using a plugin system. Plugins are registered at a central component that can later be asked to return plugin objects of a specific type. As an example, it is possible to add a new proactive suggestion module to the system that implements a new way of retrieving suggestions. Also, the popup list that shows the suggestions can be replaced by an alternative. Further uses of the plugin system are export and import modules and modules that extract information from the result list.
Adding a new service is usually done by implementing a new agent. There is an abstract class that takes care of most issues except the actual functionality, which is usually implemented using specialized classes (request handlers) for which there are abstract implementations, too. Thus, developers can concentrate on the business logic.
Connecting to a new collection for searching (a digital library, a local IR system, a BibTeX file, etc.) is accomplished by implementing a wrapper agent. These are agents specialized in translating between ezDL and a remote system. Remote systems can be those that provide a stable API like SOAP or SQL, but also those that only have a web site and a search form. ezDL has built-in support for the most common fields (e.g. title, author, publication year, abstract) and data types (e.g. text, numbers, images). There are abstract wrappers available to quickly connect to a Solr (http://lucene.apache.org/solr/) server. If required, web pages can be scraped using an elaborate toolkit that is configured by an XML file. Because of this, even digital libraries without a proper API can be connected. Sometimes, digital libraries change their web page layout, breaking scripts that parse their HTML. Configuring the page scraping using an XML file makes automatic repairs of the configuration possible. See [10] for an example implementation based on Daffodil. The approach outlined in that work uses repeated queries to infer the template elements of the web pages and step-wise generalisation to find the location of known information on the page.
There is also a library of code for translating the ezDL query representation into other languages.
Agents, and thus wrapper agents, announce themselves to the Directory agent when started. The client can ask the backend for a list of known wrapper agents, so there is no need to change any code or configuration outside of the agent. This also enables developers to store the code and put it under version control independently from the main ezDL code.
Often, services in the backend are used in the client in an individual tool. One example of this is the search facility, which consists of the search tool in the Swing client and the search agent in the backend. Writing a new tool for the Swing client can be done by implementing an OSGi plugin. The tool code itself is fairly simple since there is an abstract implementation for the glue code. The remaining task is implementing a Swing GUI and communicating with the backend by firing events and listening for an answer.

4. EVALUATING SEARCH SYSTEMS
To support user-centred evaluations, ezDL has a built-in evaluation mode that addresses many of the major challenges inherent in setting up evaluation tasks and tracking user activity during the experiments. The following is a brief overview of those functionalities within ezDL directly designed to support evaluations.

Logging user actions
For evaluations with actual users, all user actions performed with the system should be logged for later inspection and analysis. ezDL has a built-in logging facility that stores all the interaction data of the user in a relational database (currently MySQL is used). A log session comprises all log events that a user or the system has triggered. A log event has i) a unique name identifying this type of event, ii) timestamps from the frontend and the backend, iii) a sequence number to ensure the correct order, and iv) parameters as multiple key/value pairs. For example, when a user performs a search for information retrieval in the DBLP and ACM digital libraries, the corresponding log event may look like this:

  event:
    name: "search"
    clientTimestamp: 1/4/2012 15:26:32,1234
    timestamp: 1/4/2012 15:26:32,3456
    sequenceNumber: 10
    parameters:
      query: "information retrieval"
      sources: dblp, acm

The logging facility takes care of allocating activities to sessions and users. If it is required to log some previously unlogged action, this can simply be integrated by sending a corresponding logging message to the backend.

Tracking AOIs
Gaze tracking is a method for user-centred evaluation that has recently gained popularity within the IR field [5, 7]. A challenge for logging highly interactive systems with changing interfaces and moving components is keeping track of their position, so that gaze points or fixation data of users can be aggregated across so-called Areas of Interest (AOIs, see Figure 6). This feature has been integrated into ezDL with the help of the AOILog framework [6].

Figure 6: An ezDL client overlayed with AOIs for eye tracking (from the HIIR project)

Fixed layout on screen
The layout of the desktop can be locked to keep UI-related variance low. With a fixed layout it is no longer possible for a test subject to open additional tool views or change the layout of the desktop.

Loading predefined perspectives
Predefined perspectives can be loaded immediately after the system has been started. This allows the evaluator to create custom perspectives that can be used for an evaluation without selecting them manually.

Splash screen for choosing evaluation settings
A splash screen can be enabled that is shown before starting the system. It can be used to choose and set settings for the evaluation session, e.g. a search task description or the system variant when doing a comparison of different UIs or system features.
Several user studies have been performed and experimental systems implemented using ezDL as a base system. The next section will present some of them in more detail.

5. USE CASES
ezDL is currently running as a live system, and is being used and extended in a number of projects of various sizes.
The live system (http://www.ezdl.de/) features all core functionalities that are not part of a more specific project. These include a simple and an advanced search function, various result manipulation options, a temporary document store, and exporting of meta information (e.g. in BibTeX format). Registered users can also use a personal library to store, annotate and share found or imported documents. Currently, nine different digital libraries are connected to the system, focusing on computer science libraries but including others like PubMed and the Amazon catalogue. The system is publicly available, still under constant development and updated regularly.
Khresmoi (http://www.khresmoi.eu/) is a four-year project funded by the EC, which aims at building a multilingual and multimodal search system for biomedical documents. The ezDL framework is used for all user interfaces developed within the project. These include variations of the stand-alone Swing client, such as an interface for searching medical images, including 3D data. Another version of the interface will be customized to the needs of general practitioners. For casual users with health-related information needs, an easy-to-use web-based interface (see Figure 7) is under development. From a functional point of view, numerous new data sources were made available. The set of searchable data types was extended to cover the specific demands of the medical domain. The system allows the user to specify an image as a query to perform a similarity search. Additional collaborative and social functions will be added to the full client in later versions.
Within the INEX 2010 Interactive Track, ezDL was used to observe how users act during their search sessions [19]. Valuable insights on user behaviour were gained. An application for viewing the logged data and a questionnaire tool controlling the experiment flow have been implemented.
For the ongoing CAIR project (http://www.uni-weimar.de/cms/index.php?id=17632), an advanced version of the 2010 INEX ezDL system is used. To answer the question of whether clustering of results can improve the efficiency of searches with vague information needs, ezDL was expanded by a clustering service and visualization [17] (see Figure 8). New data sources and a browser were added to the system. For the evaluation, a task selection and a questionnaire tool were developed. Log data generated by ezDL can be analysed automatically.
The AOI logging framework mentioned in Section 4 was implemented as part of the HIIR (Highly Interactive IR) project (http://www.is.inf.uni-due.de/projects/hiir/index.html.en). The project's goal is improving interaction with the system by considering the users' cognitive efforts [23, 22].

6. CONCLUSION AND OUTLOOK

We presented ezDL, a framework system for interactive retrieval and its evaluation. Building upon state-of-the-art interface technology and usability results, ezDL can provide an advanced user interface for many IR applications. The system can also be easily extended, at the functionality level as well as at the presentation level; thus, new concepts for the design of IR user interfaces can be integrated into ezDL with little effort. Furthermore, the system provides extensive support for performing user-oriented evaluations. In the same way as there are various experimental IR backend systems, there is now an IR frontend system that allows for easy experimentation and application of interactive retrieval.

Acknowledgments
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement 257528 (KHRESMOI). Parts of this work were supported by the German Science Foundation under grant nos. FU 205/24-1 (HIIR) and FU 205/22-1 (CAIR).

Figure 7: The web client used in the Khresmoi project

Figure 8: The ezDL-based system extended with clustering functionality developed within the CAIR project

7. REFERENCES
[1] S. Amershi and M. R. Morris. CoSearch: a system for co-located collaborative web search. In Proceedings of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, CHI '08, pages 1647-1656, New York, NY, USA, 2008. ACM.
[2] M. J. Bates. Idea tactics. Journal of the American Society for Information Science, 30(5):280-289, 1979.
[3] M. J. Bates. Information search tactics. Journal of the American Society for Information Science, 30(4):205-214, 1979.
[4] M. J. Bates. Where should the person stop and the search interface start? Information Processing and Management, 26(5):575-591, 1990.
[5] T. Beckers and N. Fuhr. User-oriented and eye-tracking-based evaluation of an interactive search system. In 4th Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2010) @ IIiX 2010, 2010.
[6] T. Beckers and D. Korbar. Using eye-tracking for the evaluation of interactive information retrieval. In Proceedings of INEX 2010, pages 236-240, 2011.
[7] R. Bierig, J. Gwizdka, and M. Cole. A user-centered experiment and logging framework for interactive information retrieval. In Understanding the User, workshop in conjunction with SIGIR '09, 2009.
[8] M. Chau and C. H. Wong. Designing the user interface and functions of a search engine development tool. Decision Support Systems, 48(2):369-382, 2010.
[9] A. Diriye and G. Golovchinsky. Querium: a session-based collaborative search system. In Proceedings of the 34th European Conference on Advances in Information Retrieval, ECIR '12, pages 583-584, Berlin, Heidelberg, 2012. Springer-Verlag.
[10] A. Ernst-Gerlach. Semiautomatisches Pflegen von Wrappern. Diplomarbeit, Universität Dortmund, FB Informatik, 2004.
[11] N. Fuhr, C.-P. Klas, A. Schaefer, and P. Mutschke. Daffodil: An integrated desktop for supporting high-level search activities in federated digital libraries. In Research and Advanced Technology for Digital Libraries. 6th European Conference, ECDL 2002, pages 597-612, Heidelberg et al., 2002. Springer.
[12] B. J. Jansen and U. Pooch. Assisting the searcher: utilizing software agents for web search systems. Internet Research: Electronic Networking Applications and Policy, 14(1):19-33, 2004.
[13] C.-P. Klas, S. Kriewel, and A. Schaefer. Daffodil - Nutzerorientiertes Zugangssystem für heterogene digitale Bibliotheken. dvs Band, 2004.
[14] S. Kriewel. Unterstützung beim Finden und Durchführen von Suchstrategien in Digitalen Bibliotheken. PhD thesis, University of Duisburg-Essen, 2010.
[15] S. Kriewel and N. Fuhr. An evaluation of an adaptive search suggestion system. In 32nd European Conference on Information Retrieval Research (ECIR 2010), 2010.
[16] S. Kriewel, C.-P. Klas, A. Schaefer, and N. Fuhr. Daffodil - strategic support for user-oriented access to heterogeneous digital libraries. D-Lib Magazine, 10(6), June 2004. Available at http://www.dlib.org/dlib/june04/kriewel/06kriewel.html.
[17] M. Lechtenfeld and N. Fuhr. Result clustering supports users with vague information needs. In Proceedings of the 12th Dutch-Belgian Information Retrieval Workshop 2012, Ghent, Belgium, February 2012.
[18] A. Paepcke. Digital libraries: Searching is not enough - what we learned on-site. D-Lib Magazine, 2(5), May 1996. http://www.dlib.org/dlib/may96/stanford/05paepcke.html.
[19] N. Pharo, T. Beckers, R. Nordlie, and N. Fuhr. Overview of the INEX 2010 Interactive Track. In Proceedings of the 9th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2010), 2011.
[20] A. Schaefer, M. Jordan, C.-P. Klas, and N. Fuhr. Active support for query formulation in virtual digital libraries: A case study with DAFFODIL. In A. Rauber, C. Christodoulakis, and A. M. Tjoa, editors, Research and Advanced Technology for Digital Libraries. Proc. European Conference on Digital Libraries (ECDL 2005), Lecture Notes in Computer Science, Heidelberg et al., 2005. Springer.
[21] V. T. Tran. Entwicklung einer Unterstützung für Pearl Growing. Diplomarbeit, Universität Duisburg-Essen, 2011.
[22] V. T. Tran and N. Fuhr. Quantitative analysis of search sessions enhanced by gaze tracking with dynamic areas of interest. In The International Conference on Theory and Practice of Digital Libraries 2012. Springer, to appear, September 2012.
[23] V. T. Tran and N. Fuhr. Using eye-tracking with dynamic areas of interest for analyzing interactive information retrieval. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, to appear, August 2012.

Apache Lucene 4
Andrzej Białecki, Robert Muir, Grant Ingersoll
Lucid Imagination
{andrzej.bialecki, robert.muir, grant.ingersoll}@lucidimagination.com

ABSTRACT
Apache Lucene is a modern, open source search library designed to provide both relevant results as well as high performance. Furthermore, Lucene has undergone significant change over the years, growing from a one-person project into one of the leading search solutions available. Lucene is used in a vast range of applications, from mobile devices and desktops through Internet-scale solutions. The evolution of Lucene has been quite dramatic at times, none more so than in the current release of Lucene 4.0. This paper presents both an overview of Lucene's features as well as details on its community development model, architecture and implementation, including coverage of its indexing and scoring capabilities.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance, Design, Experimentation

Keywords
Information Retrieval, Open Source, Apache Lucene.

1. INTRODUCTION
Apache Lucene is an open source Java-based search library providing Application Programming Interfaces for performing common search and search-related tasks like indexing, querying, highlighting, language analysis and many others. Lucene is written and maintained by a group of contributors and committers of the Apache Software Foundation (ASF) [1] and is licensed under the Apache Software License v2 [2]. It is built by a loosely knit community of "volunteers" (as the ASF views them; most contributors are paid to work on Lucene by their respective employers) following a set of principles collectively known as the "Apache Way" [3].
Today, Lucene enjoys widespread adoption, powering search on many of today's most popular websites, applications and devices, such as Twitter, Netflix and Instagram [20, 4, 5], as well as many other search-based applications [6]. Lucene has also spawned several search-based services such as Apache Solr [7] that provide extensions, configuration and infrastructure around Lucene as well as native bindings for programming languages other than Java. As of this writing, Lucene 4.0 is on the verge of being officially released (it likely will be released by the time of publication) and represents a significant milestone in the development of Lucene due to a number of new features and efficiency improvements as compared to previous versions of Lucene. This paper's focus will primarily be on Lucene 4.0.
The main capabilities of Lucene are centered on the creation, maintenance and accessibility of the Lucene inverted index [31]. After reviewing Lucene's background in Section 2 and related work in Section 3, the remainder of this paper will focus on the features, architecture and open source development methodology used in building Lucene 4.0. In Section 4 we provide a broad overview of Lucene's features. In Section 5, we examine Lucene's architecture and functionality in greater detail by looking at how Lucene implements its indexing and querying capabilities. Section 6 will detail Lucene's open source development model and how it directly contributes to the success of the project. Section 7 will provide a meta-analysis of Lucene's performance in various search evaluations such as TREC, while Sections 8 and 9 will round out the paper with a look at the future of Lucene and the conclusions that can be drawn from this paper, the project and the broader Lucene community.

2. BACKGROUND
Originally started in 1997 by Doug Cutting as a means of learning Java [8] and subsequently donated to The Apache Software Foundation (ASF) in 2001 [9], Lucene has had 32 official releases encompassing major, minor and patch releases [10, 11]. The most current of those releases at the time of writing is Lucene 3.6.0. From its earliest days, Lucene has implemented a modified vector space model that supports incremental modifications to the index [12, 19, 37]. For querying, Lucene has developed extensively from the first official ASF release of 1.2. However, even from the 1.2 release, Lucene supported a variety of query types, including: fielded terms with boosts, wildcards, fuzzy matching (using Levenshtein distance [13]), proximity searches and boolean operators (AND, OR, NOT) [14]. Lucene 3.6.0 continues to support all of these queries and the many more that have been added throughout the lifespan of the project, including support for regular expressions, complex phrases, spatial distances and arbitrary scoring functions based on the values in a field (e.g. using a timestamp or a price as a scoring factor) [10]. For more information on these features and Lucene 3 in general, see [15].
Three years in the making, Lucene 4.0 builds on the work of a number of previous systems and ideas, not just Lucene itself.
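As a concrete illustration of the query types just listed, the following sketch builds a boosted term query, a fuzzy query, a sloppy phrase query and a boolean combination programmatically. It follows the Lucene 3.x/4.x style of the era; exact constructors and packages differ between releases, so treat it as indicative rather than as the definitive API.

```java
// Illustrative only: constructing a few of the query types mentioned above.
// Class names follow the Lucene 4.x API; details vary between releases.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {
    public static void main(String[] args) {
        // Fielded term query with a boost.
        TermQuery title = new TermQuery(new Term("title", "lucene"));
        title.setBoost(2.0f);

        // Fuzzy query based on Levenshtein distance.
        FuzzyQuery fuzzy = new FuzzyQuery(new Term("body", "serch"));

        // Sloppy phrase query: up to two intervening positions allowed.
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("body", "open"));
        phrase.add(new Term("body", "source"));
        phrase.setSlop(2);

        // Boolean combination; AND, OR and NOT map onto Occur flags.
        BooleanQuery query = new BooleanQuery();
        query.add(title, BooleanClause.Occur.MUST);
        query.add(fuzzy, BooleanClause.Occur.SHOULD);
        query.add(phrase, BooleanClause.Occur.MUST_NOT);

        System.out.println(query);
    }
}
```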

Lucene incorporates a number of new models for calculating • Segmented indexing with merging and pluggable merge
similarity, which will be described later. Others have also policies [19]
modified Lucene over the years as well: [16] modified Lucene to
add BM25 and BM25F; [17] added “sweet spot similarity” and • Abstractions to allow for different strategies for I/O,
ILPS at the U. of Amsterdam has incorporated language modeling storage and postings list data structures [36]
into Lucene [18]. Lucene also includes a number of new • Transactional support for additions and rollbacks
abstractions for logically separating out the index format and
related data structures (Lucene calls them Codecs and they are • Support for a variety of term, document and corpus
similar in theory to Xapian’s Backends [32]) from the storage level statistics enabling a variety of scoring models [24].
layer - see the section Codec API for more details.
4.3 Querying
3. RELATED WORK On the search side, Lucene supports a variety of query options,
There are numerous open source search engines available today along with the ability to filter, page and sort results as well as
[30], with different feature sets, performance characteristics, and perform pseudo relevance feedback. For querying, Lucene
software licensing models. Xapian [32] is a portable IR library provides over 50 different kinds of query representations, as well
written in the C++ programming language that supports as several query parsers and a query parsing framework to assist
probabilistic retrieval models. The Lemur Project [33] is a toolkit developers in writing their own query parser [24]. More
for language modeling and information retrieval. The Terrier IR information on query capabilities will be provided later.
platform [34] is an open-source toolkit for research and
experimentation that supports a large variety of IR models. Additionally, Lucene 4.0 now supports a completely pluggable
Managing Gigabytes For Java (MG4J) [35] is a free full-text scoring model [24] system that can be overridden by developers.
search engine designed for large document collections. It also ships with several pre-defined models such as Lucene’s
traditional vector-space scoring model, Okapi BM25 [21],
4. LUCENE 4 FEATURES Language Modeling [25], Information Based [22] and Divergence
Lucene 4.0 consists of a number of features that can be broken from Randomness [23].
down into four main categories: analysis of incoming content and
queries, indexing and storage, searching, and ancillary modules 4.4 Ancillary Features
(everything else). The first three items contribute to what is Lucene’s ancillary modules contain a variety of capabilities
commonly referred to as the core of Lucene, while the last commonly used in building search-based applications. These
consists of code libraries that have proven to be useful in solving libraries consist of code that is not seen as critical to the indexing
search-related problems (e.g. result highlighting.) and searching process for all people, but nevertheless useful for
many applications. They are packaged separately from the core
4.1 Language Analysis Lucene library, but are released at the same time as the core and
The analysis capabilities in Lucene are responsible for taking in share the core’s version number. There are currently 13 different
content in the form of documents to be indexed or queries to be modules and they include code for performing: result highlighting
searched and converting them into an appropriate internal (snippet generation), faceting, spatial search, document grouping
representation that can then be used as needed. At indexing time, by key (e.g. group all documents with the same base URL
analysis creates tokens that are ultimately inserted into Lucene’s together), document routing (via an optimized, in-memory, single
inverted index, while at query time, tokens are created to help document index), point-based spatial search and auto-suggest.
form appropriate query representations. The analysis process
consists of three tasks which are chained together to operate on 5. ARCHITECTURE AND
incoming content: 1) optional character filtering and IMPLEMENTATION
normalization (e.g. removing diacritics), 2) tokenization, and 3)
Lucene’s architecture and implementation has evolved and
token filtering (e.g. stemming, lemmatization, stopword removal,
improved significantly over its lifetime, with much of the work
n-gram creation). Analysis is described in greater detail in the
focused around usability and performance, with the work often
section on Lucene’s document model below.
falling into the areas of memory efficiencies and the removal of
synchronizations. In this section, we’ll detail some of the
4.2 Indexing and Storage commonly used foundation classes of Lucene and then look at
Lucene’s indexing and storage layers consist of the following how indexing and searching are built on top of these. To get
primary features, many of which will be discussed in greater started, Figure 1 illustrates the high-level architecture of Lucene
detail in the Architecture and Implementation section: core.
• Indexing of user defined documents, where documents
can consist of one or more fields containing the content 5.1 Foundations
to be processed and each field may or may not be There are two main foundations of Lucene 4: text analysis and our
analyzed using the analysis features described earlier. use of finite state automata, both of which will be discussed in the
subsections below.
• Storage of user defined documents.
5.1.1 Text Analysis
• Lock-free indexing [20] The text analysis chain produces a stream of tokens from the input
• Near Real Time indexing enabling documents to be data in a field (Figure 3). Tokens in the analysis chain are
searchable as soon as they are done indexing represented as a collection of “attributes”. In addition to the
expected main “term” attribute that contains the token value there

Figure 2 Structure of a Lucene segment.

(referred to as “segments”) that are periodically merged into


larger segments to minimize the total number of index parts [19].
Figure 1 Lucene's Architecture
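To complement Figure 1, here is a minimal end-to-end sketch that runs content through the layers shown there: an analyzer feeding an IndexWriter, a Directory holding the resulting segments, and an IndexReader/IndexSearcher answering a term query. It is written against the Lucene 4.x API (e.g. TextField, DirectoryReader); class names differ in the 3.x line, so adjust for the release in use.

```java
// A minimal end-to-end sketch of the architecture in Figure 1:
// analysis -> IndexWriter -> segment(s) -> IndexReader/IndexSearcher.
// Written against the Lucene 4.x API; adjust class names for other versions.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexAndSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();                   // storage layer (Directory API)
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, config);    // builds in-memory segments

        Document doc = new Document();                        // a flat list of fields
        doc.add(new TextField("title", "Apache Lucene 4", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();                                       // flush and create a commit point

        DirectoryReader reader = DirectoryReader.open(dir);   // point-in-time view of the index
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")), 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title"));
        }
        reader.close();
    }
}
```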
5.2.1 Document Model
can be many other attributes associated with a token, such as Documents are modeled in Lucene as a flat ordered list of fields
token position, starting and ending offsets, token type, arbitrary with content. Fields have name, content data, float weight (used
payload data (a byte array to be stored in the index at the current later for scoring), and other attributes, depending on their type,
position), integer flags, and other custom application-defined which together determine how the content is processed and
attributes (e.g. part-of-speech tags). represented in the index. There can be multiple fields with the
Analysis chains consist of character filters (useful for stripping same name in a document, in which case they will be processed
diacritics, for instance), tokenizers (which are the sources of token sequentially. Documents are not required to have a unique
streams) and series of token filters that modify the original token identifier (though they often carry a field with this role for
stream. Custom token attributes can be used for passing bits of application-level unique key lookup) - in the process of indexing
per-token information between the elements of the chain. documents are assigned internal integer identifiers.
Lucene includes a total of five character filtering 5.2.2 Field Types
implementations, 18 tokenization strategies and 97 token filtering
There are two broad categories of fields in Lucene documents -
implementations and covers 32 different languages [24]. These
those that carry content to be inverted (indexed fields) and those
token streams performing specific functions such as tokenization
with content to be stored as-is (stored fields). Fields may belong
by patterns, rules and dictionaries (e.g. whitespace, regex, Chinese
to either or both categories (e.g. with content both to be stored and
/ Japanese / Korean, ICU), specialized token filters for efficient
inverted). Both indexed and stored fields can be submitted for
indexing of numeric values and dates (to support trie-based
storing / indexing, but only stored fields can be retrieved - the
numerical range searching), language-specific stemming and stop
inverted data can be accessed and traversed using a specialized
word removal, creation of character or word-level n-grams,
API.
tagging (UIMA), etc. Using these existing building blocks, or
Indexed fields can be provided in plain text, in which case it will
custom ones, it’s possible to express very complex text analysis
be first passed through text analysis pipeline, or in its final form
pipelines.
of a sequence of tokens with attributes (so called “token stream”).
5.1.2 Finite State Automata Token streams are then inverted and added to in-memory
Lucene 4.0 requires significantly less main memory than previous segments, which are periodically flushed and merged. Depending
releases. The in-memory portion of the inverted index is on the field options, various token attributes (such as positions,
implemented with a new finite state transducer (FST) package. starting / ending offsets and per-position payloads) are also stored
Lucene’s FST package supports linear time construction of the with the inverted data. It’s possible e.g. to omit positional
minimal automaton [38], FST compression [39], reverse lookups, information while still storing the in-document term frequencies,
and weighted automata. Additionally, the API supports pluggable on a per-field basis [36].
output algebras. Synonym processing, Japanese text analysis, spell A variant of an indexed field is a field where the creation and
correction, auto-suggest are now all based on Lucene’s automata storage of term frequency vectors was requested. In this case the
package, with additional improvements planned for future token stream is used also for building a small inverted index
releases. consisting of data from the current field only, and this inverted
data is then stored on a per-document and per-field basis. Term
5.2 Indexing frequency vectors are particularly useful when performing
Lucene uses the well-known inverted index representation, with document highlighting, relevance feedback or when generating
additional functionality for keeping adjacent non-inverted data on search result snippets (region of text that best matches the query
a per-document basis. Both in-flight and persistent data uses terms).
variety of encoding schemas that affect the size of the index data Stored fields are typically used for storing auxiliary per-document
and the cost of the data compression. Lucene uses pluggable data that is not searchable but would be cumbersome to obtain
mechanisms for data coding (see the section on Codec API below) otherwise (e.g. it would require retrieval from a separate system).
and for the actual storage of index data (Directory API). This data is stored as byte arrays, but can be manipulated through
Incremental updates are supported and stored in index extents a more convenient API that presents it as UTF-8 strings, numbers,

arrays etc., or optionally it can be stored using strongly typed API As new documents are being added and in-memory segments are
(so called “doc values”) that can use a more optimized storage being flushed to storage, periodically an index compaction
format. This kind of strongly typed storage is used for example to (merging) is executed in the background that reduces the total
store per-document and per-field weights (so called “norms”, as number of segments that comprise the whole index.
they typically correspond to field length normalization factor that Document deletions are expressed as queries that select (using
affects scoring). boolean match) the documents to be deleted. Deletions are also
accumulated, applied to the in-memory segments before flushing
5.2.3 Indexing Chain (while they are still mutable) and also recorded in a commit point
The resulting token stream is finally processed by the indexing so that they can be resolved when reading the already flushed
chain and the supported attributes (term value, position, offsets immutable segments.
and payload data) are added to the respective posting lists for each
term (Figure 3). Term values don’t have to be UTF-8 strings as in Each flush operation or index compaction creates a new commit
previous versions of Lucene - version 4.0 fully supports arbitrary point, recorded in a global index structure using a two-phase
byte array values as terms, and can use custom comparators to commit. The commit point is a list of segments and deletions
define the sorting order of such terms. comprising the whole index at the point in time when the commit
operation was successfully completed. Segment data that is being
Also at this stage documents are assigned their internal document flushed from in-memory segments is encoded using the
configured Codec implementation (see the section below).
In Lucene 3.x and earlier some segment data was mutable (for
example, the parts containing deletions or field normalization
weights), which negatively affected the concurrency of writes and
reads - to apply any modifications the index had to be locked and
it was not possible to open the index for reading until the update
operation completed and the lock was released.
In Lucene 4.0 the segments are fully immutable (write-once), and
any changes are expressed either as new segments or new lists of
deletions, both of which create new commit points, and the
updated view of the latest version of the index becomes visible
when a commit point is recorded using a two-phase commit. This
enables lock-free reading operations concurrently with updates,
and point-in-time travel by opening the index for reading using
some existing past commit point.
Figure 3 Indexing Process 5.3.2 The IndexReader Class
The IndexReader provides high-level methods to retrieve stored
identifiers, which are small sequential integers (for efficient delta
fields, term vectors and to traverse the inverted index data. Behind
compression). These identifiers are ephemeral - they are used for
the scenes it uses the Codec API to retrieve and decode the index
identifying document data within a particular segment, so they
data (Figure 1).
naturally change after two or more segments are merged (during
index compaction). The IndexReader represents the view of an index at a specific
point in time. Typically a user obtains an IndexReader from either
5.3 Incremental Index Updates a commit point (where all data has been written to disk), or
Indexes can be updated incrementally on-line, simultaneously directly from IndexWriter (a “near-realtime” snapshot that
with searching, by adding new documents and/or deleting existing includes both the flushed and the in-memory segments).
ones (sub-document updates are a work in progress). Index As mentioned in the previous section, segments are immutable so
extents are a common way to implement incremental index the deletions don’t actually remove data from existing segments.
updates that don’t require modifying the existing parts of the Instead the delete operations are resolved when existing segments
index [19]. are open, so that the deletions are represented as a bitset of live
When new documents are submitted for indexing, their fields (not deleted) documents. This bitset is then used when
undergo the process described in the previous section, and the enumerating postings and stored fields and during search to hide
resulting inverted and non-inverted data is accumulated in new in- deleted documents. Global index statistics are not recalculated, so
memory index extents called “segments” (Figure 2), using a they are slightly wrong (they include the term statistics of postings
compact in-memory representation (a variant of Codec - see that belong to deleted documents). For performance reasons the
below). Periodically these in-memory segments are flushed to a data of deleted documents is actually removed only during
persistent storage (using the Codec and Directory abstractions), segment merging, and then also the global statistics are
whenever they reach a configurable threshold - for example, the recalculated.
total number of documents, or the size in bytes of the segment.
The IndexReader API follows the composite pattern: an
5.3.1 The IndexWriter Class IndexReader representing a specific commit point is actually a list
The IndexWriter is a high-level class responsible for processing of sub-Readers for each segment. Composed IndexReaders at
index updates (additions and deletions), recording them in new different points in time share underlying subreaders with each
segments and creating new commit points, and occasionally other when possible: this allows for efficient representation of
triggering the index compaction (segment merging). It uses a pool multiple point-in-time views. An extreme example of this is the
of DocumentWriter-s that create new in-memory segments.

Twitter search engine, where each search operation obtains a new several modern codecs, including PForDelta, Simple9/16/64 (both
IndexReader [20]. likely to be included in Lucene 4.0) and VSEncoding [26], and
experimenting with other representations for the term dictionary
5.4 Codec API (e.g. using Finite State Transducers).
While Lucene 3.x used a few predefined data coding algorithms (a
The Codec API opens up many possibilities for runtime
combination of delta and variable-length byte coding), in Lucene
manipulation of postings during writing or reading (e.g. online
4.0 all parts of the code that dealt with coding and compression of
pruning and sharding, adding Bloom filters for fail-fast lookups
data have been separated and grouped into a Codec API.
etc.), or to accommodate specific limitations of the underlying
This major re-design of Lucene architecture has opened up the storage (e.g. Appending codec that can work with append-only
library for many improvements, customizations and for filesystems such as Hadoop DFS).
experimentation with recent advances in inverted index
compression algorithms. The Codec API allows for complete 5.4.3 Directory API
customization of how index data is encoded and written out to the Finally, the physical I/O access is abstracted using the Directory
underlying storage: the inverted and non-inverted parts, how it’s API that offers a very simple file system-like view of persistent
decoded for reading and how segment data is merged. The storage. The Lucene Directory is basically a flat list of “files”.
following section explains in more detail how inverted data is Files are write-once, and abstractions are provided for sequential
represented using this API. and random access for writing and reading of files.
This abstraction is general enough and limited enough that
5.4.1 A 4-D View of the Inverted Index implementations exist both using java.io.File, NIO buffers, in
The Codec API presents inverted index data as a logical four-
memory, distributed file systems (e.g. Amazon S3 or Hadoop
dimensional table that can be traversed using enumerators. The
HDFS), NoSQL key-value stores and even traditional SQL
dimensions are: field, term, document, and position - that is, an
databases.
imaginary cursor can be advanced along rows and columns of this
table in each dimension, and it supports both “next item” and 5.5 SEARCHING
“seek to item” operations, as well as retrieving row and cell data Lucene’s primary searching concerns can be broken down into a
at the current position. For example, given a cursor at field f1 and few key areas, which will be discussed in the following
term t1 the cursor can be advanced along this posting list to the subsections: Lucene’s query model, query evaluation, scoring and
data for document d1, where the in-document frequency for this common search extensions. We’ll begin by looking at how
term (TF) can be retrieved, and then positional data can be iterated Lucene models queries.
to retrieve consecutive positions, offsets and payload data at each
position within this document. 5.5.1 Query Model and Types
This level of abstraction is sufficient to not only support many Lucene does not enforce a particular query language: instead it
types of query evaluation strategies, but to also clearly separate uses Query objects to perform searches. Several Queries are
how the underlying data structures should be organized and provided as building blocks to express complex queries, and
encoded and to encapsulate this concern in Codec developers can construct their own programmatically or via a
implementations. Query Parser.
Query types provided in Lucene 4.0 include: term queries that
5.4.2 Lucene 4.0 Codecs evaluate a single term in a specific field; boolean queries
The default codec implementation (aptly named “Lucene40”) uses (supporting AND, OR and NOT) where clauses can be any other
a combination of well-known compression algorithms and Query; proximity queries (strict phrase, sloppy phrase that allows
strategies selected to provide a good tradeoff between index size for up to N intervening terms) [40, 41]; position-based queries
(and related costs of I/O seeks) and coding costs. Byte-aligned (called “spans” in Lucene parlance) that allow to express more
coding is preferred for its decompression speed - for example, complex rules for proximity and relative positions of terms;
posting lists data uses variable-byte coding of delta values, with wildcard, fuzzy and regular expression queries that use automata
multi-level skip lists, using the natural ordering of document for evaluating matching terms; disjunction-max query that assigns
identifiers, and interleaving of document ID-s and position data scores based on the best match for a document across several
[36]. For frequently occurring very short lists (according to the fields; payload query that processes per-position payload data, etc.
Zipf’s law) the codec switches to using the “pulsing” strategy that Lucene also supports the incorporation of field values into
inlines postings with the term dictionary [19]. The term dictionary scoring. Named “function queries”, these queries can be used to
is encoded using a “block tree” schema that uses shared prefix add useful scoring factors like time and distance into the scoring
deltas per block of terms (fixed-size or variable-size) and skip model.
lists. The non-inverted data is coded using various strategies, for
example per-document strongly typed values are encoded using This large collection of predefined queries allows developers to
fixed-length bit-aligned compression (similar to Frame-of- express complex criteria for matching and scoring of documents,
Reference coding), while the regular stored field data uses no in a well-structured tree of query clauses.
compression at all (applications may of course compress Typically a search is parsed by a Query Parser into a Query tree,
individual values before storing). but this is not mandatory: queries can also be generated and
The Lucene40 codec offers, in practice, a good balance between combined programmatically. Lucene ships with a number of
high performance indexing and fast execution of queries. Since different query parsers out of the box. Some are based on JavaCC
the Codec API offers a clear separation between the functionality grammars while others are XML based. Details on these query
of the inverted index and the details of its data formats, it’s very parsers and the framework is beyond the scope of this paper.
easy in Lucene 4.0 to customize these formats if the default codec
is not sufficient. The Lucene community is already working on

5.5.2 Query Evaluation
When a Query is executed, each inverted index segment is processed sequentially for efficiency: it is not necessary to operate on a merged view of the postings lists. For each index segment, the Query generates a Scorer: essentially an enumerator over the matching documents with an additional score() method.
Scorers typically score documents with a document-at-a-time (DAAT) strategy, although the commonly used BooleanScorer sometimes uses a term-at-a-time (TAAT)-like strategy when the number of terms is low [27].
Scorers that are "leaf" nodes in the Query tree typically compute the score by passing raw index statistics (such as term frequency) to the Similarity, which is a configurable policy for term ranking. Scorers higher up in the tree usually operate on sub-scorers; e.g., a disjunction scorer might compute the sum of its children's scores. Finally, a Collector is responsible for actually consuming these Scorers and doing something with the results: for example, populating a priority queue of the top-N documents [42]. Developers can also implement custom Collectors for advanced use cases such as early termination of queries, faceting, and grouping of similar results.

5.5.3 Similarity
The Similarity class implements a policy for scoring terms and query clauses, taking into account term and global index statistics as well as specifics of a query (e.g., distance between terms of a phrase, number of matching terms in a multi-term query, Levenshtein edit distance of fuzzy terms, etc.). Lucene 4 now maintains several per-segment statistics (e.g., total term frequency, unique term count, total document frequency of all terms, etc.) to support additional scoring models.
As a part of the indexing chain this class is responsible for calculating the field normalization factors (weights) that usually depend on the field length and arbitrary user-specified field boosts. However, the main role of this class is to specify the details of query scoring during query evaluation.
As mentioned earlier, Lucene 4 provides several Similarity implementations that offer well-known scoring models: TF/IDF with several different normalizations, BM25, Information-based, Divergence from Randomness, and Language Modeling.

5.5.4 Common Search Extensions
Keyword search is only a part of query execution for many modern search systems. Lucene provides extended query processing capabilities to support easier navigation of search results. The faceting module allows for browsing/drill-down capabilities, which are common in many e-commerce applications. Result grouping supports folding related documents (such as those appearing on the same website) into a single combined result. Additional search modules provide support for nested documents, query expansion, and geospatial search.

6. Open Source Engineering
Lucene's development is a collaboration of a broad set of contributors along with a core set of committers who have permission to actually change the source code hosted at the ASF. At the heart of this approach is a meritocratic model whereby permissions to the code and documentation are granted based on contributions (both code-based and non-code-based) to the community over a sustained period of time, and after being voted in by Lucene's Project Management Committee (PMC) in recognition of these contributions [3].
Development is undertaken as a loose federation of programmers coordinating development through the use of mailing lists, issue tracking software, IRC channels and the occasional face-to-face meeting. While all committers may veto someone else's changes, vetoes rarely happen in practice due to coordination via the communication mechanisms mentioned. Project planning is very lightweight and is almost always coordinated by patches to the code that demonstrate the desired feature to some degree, rather than by abstract discussions about potential implementations. Releases are the coordinated effort of a community-selected (someone usually volunteers) release manager and a group of other people who validate release candidates and vote to release the necessary libraries. Lucene developers also strive to make sure that backwards compatibility (breakages, when known, are explicitly documented) is maintained between minor versions and that all major version upgrades are able to consume the index of the last minor version of the previous release, thereby reducing the cost of upgrades.
Lucene developers are often faced with the need to make tradeoffs between speed, index size and memory consumption, since Lucene is used in many demanding environments (Twitter, for example, processes, as of Fall 2011, 250 million tweets and billions of queries per day, all with an average query latency of 50 milliseconds or less [20]). For instance, the default Lucene40 codec uses relatively simple compression algorithms that trade index size for speed; field normalization factors use an encoding that fits a floating point weight in a single byte, with a significant loss of precision but with great savings in storage space; and large data structures (such as the term dictionary and posting lists) are often accompanied by skip lists that are cached in memory, while the main data is retrieved in chunks and not buffered in the process's memory, relying instead on the operating system's disk buffers for efficient LRU caching.
Lucene 2, 3 and 4 have seen a significant effort to employ engineering best practices across the code base. At the center of these best practices is a test-driven development approach designed to ensure correctness and performance. For instance, Lucene has an extensive suite of tests (for example, as of 7/1/2012, Lucene has 79% test coverage on one sample run at https://builds.apache.org/job/Lucene-trunk/clover/) and benchmarking capabilities that are designed to push Lucene to its limits. These tests are all driven by a test framework that supports the de facto industry-standard notion of unit tests, but also an emerging focus on randomization of tests. The former approach is primarily used to test "normal" operation, while the latter, when run regularly (this happens many times throughout the day on Lucene's continuous integration system), is designed to catch edge cases beyond the scope of developers.
Since many things in Lucene are pluggable, randomly assembling these parts and then running the test suite uncovers many edge cases that are simply too cumbersome for developers to code up manually. For instance, a given test run may randomize the Codec used, the query types, the Locale, the character encoding of documents, the amount of memory given to certain subsystems, and much more. The same test run again later (with a different random seed) would likely utilize a different combination of implementations. Finally, Lucene also has a suite of tests for doing large-scale indexing and searching tasks. The results of these tests are tracked over time to provide better context for making decisions about incorporating new features or modifying existing implementations [24].
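As a small illustration of the Collector abstraction described in Section 5.5.2, the sketch below counts matching documents without keeping scores; it is written against the Lucene 4.x API and is illustrative only, not a recommended implementation:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public final class CountingCollector extends Collector {
      private int count = 0;

      @Override
      public void setScorer(Scorer scorer) throws IOException {
        // Scores are never read, so the Scorer can be ignored.
      }

      @Override
      public void collect(int doc) throws IOException {
        count++;  // called once per matching document in the current segment
      }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        // No per-segment state (such as a docBase offset) is needed here.
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return true;  // counting does not depend on document order
      }

      public int getCount() {
        return count;
      }
    }

A searcher consumes such a collector via IndexSearcher.search(query, collector).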

7. RETRIEVAL EVALUATION
At the time of this writing, the authors are not aware of any TREC-style evaluations of Lucene 4 (which is not unexpected, as it isn't officially released as of this writing), but Lucene has been used in the past by participants of TREC. Moreover, due to copyright restrictions on the data used in many TREC-style retrieval evaluations, it is difficult for a widespread open source community like Lucene's to effectively and openly evaluate itself using these approaches, because the community cannot reliably and openly obtain the content needed to reproduce results. This is a somewhat subtle point: it isn't that we as a community don't technically know how to run TREC-style evaluations (many have privately), but that we have decided not to take it on as a community because there is no reliable way to distribute the content to anyone in the community who wishes to participate (e.g., who would sign and fill out an organizational agreement such as http://lemurproject.org/clueweb09/organization_agreement.clueweb09.worder.Jun28-12.pdf on behalf of the community?), and therefore it is not an open process on par with the community's open development process. For instance, assume contributor A has access to a paid TREC collection and makes an improvement to Lucene that improves precision in a statistically significant way and posts a patch. How does contributor B, who doesn't have access to the same content, reproduce the results and validate or refute the contribution? See [28] for a deeper discussion of the issues involved. Some in the community have tried to overcome this by starting the Open Relevance Project (http://lucene.apache.org/openrelevance) but this has yet to gain traction. Thus, it is up to individuals within the community who work at institutions with access to the content to perform evaluations and share the results with the community. Since most in the community are developers focused on implementation of search in applications, this does not happen publicly very often. The authors recognize this is a fairly large gap for Lucene in terms of IR research, and one we hope can be remedied by working more closely with the research community in the future.
In the past, some individuals have taken on TREC-style evaluations. In [17], a modified Lucene 2.3.0 was used in the 1 Million Queries Track. In [29], an unmodified Lucene 3.0, in combination with query expansion techniques, was used in the TREC 2011 Medical Track. In [30], Lucene 1.9.1 was compared against a wide variety of open source implementations using out-of-the-box defaults. The impact of Lucene's boost and coordinate-level match on tf/idf ranking is studied in [43]. Many researchers use Lucene as a baseline (e.g. [44]), a platform for experimentation, or an example of an implementation of standard IR algorithms. For example, [45] used Lucene 2.4.0 in an "out of the box" configuration, although it is not clear to these authors what an out-of-the-box Lucene configuration is, since the community doesn't specify such a thing.

8. FUTURE WORK
The nature of open source is such that one never knows exactly what will be worked on in the future ("patches welcome" is not just a slogan, but a way of development -- the community often jumps on promising ideas that save time or improve quality, and these ideas often seemingly appear from nowhere). In general, however, the community focus at the time of this writing is on: 1) finalizing the 4.0 APIs and open issues for release, 2) additional inverted index compression algorithms (e.g. PFOR), 3) field-level updates (or at least updates for certain kinds of fields like doc-values and metadata fields), and 4) continued growth of higher-order search functionality like more complex joins, grouping, faceting, auto-suggest and spatial search capabilities. Naturally, there is always work to be done in cleaning up and refactoring existing code as it becomes better understood.
As important as the future of the code is to Lucene, so is the community that surrounds it. Building and maintaining community is and always will be a vital component of Lucene, just as keeping up with the latest algorithms and data structures is to the codebase itself.

9. CONCLUSIONS
In this paper, we presented both a historical view of Lucene as well as details on the components that make Lucene one of the key pieces of modern, search-based applications in industry today. These components extend well beyond the code and include an "Always Be Testing" development approach along with a large, open community collectively working to better Lucene under the umbrella that is known as The Apache Software Foundation.
At a deeper level, Lucene 4 marks yet another inflection point in the life of Lucene. By overhauling the underpinnings of Lucene to be more flexible and pluggable as well as greatly improving efficiency and performance, Lucene is well suited for continued commercial success as well as better positioned for experimental research work.

10. ACKNOWLEDGMENTS
The authors wish to thank all of the users and contributors over the years to the Apache Lucene project, with a special thanks to Doug Cutting, the original author of Lucene. We also wish to extend thanks to all of the committers on the project, without whom there would be no Apache Lucene: Andrzej Białecki, Bill Au, Michael Busch, Doron Cohen, Doug Cutting, James Dyer, Shai Erera, Erick Erickson, Otis Gospodnetić, Adrien Grand, Martijn van Groningen, Erik Hatcher, Mark Harwood, Chris Hostetter, Jan Høydahl, Grant Ingersoll, Mike McCandless, Ryan McKinley, Chris Male, Bernhard Messer, Mark Miller, Christian Moen, Robert Muir, Stanisław Osiński, Noble Paul, Steven Rowe, Uwe Schindler, Shalin Shekhar Mangar, Yonik Seeley, Koji Sekiguchi, Sami Siren, David Smiley, Tommaso Teofili, Andi Vajda, Dawid Weiss, Simon Willnauer, Stefan Matheis, Josh Bloch, Peter Carlson, Tal Dayan, Bertrand Delacretaz, Scott Ganyo, Brian Goetz, Christoph Goller, Eugene Gluzberg, Wolfgang Hoschek, Cory Hubert, Ted Husted, Tim Jones, Mike Klaas, Dave Kor, Daniel Naber, Patrick O'Leary, Andrew C. Oliver, Dmitry Serebrennikov, Jon Stevens, Matt Tucker, and Karl Wettin.

11. REFERENCES
[1] The Apache Software Foundation. The Apache Software Foundation. 2012. Accessed 6/23/2012. http://www.apache.org.
[2] Apache License, Version 2.0. The Apache Software Foundation. January 2004. Accessed 6/23/2012. http://www.apache.org/licenses/LICENSE-2.0
[3] How it Works. The Apache Software Foundation. Circa 2012. Accessed 6/24/2012. http://www.apache.org/foundation/how-it-works.html
[4] Interview with Walter Underwood of Netflix. Lucid Imagination. May 2009. Accessed 6/23/2012. http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-walter-underwood-netflix
[5] Instagram Engineering Blog. Instagram. January 2012. Accessed 6/23/2012. http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of

[6] Lucene Powered By Wiki. The Apache Software Foundation. Various. Accessed 6/23/2012. http://wiki.apache.org/lucene-java/PoweredBy/
[7] Apache Solr. The Apache Software Foundation. Accessed 6/23/2012. http://lucene.apache.org/solr.
[8] Interview with Doug Cutting. Lucid Imagination. Circa 2008. Accessed 6/23/2012. http://www.lucidimagination.com/devzone/videos-podcasts/podcasts/interview-doug-cutting
[9] Apache Subversion Initial Lucene Revision. The Apache Software Foundation. 9/18/2001. Accessed 6/23/2012. http://svn.apache.org/viewvc?view=revision&revision=149570
[10] Apache Subversion Lucene Source Code Repository. The Apache Software Foundation. Various. Accessed 6/23/2012. http://svn.apache.org/repos/asf/lucene/java/tags/
[11] Apache Subversion Lucene Source Code Repository. The Apache Software Foundation. Various. Accessed 6/23/2012. http://svn.apache.org/repos/asf/lucene/dev/tags/
[12] D. Cutting, J. Pedersen, and P. K. Halvorsen, "An object-oriented architecture for text retrieval," in Conference Proceedings of RIAO'91, Intelligent Text and Image Handling, 1991.
[13] V. I. Levenshtein (1966). "Binary codes capable of correcting deletions, insertions, and reversals". Soviet Physics Doklady 10: 707-710.
[14] Lucene 1.2 Source Code. The Apache Software Foundation. Circa 2001. Accessed 6/23/2012. http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_2_final/
[15] E. Hatcher, O. Gospodnetic, and M. McCandless. Lucene in Action. Manning, 2nd revised edition, August 2010.
[16] J. Pérez-Iglesias, J. R. Pérez-Agüera, V. Fresno, and Y. Z. Feinstein, "Integrating the Probabilistic Models BM25/BM25F into Lucene," arXiv.org, cs.IR, 26 Nov. 2009.
[17] D. Cohen, E. Amitay, and D. Carmel, "Lucene and Juru at TREC 2007: 1-million queries track," Proc. of the 16th Text REtrieval Conference, 2007.
[18] A Language Modeling Extension for Lucene. Information and Language Processing Systems. Accessed 6/30/2012. http://ilps.science.uva.nl/resources/lm-lucene
[19] D. Cutting and J. Pedersen. 1989. Optimization for dynamic inverted index maintenance. In Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '90), Jean-Luc Vidick (Ed.). ACM, New York, NY, USA, 405-411.
[20] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin, "Earlybird: Real-Time Search at Twitter."
[21] S. Robertson, S. Walker, and S. Jones, "Okapi at TREC-3," NIST SPECIAL, 1995.
[22] S. Clinchant and E. Gaussier. 2010. Information-based models for ad hoc IR. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.
[23] G. Amati and C. J. Van Rijsbergen, "Probabilistic models of information retrieval based on measuring the divergence from randomness," ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 357-389, 2002.
[24] Lucene Trunk Source Code. Revision 1353303. The Apache Software Foundation. 2012. Accessed 6/24/2012. http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/
[25] C. Zhai and J. Lafferty, "A study of smoothing methods for language models applied to ad hoc information retrieval," Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 334-342, 2001.
[26] F. Silvestri and R. Venturini. 2010. VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM international conference on Information and knowledge management (CIKM '10).
[27] D. R. Cutting and J. O. Pedersen, Space Optimizations for Total Ranking, Proceedings of RIAO'97, Computer-Assisted Information Searching on Internet, Quebec, Canada, June 1997, pp. 401-412.
[28] TREC Collection, NIST and Lucene. Apache Lucene Public Mail Archives. Aug. 2007. Accessed 6/30/2012.
[29] B. King, L. Wang, I. Provalov, and J. Zhou, "Cengage Learning at TREC 2011 Medical Track," Proceedings of TREC, 2011.
[30] C. Middleton and R. Baeza-Yates, "A comparison of open source search engines," 2007.
[31] C. J. van Rijsbergen. Information Retrieval, 2nd edition. Butterworths, 1979.
[32] The Xapian Project. Accessed 7/2/2012. http://www.xapian.org
[33] The Lemur Project. CIIR. Accessed 7/2/2012. http://www.lemurproject.org
[34] Terrier IR Platform. Univ. of Glasgow. Accessed 7/2/2012. http://www.terrier.org
[35] P. Boldi, "MG4J at TREC 2005," ... Text REtrieval Conference (TREC 2005), 2005.
[36] V. Anh and A. Moffat. Structured index organizations for high-throughput text querying. String Processing and Information Retrieval, 304-315, 2006.
[37] G. Salton, E. A. Fox, and H. Wu. Extended Boolean information retrieval. Commun. ACM 26, 11 (November 1983), 1022-1036.
[38] S. Mihov and D. Maurel. Direct Construction of Minimal Acyclic Subsequential Transducers. 2001.
[39] J. Daciuk and D. Weiss. Smaller Representation of Finite State Automata. In: Lecture Notes in Computer Science, Implementation and Application of Automata, Proceedings of the 16th International Conference on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
[40] Y. Rasolofo and J. Savoy. Term proximity scoring for keyword-based retrieval systems. In Proceedings of the 25th European Conference on IR Research (ECIR'03), Fabrizio Sebastiani (Ed.). Springer-Verlag, Berlin, Heidelberg, 207-218, 2003.
[41] S. Büttcher, C. Clarke, and B. Lushman. Term proximity scoring for ad-hoc retrieval on very large text collections. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 621-622, 2006.
[42] A. Moffat and J. Zobel. Fast Ranking in Limited Space. In Proceedings of the Tenth International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 428-437, 1994.
[43] L. Dolamic and J. Savoy. Variations autour de tf idf et du moteur Lucene. In: Actes des 9e Journées d'Analyse statistique des Données Textuelles (JADT 2008), 1047-1058, 2008.
[44] X. Xu, S. Pan, and J. Wan. Compression of Inverted Index for Comprehensive Performance Evaluation in Lucene. 2010 Third International Joint Conference on Computational Science and Optimization (CSO), vol. 1, 382-386, 2010.
[45] T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel, "Has adhoc retrieval improved since 1994?," presented at SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009.

Galago: A Modular Distributed Processing and Retrieval System
[A Retrospective]

Marc-Allen Cartright, Samuel Huston, and Henry Feild
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
irmarc,sjh,[email protected]

ABSTRACT
The open source IR community must address the new needs of the current search engine landscape. While it is still possible for an individual to perform effective research or run a small to moderate-sized search engine on a single machine, the scope of search engine applications has moved far beyond these parameters. The exciting new frontiers of information retrieval lie now at the extremes: either the system and available resources are far more constrained than a desktop (as in mobile phones and tablets), or resources are expected to be available in quantities orders of magnitude larger (as in web-scale systems).
To inform the decisions in designing the next generation of open source search engines (OSSEs), we present a retrospective assessment of the Galago search engine, an open source retrieval system developed at the University of Massachusetts Amherst. We have successfully deployed Galago over large clusters for both indexing and retrieval. At the other end of the spectrum, we have also successfully installed Galago on Android-based smart-phones and tablets, providing search capabilities over the personal data — tweets, social media posts, blog feeds, emails, texts, browsing history, etc. — stored on one's cell phone.
These experiences have provided us with information that we feel is essential to communicate to all potential designers of open source search engines. In this paper, we discuss the aspects of Galago that we believe are worthy of carrying forward into the next generation of open source retrieval systems. Conversely, we also discuss the roadblocks encountered, both in terms of adoption by the larger research community and the difficulties in learning to use the system effectively. We hope that this retrospective will inform the architects of the next generation of open source retrieval systems.

Categories and Subject Descriptors
H.3.4 [Information Systems]: Information Storage and Retrieval - Systems and Software [Distributed Systems]; H.3.0 [Information Systems]: Information Storage and Retrieval - General

General Terms
Information Retrieval, Distributed Indexing

Keywords
Galago, TupleFlow, retrieval system, search engine

1. INTRODUCTION
The Galago search engine (http://www.lemurproject.org/galago.php) is currently being developed at the University of Massachusetts Amherst as a generational successor to Indri (http://www.lemurproject.org/indri/). Indri emphasized two important factors: 1) the union of language models and inference networks, and 2) processing speed. This model worked extremely well for its contemporary generation of research, and many groups used the software to produce a large body of published research. Galago has been designed with different goals in mind, to react to the next generation of research needs: 1) interoperation with a distributed processing environment, and 2) a modular, flexible processing model that allows drop-in components in virtually every step of the score calculation during retrieval. While we believe Galago has met these goals, to date Galago has not received the widespread adoption that Indri has. In this paper we take a look back at our own experiences with Galago in an attempt to learn as much as we can about the good, the bad, and the hopeful aspects of Galago.
We present our assessment as follows. When discussing positive aspects of Galago that we believe should be carried forward to the next generation of OSSEs, we present each as an affordance (Wikipedia states an affordance is a quality of an object, or environment, which allows an individual to perform an action; we use that definition here). We then discuss issues we encountered when using Galago as two-part assessments. We begin with a problem statement, which describes the specific issue we encountered with the system. We conclude the issue with the lesson: the general rule or observation we hypothesize from our specific instance. We hope that this information will aid future system implementors by helping them to evolve the nascent affordances we found, and avoid the pitfalls we encountered.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR 2012 Workshop on Open Source Information Retrieval, August 16, 2012, Portland, Oregon, USA.
Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00.

2. AFFORDANCES OF GALAGO
Despite the lack of widespread adoption, we believe Galago is a powerful retrieval system that emphasizes several elements that all future systems would do well to have. We focus on these elements here, and provide evidence in support of each claim.

2.1 Scoring Model Representation
Galago continues the use of a tree-based model from Indri; however, several important changes make Galago's implementation much more powerful than Indri's. The Inference Network model described by Turtle and Croft [15] and implemented in Indri provides a clean graphical way of describing a retrieval model. Additionally, it does so in a purely declarative way—the nodes in a query tree describe what they represent, but not how to materialize that information at retrieval time. Indri implemented this framework in a more formalized way by combining the Inference Network with Language Models. This proved to be a successful combination, as Indri is still in use as an active research system today, over 8 years after its initial development.
However, several issues limit the capabilities of Indri. The query language is difficult to update dynamically; therefore end users are limited to the constructs already defined in the language. Additionally, using the Inference Network requires adherence to a probabilistic interpretation of scoring documents. Many retrieval models do not produce values that can be considered probabilistic (the vector space model is an obvious example of this situation). Implementing these functions is not feasible without significant change to the code base and a thorough understanding of the scoring pipeline in Indri.
Galago solves these issues by generalizing away from a specific philosophy to a more general notion of a query tree. The only restriction in this model is that upon evaluation for a particular document, the tree reduces to a final value that is produced at the root node of the tree. Figure 1 shows the simple query hubble telescope achievements in the query tree representation. The #combine node at the top, when evaluated, produces a scalar value based on its parameters and the current document. We now discuss two powerful ideas that form the core of the query tree model: operators and traversals.

[Figure 1: A simple query, represented as a tree: a #combine root over #feature:dirichlet:mu=1500 nodes for the terms hubble, telescope and achievements. The middle layer of feature nodes in this query tree each convert frequency information about a term into a Dirichlet-smoothed probability.]

Operators
An operator is a function over child nodes in the query tree that can produce a scalar upon evaluation of a document. In Figure 1, the only operator shown is #combine, which performs a linear combination of the child nodes of the operator. This may seem like a simple act; however, when generalized, the use of operators in this recursive manner means that, given the proper operators, we can represent any arbitrarily complex function. In practice, we use operators to implement smoothing and scoring functions over raw terms, combine scores, and implement boolean match operations (and filtering and negated filtering operations). As we will see in the next section, operators can also work in conjunction with traversals to perform transformations across the entire tree to represent larger operations.

Traversals
A traversal is an operation over the query tree that transforms the tree in some way. Galago internally uses traversals extensively to annotate its query tree to prepare it for processing, check the correctness of submitted queries, optimize query execution [2], and rewrite the query tree, to name a few functions.
Operators and traversals are useful in isolation, but when you combine them together, you can implement highly expressive language constructs in a simple way. As a straightforward example, Figure 2 depicts the transformation of a small query tree under the #sdm operator. In this case the operator serves as a placeholder indicating that the SDM-Traversal should expand the contained query using the Sequential Dependence Model described by Metzler and Croft [9]. The decomposed view of retrieval models afforded by query trees, in conjunction with operators and traversals, creates a powerful mechanism for implementing retrieval models very efficiently. We have additionally used combinations of operators and traversals to implement the Relevance Model [8], the field-based PRMS model [7], BM25 scoring [11] and its field-based variant [10], to name a few of the implemented models. Each model was simple to implement and test in Galago, and is now part of the standard distribution of the system.
In this way, we can encapsulate a well-defined model in a shorthand form in the query language. A similar idea, known as options, has been a popular notion in the reinforcement learning community for over a decade [14]. An option is created to encapsulate a chain of actions that the system has deemed useful enough to treat as a primitive action. This allows increasing abstraction as the system progresses. In a similar fashion, as new operators and traversals are added to Galago, the query language can grow to include higher-level concepts as they are deemed useful enough to add.

2.2 Generalization of Distributed Processing
Galago comes packaged with its own distributed processing system, called TupleFlow. TupleFlow can be thought of as a MapReduce system, in that every process can consist of a map or reduce operation. The most well-known open-source implementation of MapReduce is Hadoop, maintained now by the Apache Software Foundation (https://hadoop.apache.org/mapreduce/). Hadoop has grown to be a field-tested implementation that has been scaled to clusters of several thousand machines to simultaneously support dozens of online users (http://research.yahoo.com/news/3374). However, one place where we considered TupleFlow to far surpass Hadoop MapReduce was the option of multiple inputs and outputs for a processing stage. Hadoop has excellent support for single-stream inputs and outputs to processing stages, but adding even a single extra stream as input to the system can prove to be a test of patience. Consequently, implementing an ordered join of two or more streams, a popular operation in data processing, is an onerous task even for experienced Hadoop users.
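To make the operator/traversal pattern of Section 2.1 concrete, here is a deliberately simplified sketch. The class and operator names are hypothetical and do not correspond to Galago's actual API, but the shape mirrors the #sdm rewrite depicted in Figure 2:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // A query-tree node: an operator name plus child nodes (terms are leaves).
    final class Node {
      final String operator;          // e.g. "combine", "sdm", "od", "uw", "text:hubble"
      final List<Node> children;
      Node(String operator, Node... children) {
        this.operator = operator;
        this.children = new ArrayList<>(Arrays.asList(children));
      }
    }

    // A traversal rewrites a tree before evaluation. This one expands an "sdm"
    // node into a weighted combination of unigram, ordered-window and
    // unordered-window sub-queries, in the spirit of Figure 2.
    final class SdmTraversal {
      Node rewrite(Node n) {
        if (!n.operator.equals("sdm")) {
          return n;
        }
        Node unigrams  = new Node("combine", n.children.toArray(new Node[0]));
        Node ordered   = new Node("combine");
        Node unordered = new Node("combine");
        for (int i = 0; i + 1 < n.children.size(); i++) {
          Node a = n.children.get(i);
          Node b = n.children.get(i + 1);
          ordered.children.add(new Node("od", a, b));    // adjacent ordered pair
          unordered.children.add(new Node("uw", a, b));  // unordered window pair
        }
        // The weight annotation stands in for the 0.8 / 0.15 / 0.05 weights of Figure 2.
        return new Node("combine(0.8,0.15,0.05)", unigrams, ordered, unordered);
      }
    }

Galago realizes the same idea with its own query-node and traversal classes; the point of the sketch is only that the rewrite is a plain tree-to-tree function that can be plugged in independently of scoring.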

[Figure 2: The expansion of the Sequential Dependence Model using a traversal. The SDM-Traversal rewrites #sdm over the terms hubble, telescope and achievements into a weighted #combine (weights 0.8, 0.15, 0.05) of unigram, ordered-window (#od) and unordered-window (#uw) sub-queries. The layer of feature nodes in this query tree each convert frequency information about a term or a window into a Jelinek-Mercer-smoothed probability.]

[Figure 3: An example of indexing a collection using TupleFlow (generated by Trevor Strohman [12]). The reproduced graph (Figure 3.15 of the source, a TupleFlow computation graph for building a traditional, positions-based text index) runs from a FileSource through parsing steps (Parser, Tokenizer, Stopper, Stemmer) and extractor/sorter steps for postings, word counts, fields and document names, into document-numbering and renumbering stages, and finally into the writer stages that produce the document lengths, extent lists, manifest and positions lists. Small boxes are steps, large boxes are stages, and gray boxes indicate stages that can be replicated. The reproduced page also notes that a stage may be given either an error checkpoint file or no checkpoint at all, which can save substantial time when developing a new kind of stage, provided code changes do not alter the results of computation that has already completed.]

Conversely, using multiple streams in TupleFlow requires indicating the extra connections in the configuration, and opening the stream in the processing stage, which requires only a single function call with the pipe name. Figure 3 shows the original indexing pipeline of Galago. The innermost boxes are steps, which are enclosed in stages. A single stage is run on a single machine. Shaded stages are replicated, meaning many instances of the same stage, with different input, are executed at the same time. A full explanation of the pipeline is beyond the scope of this paper. However, one can immediately see that several distinct stages can execute independently provided the prior input stage has completed. TupleFlow can analyze this dependence graph and execute these stages as soon as they are ready. A standard Hadoop implementation would require manual ordering of these stages, which would typically run serially without programmer intervention. After several years of experience with TupleFlow, we all agree that moving towards a general data processing model is beneficial for code reuse, higher-level reasoning, and processing.
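The dependence-driven execution just described can be pictured with a small sketch; the Stage and Scheduler classes below are hypothetical and are not TupleFlow's actual interfaces. A real scheduler would submit all ready stages to the cluster in parallel and assumes the dependence graph is acyclic:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class Stage {
      final String name;
      final Set<String> inputs;   // names of stages this stage depends on
      Stage(String name, Set<String> inputs) {
        this.name = name;
        this.inputs = inputs;
      }
    }

    final class Scheduler {
      // Launches every stage whose inputs have completed; repeats until all stages are done.
      void run(List<Stage> stages) {
        Set<String> completed = new HashSet<>();
        while (completed.size() < stages.size()) {
          for (Stage s : stages) {
            if (!completed.contains(s.name) && completed.containsAll(s.inputs)) {
              execute(s);           // in a real system: submit the stage to the cluster
              completed.add(s.name);
            }
          }
        }
      }

      void execute(Stage s) {
        System.out.println("running stage " + s.name);
      }
    }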

Trends in industry seem to agree — these ideas are being implemented as well in large-scale data processing systems, such as MR2 (http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/), Spark (http://www.spark-project.org/) and Flume (https://cwiki.apache.org/FLUME/). Although TupleFlow has not been widely adopted to date, we are fully aware of its capabilities, and it is clear to us that the idea of a higher-order distributed processing paradigm is essential to efficient research in the future of IR.

2.3 Pluggable Components
As mentioned previously, one of the original goals while building Galago was to allow users to easily extend the functionality. While not all components can be easily extended, many can, including parsing new corpus formats, query operators, query tree traversals, scoring regimes, and stream processing steps with TupleFlow. Using Java, developing pluggable components is easy since extensions can be packaged in completely separate archive (JAR) files. For example, to run Galago with a user-defined query operator, the extension's JAR file is placed on the Java class path and the user tells Galago which class to associate with the operator at run time—and that's it. As all of the code is developed in Java, we have a guarantee that an external component developed elsewhere will run as intended on any system. In our own experience, external development has often provided an excellent development path for new components that we did not yet want to include in the main distribution. New components are developed and tested during research, when code is often not at its best. After the research is complete, we can assess the utility of the new components and decide if we want to include them in the trunk of the source code. Oftentimes integration into the main trunk provides a good opportunity to refactor the code into a more suitable form as well.
Finally, pluggable components allow users to contribute standalone extensions—not patches that need to be applied to the core code base—that can then be made publicly available and used with other extensions. Likewise, entire distributed processing programs can be made without the need to modify any of the core TupleFlow code, just as with Hadoop.

3. PROBLEMS ENCOUNTERED
In this section, we reflect on some of the issues we encountered while developing Galago and the lessons that we learned from the experience. Our hope is that these lessons apply not just to Galago, but to OSSEs in general.

3.1 Steep Learning Curve
Problem: Learning to use Galago was often difficult and confusing. Several members of the CIIR have used both TupleFlow and Galago extensively in their research [1, 2, 3, 4, 5, 6, 13]. All users find the system useful and can effectively implement new components to add to the system in a short amount of time, oftentimes within an afternoon. Unfortunately, the road to reach this point of expertise was long and complicated. The first two users spent almost a year learning the nuances of the system before they could effectively use it in research. Later adopters required less time as the early adopters were able to communicate the important aspects of the system effectively, saving new users several months of fumbling through a labyrinth of code. Had the system been better documented, we believe the early adopters would have saved weeks, if not months, of time learning the details of Galago and TupleFlow.
Lesson: Thorough documentation of the system is crucial to the success of future systems. Successful systems are often accompanied by copious amounts of documentation. A prime example of this model is the Hadoop MapReduce open-source implementation. Hadoop MapReduce is a complex system; however, numerous individuals and organizations spent significant effort in documenting the system, both in providing code examples and in reference texts explaining the important parts of the system. Without this documentation, it is unclear how many people would have had the spare time to learn to use such a sophisticated framework.

3.2 System Performance Analysis Problems
Problem: A VM complicates system performance analysis. Indri is written in C++. The system fully compiles from source to machine code, making runtime execution very fast, and allowing for direct management of allocated memory. These are clear advantages when a researcher is concerned with system performance. However, Galago was designed for modularity and extension. C++ is powerful, but it is also a difficult language to master, and may even have different behaviors on different machines depending on the architecture and compiler used.
To avoid these problems, Java is the language of choice for Galago. In many ways, it proved to be the right choice. At the time, C++03 was in sore need of an update, and any modification of Indri proved to be torturous for any individual not intimate with most of the code base. Java removed the need for header files and moved the focus away from managing explicit pointers to implementing retrieval models and better design of processing algorithms.
However, several researchers at CIIR have shown interest in the efficiency of retrieval systems, and Galago has proven to be a difficult system to deal with in this regard. Several procedures in Java, such as auto-boxing of primitives and automatic garbage collection, have a significant impact on wall-clock measurement and measurement of memory usage. In several instances, we have encountered large 'bumps' in timing data that we later realized were due to the virtual machine (VM)'s garbage collector performing a sweep. This kind of systemic unpredictability is unacceptable from a systems measurement perspective.
In TupleFlow, the situation is not much better. Shuffling and sorting of large streams of data also suffer from overhead incurred in using the Java VM. For example, the immutability of Strings, and their placement in the permGen memory pool, required us to use our own string pooling mechanism to avoid exhausting memory too quickly. In a similar example, Hadoop MapReduce provides its own implementation of most of the boxed Java primitives in order to increase serialization and deserialization efficiency.
Lesson: Implementation language may inadvertently define a system's emphasis. The case of Galago shows two competing tensions in the research arena: efficiency and systems researchers prefer the low-level control afforded by a language such as C++, whereas researchers concerned with retrieval models (including learning-to-rank and users of external data sources) tend to prefer working in higher-level languages, where they can ignore issues such as memory management or compression and instead focus on the formulation of their respective scoring functions.
The choice of Java was relevant at the time due to limitations inherent in C++, and it seemed to provide a release from the purgatory of managing pointers and complicated inline functions in Indri. However, this has also come at the price of control over several components of the system, and has made optimization of Galago more difficult. Ultimately, the choice of implementation language should be weighed against the main priority of the system.

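One mitigation for the timing 'bumps' described in Section 3.2 is to record how much time the collector consumed during a measurement interval. The sketch below uses the standard java.lang.management API and is illustrative only; runQueryBatch is a hypothetical placeholder for the workload being measured:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public final class GcAwareTimer {
      private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
          total += Math.max(0, gc.getCollectionTime());  // -1 means "unsupported"
        }
        return total;
      }

      public static void main(String[] args) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();

        runQueryBatch();  // hypothetical workload being measured

        long elapsedMillis = (System.nanoTime() - start) / 1000000L;
        long gcMillis = totalGcMillis() - gcBefore;
        System.out.printf("wall: %d ms, of which GC: %d ms%n", elapsedMillis, gcMillis);
      }

      private static void runQueryBatch() {
        // placeholder for the real indexing or retrieval workload
      }
    }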
If you intend to support extensibility and portability, Java is still an obvious choice, as many projects have shown. However, if your focus is on compression algorithms and indexing strategies, C++ provides a better platform for development.
Additionally, new languages, such as Scala (http://www.scala-lang.org/) and Go (http://golang.org/), should be considered in future implementations. Although this may cause a "yet-another-language" issue, new languages are often developed to address the shortcomings of their predecessors. For example, Scala compiles to JVM bytecode, allowing it to use Java components. Additionally, the syntax of Scala is much less verbose than Java's, and it even allows for rapid development of domain-specific languages. Future implementations of OSSEs may greatly benefit from the added capabilities of newer languages; however, the choice of language, in many ways, defines the emphasis of the system being built.

3.3 Software Fragility
Problem: Backwards incompatibility. Right now systems such as Indri and Galago have several backward compatibility issues at the index and internal API levels. The standard update cycle for Indri and Galago currently suggests you rebuild any index you want to use with the new version, as the old indexes are simply considered defunct. When TREC collections or corporate collections numbered in the hundreds of thousands, or even into the low millions, of documents, this was merely a tedious inconvenience. However, asking an end user to rebuild a CLUE-sized index as a matter of process may well be unreasonable to many, and may even be impossible for those without the necessary resources. Additionally, changes to the internal API, which mostly affect plug-and-play systems like Galago, often impact any extensions users have created and render them useless until they are updated to the new API.
Lesson: Design assuming that change is imminent. The internal mechanisms that interact with indexes should be capable of handling some amount of backwards compatibility. Lucene (http://lucene.apache.org/), for example, guarantees that all index file formats are backwards compatible, preventing users from being forced to re-index collections. In an even larger-scale example, protocol buffers, the data interchange format used most heavily at Google (http://code.google.com/p/protobuf/), were specifically designed for changes to occur to the definitions of the generated classes. As long as the changes are only additive, protocol buffers are guaranteed to be backwards compatible as well. Concerning the internal API, the best solution is to establish a standard that is sufficiently general that the details behind the API can change without the need to adjust the API itself, while specific enough to allow users the necessary level of control within their extensions. While changes to the API cannot always be avoided, this will at least minimize the impact of small changes.
Problem: Difficult to extend. A major drawback of a system like Indri is the difficulty one encounters when attempting to add new functionality, such as a state-of-the-art retrieval model. While Indri's C++ implementation allows for tight memory control and fast single-processor retrieval, adding additional functionality requires rooting around the internals, getting your hands dirty, and likely hitting many dead ends. What should take an hour can take days or weeks for a user unfamiliar with Indri's implementation. This is especially unpleasant given the necessity to explore new models in the fast-paced world of information retrieval research. As we have already mentioned, one of the goals of Galago was to offer extensibility. However, in many of the earlier forms of Galago, extensibility was not always as prevalent as we hoped. Many functionalities were difficult to add in a clean and modular way, such as certain types of operators, and index traversals such as passage or extent retrieval.
Lesson: Make modular extensibility a stronger focus. Retrospect also shows that some capabilities involve several axes, each of which should be designed for extension. Our canonical example is a user wanting to perform phrase-based retrieval over document passages. Passage scoring requires a change in the semantics of what a "document" is, while phrases require knowing the positions of terms in documents. The interaction of these two concepts provides an interesting implementation challenge; one that would have influenced the design of the original system.
When a user wants to add functionality to a retrieval system, it should be possible to do so easily and without modifying the core system. That way the core can be updated independently of the extension. Part of the issue we encountered with Galago was not having the foresight to make certain components easily extendable. The key is to listen to what users want to extend but cannot. Rather than implement the desired functionality in the core, refactor the targeted component to be more modular and easily extended by users.
Problem: Different environments cause different problems. This problem plagued us in two different scenarios: at the distributed-processing, "web-scale" level, and at the highly constrained, "mobile-device" level. We discuss each instance in turn, both of which lead us to a larger verdict.
Distributed indexing and retrieval. There is a large area of research that is emerging around distributed indexing and information retrieval. Information retrieval has long been focused on the problem of sorting and storing huge amounts of textual data; therefore parallel scalability is becoming one of the most important concerns in an IR system. A key problem in developing a distributed OSSE is that distributed processing environments each make different assumptions about the resources available to a distributed process. This means that each assumption that a system makes will reduce the number of clusters that can run the software.
High-level systems, such as Spark (http://www.spark-project.org/), Pig (http://pig.apache.org/), and Hadoop, to name a few, provide high-level interfaces for processing data. They require that the data is read and streamed through a series of functions provided by the distributed system, and user-defined functions are called for each data element. These systems generally take over the job generation, submission and control aspects of distributed programming. However, the assumptions made in these general processing systems may not be optimal for an IR system. A secondary concern is that the measurement of parallel performance within systems like these cannot be tightly controlled.
Low-level systems, such as Grid Engine (http://gridscheduler.sourceforge.net/) and Mesos (http://incubator.apache.org/mesos/), provide low-level interfaces for running a set of programs on nodes within a cluster. In these systems, users must write code for job generation and control. The effect of node failure is a vital consideration when programming for these low-level systems. The storage of the data is also a major consideration for any distributed process. A centralized network-attached storage can easily become a bottleneck for large clusters. A distributed file system is more scalable, but can lead to a network bottleneck, with up to O(n²) simultaneous communication channels between n running jobs.
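As a sketch of the "design assuming change is imminent" lesson above, an index reader can dispatch on a version header written at index time; the format names and decoders here are hypothetical and do not describe Galago's or Lucene's actual file formats:

    import java.io.DataInput;
    import java.io.IOException;

    interface PostingsDecoder {
      int[] readPostings(DataInput in) throws IOException;
    }

    final class VByteDecoder implements PostingsDecoder {
      public int[] readPostings(DataInput in) throws IOException {
        return new int[0];  // stub: decode the original byte-aligned format (version 1)
      }
    }

    final class BlockPackedDecoder implements PostingsDecoder {
      public int[] readPostings(DataInput in) throws IOException {
        return new int[0];  // stub: decode the later block-packed format (version 2)
      }
    }

    final class IndexFormat {
      // Readers keep decoders for every format version ever shipped, so an old
      // index never has to be rebuilt just because the default format changed.
      static PostingsDecoder decoderFor(int version) {
        switch (version) {
          case 1: return new VByteDecoder();
          case 2: return new BlockPackedDecoder();
          default: throw new IllegalArgumentException("unknown index version " + version);
        }
      }

      static int[] read(DataInput in) throws IOException {
        int version = in.readInt();        // header written when the index was built
        return decoderFor(version).readPostings(in);
      }
    }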

In Galago we use the TupleFlow framework to generate jobs and provide submission control. A key problem of this system is that it assumes a centralized network-attached storage system, which avoids the O(n²) blowout of a distributed file system, but can cause a bottleneck when performing many parallel disk operations. It is also important to note that TupleFlow's assumption of job control makes implementing an interface or job translation layer to high-level distributed systems, such as Spark, Pig, or Hadoop, almost impossible. However, this same assumption allows TupleFlow to be easily extended to run on any cluster management software that allows direct submission of a series of binary or scripted jobs to be run in parallel.
Mobile phone deployment. When deploying Galago on an Android mobile phone platform, we encountered difficulty in even getting the system to operate correctly. Due to limitations in resources, mobile phones may only offer a subset of the standard API. In practice this meant that Galago did not have access to the full Java API when installed and executed on the Android JVM. Memory management and monitoring interfaces were not implemented in many early versions of the Android JVM. A crucial problem was that the Android environment replaces these unsupported API calls with no-op commands – this meant that compilation was possible, but execution would often produce errors from seemingly random, but dependent, sections of code.
Lesson: Be mindful of environmental assumptions. An OSSE must be careful about the assumptions it makes about the environment it will execute in. TupleFlow's assumption of a network-attached storage system directly limits several key parameters of the distributed processing space, such as 1) the number of parallel jobs, as the creation of too many jobs can overload the file server, and 2) the maximum number of concurrent open files, to name a couple. We believe that the best solution needs to appropriately abstract job control, data storage and transfer, and failure protection, to allow for maximum efficient scalability.
Conversely, when considering environments with limited resources, many of the decisions that aid the large-scale case are useless, or even detrimental. Libraries and routines must be heavily optimized to squeeze every cycle and byte possible out of the scarce resources. While we offer no grand-unifying solution to this scale problem, we know OSSE designers must always be aware of the possible substrates their system may be planted in.

4. LOOKING FORWARD
Now that we have discussed the perceived advantages and disadvantages of using Galago, we turn towards "wishlist" items for the next generation of OSSEs.

4.1 Unified Query Language
Each research retrieval system uses its own custom query language. For example, Indri supports a subset of INQUERY (http://www.ushmm.org/helpdocs/inquerylang.htm) queries in addition to several of its own, while Galago borrows from Indri, but differs in syntax and allows a more extensible formulation. Lucene and Terrier (http://terrier.org/) each have their own query syntax (although their syntaxes are quite similar to each other). Table 1 shows some examples of the syntax used across these OSSEs. The difference in syntax means that a query formatted for Galago will not work with Indri, Lucene, or Terrier, causing issues if a user wants to move from one retrieval system to another. One way around the incompatibility of query languages is to settle on a standard, unified query syntax for the common operators across retrieval systems, e.g., for the operation of searching for a set of ordered terms. However, since each system has its own unique capabilities, it is also necessary to allow any unified query language to be extensible.

System    Proximity      Boolean not
Galago    #uw10(a b)     #reject(#any(a) b)
Indri     #uw10(a b)     #not(a) b
Lucene    "a b"~10       -a b
Terrier   "a b"~10       -a b

Table 1: An example of the query syntax for finding terms within a given proximity and using boolean negation under different retrieval systems.

While we do not presume to have a solution to this issue now, we believe the issue warrants discussion among the participants of the OSSE community. Many other communities have greatly benefited from standardization of the expression of their common concepts; surely the information retrieval community would stand to gain by making a similar move.

4.2 External Data Services
A common theme in recent research is the use of external data sources in retrieval models. Sites like DBPedia (http://dbpedia.org/About), Freebase (http://www.freebase.com/), and the Open Directory Project (http://www.dmoz.org/) provide free access to semi-structured data that provides information beyond a solitary indexed collection. In the upcoming wave of next-generation OSSEs, these data sources should be viewed as a persistent service, accessible by any researcher or client organization. There are obvious advantages to establishing common APIs to make use of these sites as services, including:
Less experimental variation. If all researchers had equal access to a set of static data services, then we can exclude potential sources of variance, such as differences in data preparation, that can often significantly impact results.
Less repeated work. Currently multiple organizations have to perform their own data acquisition and preparation for different data services. These processes are often labor intensive, and preclude any research involving these data sources. A single point of access and curation for these services could keep everyone from repeatedly "reinventing the wheel".
Reduced maintenance burden. Maintaining the API to a single data source is not itself difficult, but having to keep each of the systems up and running presents a large maintenance overhead for any organization. In the case of a smaller research group or a start-up trying to break into a specific vertical of research, this overhead may be prohibitive. Spreading the maintenance work over several sites reduces the load on any single site, and certainly reduces wasted load due to unnecessary replication of maintenance.

4.3 Persistent Web-Scale Index
The ClueWeb project (http://lemurproject.org/clueweb09.php/) is a considerable step towards bringing modern-day web-scale collections to information retrieval researchers.

of space. On top of storage costs, it is simply not feasible to index 7. REFERENCES
a collection of that magnitude using a single machine. Even with [1] M.-A. Cartright, E. Aktolga, and J. Dalton. Characterizing
enough resources available to process the collection, indexing the the subjectivity of topics. In Proceedings of the 32nd
ClueWeb collection is not a trivial task, and future collections will international ACM SIGIR, SIGIR ’09, pages 642–643, 2009.
only require more time and resources to manage. [2] M.-A. Cartright and J. Allan. Efficiency optimizations for
As an alternative solution, we hope that the OSSE community interpolating subqueries. In Proceedings of the 20th ACM
would be willing to consider a crowdsourced-style solution, where CIKM, CIKM ’11, pages 297–306, 2011.
instead of the same enormous monolithic collection being managed
[3] M.-A. Cartright, H. Feild, and J. Allan. Evidence finding
by each organization individually, each organization can be
using a collection of books. In Proceedings of the 4th ACM
responsible for making some portion of the collection available
workshop on Online books, complementary social media and
to other organizations as a callable API or service. This would
crowdsourcing, BooksOnline ’11, pages 11–18, 2011.
provide the same benefits listed above, and each organization can
[4] J. Dalton, J. Allan, and D. A. Smith. Passage retrieval for
instead focus on providing high reliability to a manageable set of
incorporating global evidence in sequence labeling. In
documents, versus trying to simply complete the indexing process
Proceedings of the 20th ACM CIKM, CIKM ’11, pages
for themselves.
355–364, 2011.
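As a rough illustration of this service-oriented model, a client could issue the same query to several collection services and merge their answers. The endpoints, query parameter, and class names below are hypothetical; a real client would also need an agreed response format and a score-merging policy.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical federated client: each organisation exposes its portion of the
// collection behind the same (assumed) HTTP search API; the client fans the
// query out and would merge the per-site result lists.
public class FederatedSearchClient {
    public static void main(String[] args) throws Exception {
        String[] services = {
            "http://site-a.example.org/search",   // hypothetical service endpoints
            "http://site-b.example.org/search"
        };
        String query = URLEncoder.encode("open source information retrieval", "UTF-8");

        for (String service : services) {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(service + "?q=" + query).openConnection();
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                // A real client would parse each response and merge results by score;
                // here the raw per-service answers are simply printed.
                String line;
                while ((line = in.readLine()) != null) System.out.println(line);
            }
        }
    }
}
```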
[5] H. Feild, M.-A. Cartright, and J. Allan. The university of
5. CONCLUSIONS massachusetts amherst’s participation in the inex 2011 prove
it track. In S. Geva, J. Kamps, and R. Schenkel, editors,
The open source IR community needs to reach some level of Focused Retrieval of Content and Structure: 10th (INEX
agreement in several key areas in order to move into the next phase 2011), volume 7424 of LNCS. Springer, 2012.
of relevant research. In the past it was sufficient to perform experi-
[6] S. Huston, A. Moffat, and W. B. Croft. Efficient indexing of
ments in an isolated environment, using either a single machine or a
repeated n-grams. In Proceedings of the fourth ACM WSDM,
small cluster of machines specially purposed for the indexing task.
pages 127–136, 2011.
However, if the next generation of open source search systems are
to be relevant to clients and researchers alike, we must consolidate [7] J. Kim and W. B. Croft. Retrieval experiments using
effort towards agreed standards. Towards this effort, we hope our pseudo-desktop collections. In Proceedings of the 18th
experiences with Galago will provide valuable insight in the design CIKM, pages 1297–1306, 2009.
of the next generation of open source search engines. [8] V. Lavrenko and W. B. Croft. Relevance based language
Galago provides three components that we believe should be models. In Proceedings of the 24th SIGIR, pages 120–127,
standard elements of any next-generation open-source retrieval sys- 2001.
tem: 1) A query tree representation of the query language, with op- [9] D. Metzler and W. B. Croft. A markov random field model
erators and traversals that can be applied to the tree and composed for term dependencies. In Proceedings of the 28th annual
in order to produce more complex higher-level functions; 2) inte- international ACM SIGIR, SIGIR ’05, pages 472–479, 2005.
gration with a distributed processing environment, preferably one [10] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25
that allows for high-level operations; and 3) extensibility to the core extension to multiple weighted fields. In Proceedings of the
system. We believe the core of the system should serve as a skele- 13th CIKM, pages 42–49, 2004.
ton for plugging in components that can be used during indexing [11] S. E. Robertson and S. Walker. Some simple effective
and retrieval. It should be simple for an external user, with mini- approximations to the 2-Poisson model for probabilistic
mal knowledge of the internals, to extend the functionality of the weighted retrieval. In Proceedings of the 17th SIGIR, pages
core system. 232–241, 1994.
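To make the first of these three components concrete, the following is a minimal sketch of a query tree with composable traversals. It is a generic illustration rather than Galago's actual classes; all names and the stemming rewrite are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical query-tree node: an operator name plus child nodes.
class QueryNode {
    final String operator;                        // e.g. "combine", "text:dog"
    final List<QueryNode> children = new ArrayList<QueryNode>();
    QueryNode(String operator) { this.operator = operator; }
    QueryNode add(QueryNode child) { children.add(child); return this; }
}

// A traversal rewrites one tree into another; traversals can be chained
// to build higher-level functions out of simple ones.
interface Traversal {
    QueryNode rewrite(QueryNode node);
}

// Example traversal: wrap each bare term operator in a "combine" of the
// term itself plus a (hypothetical) stemmed variant.
class StemmingTraversal implements Traversal {
    public QueryNode rewrite(QueryNode node) {
        QueryNode copy = new QueryNode(node.operator);
        for (QueryNode child : node.children) copy.add(rewrite(child));
        if (node.operator.startsWith("text:")) {
            QueryNode expanded = new QueryNode("combine");
            expanded.add(copy).add(new QueryNode(node.operator + "~stem"));
            return expanded;
        }
        return copy;
    }
}
```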
Over the course of using and developing Galago, we also noted [12] T. Strohman. Efficient Processing of Complex Features for
several issues with the system that, if possible, should be avoided Information Retrieval. PhD thesis, University of
in future OSSE implementations. While the effort to make Galago Massachusetts Amherst, 2007.
“everything to everyone” is admirable, it resulted in many difficul- [13] T. Strohman and W. B. Croft. Efficient document retrieval in
ties that required redesigns of several components of the system, main memory. In Proceedings of the 30th annual
with still more improvements that could be made. We hope im- international ACM SIGIR, SIGIR ’07, pages 175–182, 2007.
plementors of future systems can learn from our experiences, and [14] R. S. Sutton, D. Precup, and S. Singh. Between mdps and
design a software system that addresses each of these issues well semi-mdps: a framework for temporal abstraction in
before they are forced to deal with them. reinforcement learning. Artif. Intell., 112(1-2):181–211,
Finally, we provide a “wish list” of ideas for the OSSE com- Aug. 1999.
munity. While these ideas are lofty, they would work towards the [15] H. Turtle and W. Croft. Evaluation of an inference
benefit of all involved parties, steering the focus away from the ever network-based retrieval model. ACM Transactions on
increasing, but necessary engineering and procedural overhead, and Information Systems (TOIS), 9(3):187–222, 1991.
back towards developing cutting-edge search products and seminal
research.

6. ACKNOWLEDGMENTS
This work was supported in part by the Center for Intelligent In-
formation Retrieval, in part by NSF grant #IIS-0910884, and in part
by NSF grant #CNS-0934322. Any opinions, findings and conclu-
sions or recommendations expressed in this material are those of
the authors and do not necessarily reflect those of the sponsor.

A Framework for Bridging the Gap Between Open Source
Search Tools

Madian Khabsa1 , Stephen Carman2 , Sagnik Ray Choudhury2 and C. Lee Giles1,2
1
Computer Science and Engineering
2
Information Sciences and Technology
The Pennsylvania State University
University Park, PA
[email protected], [email protected], [email protected], [email protected]

ABSTRACT searchable. These documents may be found on the local in-


Building a search engine that can scale to billions of docu- tranet, a local machine, or on the Web. In addition, these
ments while satisfying the needs of the users presents serious documents range in format and type from textual files to
challenges. Few successful stories have been reported so far multimedia files that incorporate video and audio. Expe-
[37]. Here, we report our experience in building YouSeer, a diently ranking the millions of results found for a query in
complete open source search engine tool that includes both a way that satisfies the end-user need is still an unresolved
an open source crawler and an open source indexer. Our problem. Patterson [37] provides a detailed discussion of
approach takes other open source components that have these hurdles.
been proven to scale and combines them to create a compre-
hensive search engine. YouSeer employs Heritrix as a web As such, researchers and developers have spent much time
crawler, and Apache Lucene/Solr for indexing. We describe and effort designing separate pieces of the search engine sys-
the design and architecture, as well as additional compo- tem. This led to the introduction of many popular search
nents that need to be implemented to build such a search engine tools including crawlers, ingestion systems, and in-
engine. The results of experimenting with our framework in dexers. Examples of such popular tools include: Apache
building vertical search engines are competitive when com- Lucene/Solr, Indri, Terrier, Sphinx, Nutch, Lemur, Heritrix,
pared against complete open source search engines. Our and many others. The creators of each tool have introduced
approach is not specific to the components we use, but in- software with multiple features and advantages, but each one
stead it can be used as generic method for integrating search of them has its own limitations. Since the problems faced by
engine components together. designing a crawler are quite different from what an indexer
developer would face, researchers have dedicated projects to
Categories and Subject Descriptors tackling certain parts of the search engine. Furthermore, the
same unit within the search engine (such as indexer) may in-
H.3.3 [Information Storage and Retrieval]: Information
troduce different challenges based on the application need.
Search and Retrieval - search process; H.3.4 [Information
Working on scaling an indexer to support billions of docu-
Storage and Retrieval]: Systems and Software
ments is different than creating a real time indexer, therefore
each use case has lead to a respective solution.
General Terms
Design, Documentation, Performance Few projects aim at building a complete open source search
engine that includes a web crawler, ingestion module, in-
Keywords dexer, and search interface. While complete search engine
Search Engines, Software Architecture, Open Source tools provide all the different pieces to run a search engine,
these pieces tend to be outperformed by task specific open
source search tools when compared against each other based
1. INTRODUCTION on the specific task only. For example, while Apache Nutch
In the past fifteen years, many search engines have emerged provides an entire search engine solution, Heritrix, which
out of both industry and academia. However, very few have is just a web crawler, is more powerful and versatile when
been successful [37]. There are a number of challenges [28]. compared solely against the Nutch crawler. This observa-
Firstly, the documents need to be collected before they are tion has led us to consider building a unified framework
where search engine components can be plugged in and out
to form a complete search engine.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are New projects are started every day to solve a specific prob-
not made or distributed for profit or commercial advantage and that copies lem for search engines, or to introduce new features. Like-
bear this notice and the full citation on the first page. To copy otherwise, to wise, many projects have been out of support and develop-
republish, to post on servers or to redistribute to lists, requires prior specific ment after being abandoned by the community. The level
permission and/or a fee.
SIGIR 2012 Workshop on Open Source Information Retrieval. August 16, of support and the richness of the features are what usually
2012, Portland, Oregon, USA. determines how prevalent an open source project is. We
The copyright of this article remains with the authors.

propose building a search engine framework that is modular 2. RELATED WORK
and component agnostic where different crawlers or indices
can be interchanged as long as they conform to a set of The use of search engines to find content goes back to the
standards. This modularity facilitates plugging components days of library search, where librarians have used and devel-
that satisfy the users' needs with minimal to no changes of oped the techniques of information retrieval to find content
the framework's code base. Under such a framework, powerful within books. These techniques have been carried out to
components can be included as they mature, and the com- the domain of web search, and enterprise search. Though
munity can focus on advancing specific parts of the search web search added the concept of web crawler, or spider, to
engine without worrying about building a complete search download the documents from the web, which can be used
engine. to discriminate the web search era from the previous era of
librarian search.
In this paper, we demonstrate how to exploit these freely
available tools to build a comprehensive search framework The open source community felt the urge for an alternative
for building vertical and enterprise search engines. We out- to the commercial search engines that are dominating the
line the architecture of the framework, and describe how market. Part of the need was to provide transparent solu-
each component contributes to building a search engine, and tions where users control the ranking of results and made
the standards and communication mechanisms between the sure they have not been manipulated. In addition, licensing
different components. We rely on agreed upon standards to the search services from these commercial search engines can
control the communication mechanisms between the differ- be expensive.
ent parts of the search engine in order to allow maximum
flexibility and reduce coupling between the components. For ht://Dig [7] is one of the early open source search tools which
the crawler, we assume that crawlers will save the crawled was created back in 1995 at San Diego State University.
documents in WARC format, which became the standard The project is not designed to scale for the needs of entire
for web archiving in 2009 [1]. Warc files are compressed files web indexing. Instead it’s designed to index content of few
that contain records about the metadata of the crawled doc- websites, or intranet. ht://Dig supported boolean and fuzzy
ument along with the documents itself. We implement the queries, and had the ability to strip text out of HTML tags.
middleware which is responsible for ingesting the crawled Currently the project is out of support, as the latest release
files, and pass them to the indexer over a REST API [26]. was back in 2004.
REST has become the defacto standard for accessing web
services, and in this case we assume that indices are provid- Apache Lucene [3] is considered one of the very popular
ing a REST API to communicate with the ingestion module search libraries which has been ported to multiple languages.
and with the query interface. This assumption is rational as It was originally written in Java by Doug Cutting back in
most indices end up providing some access mechanism over 2000, and later it became part of the Apache Software Foun-
HTTP to support distribution. dation. Though Lucene is not a search engine itself, it is
an indexing library which can be used to perform all the
The framework we are introducing here, YouSeer, not only indexing operations on text files. It can be plugged into any
can be used for building niche search engines but also for search application to provide the indexing functionality.
educational purposes. For the last three years YouSeer has
been piloted in an advanced undergraduate/graduate class Numerous search engines were developed on top of the Lucene
at Penn State designed to give students the ability to build library, including commercial and open source solutions. Nutch
high end search engines as well as properly reason around the [29, 24] was among the early search engines developed on top
various parts and metrics used to evaluate search engines. of lucene. It added a web crawler and a search interface and
Students in this class were tasked with finding a customer used the lucene indexing library to build a comprehensive
within the Penn State community, usually a professor or search engine. Nutch was later added to the Apache Soft-
graduate student, who is in need of a search engine. Students ware Foundation as a sub-project of Lucene. In develop-
then were required to build a complete search engine from ing Nutch, the developers have aimed at creating a scalable
concept to delivery. This included crawling, indexing and tool that can index the entire web. However, the largest
query tuning if necessary. This class has yielded many niche crawl ever reported on Nutch was 100 million documents
based search engines of varying levels of complexity all using [29, 34] despite supporting parallel crawling on different ma-
YouSeer as a software package. The level of technical know chines. In addition, Nutch added link analysis algorithms to
how in the class ranges from beginners with UNIX/Java to its ranking function to take into account the importance of
PhD level students in Computer Science/Engineering and the pages along with the relevancy.
Information Sciences and Technology.
Although Nutch provides a rich set of crawling features, many
The rest of this paper is organized as follows. Section 2 dis- other open source crawlers, i.e. Heritrix [5], provide far more
cusses related work and describes other open source search complex and advanced features. For example, the deciding
engines. Section 3 provides an overview of the architecture rules for accepting a file or rejecting it in the crawling process
of YouSeer, while Section 4 discusses the workflow inside are much more powerful in Heritrix than in Nutch. The ability to
the framework. In Section 5 we describe the experiments. take check points, pause and resume crawling, and restore
Finally, we conclude and identify areas of future work in the crawling process in case of failure are also advantages
Section 6. of Heritrix that Nutch lacks. In addition, Nutch obeys the
robots exclusion protocol, Robots.txt, and forces the users to
obey it without giving them the option to ignore it totally or

partially. Along with forcing minimum waiting time between ingest documents from the file system, TREC collections,
fetching attempts for files on the same domain, the crawling or archived files in a WARC format. So, the choices for a
process using Nutch may end up being slow. crawler to use with Indri are large.

Another popular distribution of Lucene is Apache Solr [4] Besides indexing web content and intranet documents, some
which provides enterprise search solution on top of Lucene. search engines were developed to index SQL databases. Pro-
Solr provides a RESTful like API on top of Lucene, so that viding full text search for DBMS content is important for
all communication with the index is done over HTTP re- enterprises with large databases, especially when the built
quests which makes Solr embed-able into any application in full text search is not fast enough. Sphinx [15] provides a
without worrying about the implementation. On top of that, full text search solution for SQL databases, as it comes with
Solr provides distributed searching, query caching, spell cor- connectors to many commercial databases. Besides connect-
rections, and faceted search capabilities. But as mentioned, ing to databases, Sphinx may be configured to index content
Solr is another search framework and not a complete search from XML files which were written in specific format. But,
engine solution. The main missing component is a web since Sphinx is aimed at indexing SQL databases, it can’t be
crawler. Though it can be plugged into Nutch as the back- considered as a complete search engine, and rather a SQL
ground indexer instead of Lucene. A query interface and indexing framework.
results page are also missing, as the shipped interface is for
testing purposes only. This is attributed to the fact that At RMIT university, researchers have developed Zettair [22],
Solr is not a standalone application, rather it’s a library or an open source search engine which is written in C. Zettair’s
framework which gets plugged into another application. main feature is the ability to index large amounts of text files
[31]. It’s been used to index 426GB of TREC terabyte track
NutchWAX [13] attempts to merge Nutch with the web collection, according to the official documentation page. On
archive extensions such that the solution will search the the flip side, Zettair can only deal with HTML and plain
web archive. Currently it only supports ARC files, though text files, thus lacking the feature of indexing rich media
WARC [1] files (standard format for archiving web content) files. Besides, it assumes the user has already crawled the
are easily converted to ARC. NutchWAX requires Hadoop files, and doesn’t provide any crawler out of the box.
platform [2] to run the jobs on it, as the tasks are imple-
mented using the Map/Reduce paradigm [25]. Swish-e [16] is another open source search engine which is
suitable for indexing small content, less than 1 million doc-
Since Lucene proved to be a scalable indexing framework, uments. It can be used to crawl the web via a provided
many open source search engine adopted it and built so- perl script, and index files from various data-types: PDF,
lutions on top of it. For example, Hounder [6] not only PPT ...etc. But since it can only support up to one million
capitalize on Lucene, but also on some modules of Nutch documents, scalability is not a feature of this search engine.
to provide large scale distributed indexing and crawling ser-
vices. The crawling and the indexing processes are easily As academia continued to contribute to open source search
configured, launched and monitored via GUI or shell script. engines, researchers at the University of Glasgow have intro-
However, the options that can passed to the crawler are lim- duced Terrier [17, 36, 35] as the “first serious answer in Eu-
ited compared to large scale crawlers like Heritrix, as most of rope to the dominance of the United States on research and
the configurations are regular expressions only. These reg- technological solutions in IR” [36]. Terrier provides a flex-
ular expressions are either entered into a simple java GUI, ible and scalable indexing system with implementations of
or appended to the numerous configuration files. The ex- many state-of-the-art information retrieval algorithms and
tendibility of the system is not an easy task as well. models. Terrier has proven to compete with many other
academic open source search engines, like Indri and Zettair,
Another popular search framework is Xapian [21], which is in TREC workshops [36]. To support large scale indexing,
built with C++ and distributed under GPL license. The Terrier uses MapReduce on Hadoop clusters to parallelize
framework is just an indexing library, but it ships with web the process. Interacting with Terrier is made easy for al-
site search application called Omega [14] that can index files most all users by providing both desktop and web interface.
of multiple formats. Though the Xapian framework doesn’t Nevertheless, as the case with many other open source search
contain a crawler, Omega can crawl files on the local files solutions, Terrier doesn’t ship with a web crawler, but it can
system only. be integrated with a crawler that was also developed at the
University of Glasgow: labrador [9].
Researchers at Carnegie Mellon University and University
of Massachusetts Amherst have developed Indri [40, 8] as a MG4J (Managing Gigabytes for Java) [23, 11] is a search
part of the Lemur project [33, 10]. Indri is a language model engine released under GNU lesser general public license.
based search engine that can scale for 50 million documents It’s being developed at the University of Milan, where re-
on a single machine and 500 million documents on a cluster searchers are plugging state-of-the-art algorithms and rank-
of machines. It supports indexing documents from differ- ing functions into it. The package lacks a fully-fledged web
ent languages, and multiple formats including PDF, Word, crawler, and relies on the user to provide the files, but it can
PPT. In addition to traditional IR models, Indri combines crawl files on the file system. Despite the numerous advan-
inference networks along with language models which makes tages of the system, ease of use seems to elude this search
it a unique solution when compared to other frameworks. engine.
However, Indri doesn’t contain a web crawler, and has to be
used along with a 3rd party crawler. Nevertheless, Indri can WebGlimpse [20] is yet another search engine, though it has

a different licensing model as it is free for students and open
source projects, but needs to be licensed for any other use.
It’s built on top of Glimpse indexer which generate indices
of very small size compared to the original text (2-4% of
the original text) [30]. WebGlimpse can index files on the
local filesystem, remote websites, or even crawl webpages
from the web and index them. But the crawler has limited
options which makes it incompetent to do a large scale web
crawl.

mnoGoSearch [12] is another open source search engine which


has versions for different platforms including Linux/Unix
and Windows. It is a database-backed search engine,
which implements its own crawler and indexer. Since it’s
dependent on the databases in the back end, the database
connectivity may become the bottleneck of the process in
Figure 1: YouSeer Architecture.
case the number of running threads exceeds the limit of con-
current open connections to the database. The configuration
options of the crawler are also limited compared to Heritrix
and Nutch. Besides, indexing rich media files like PDF and The framework is implemented in Java and the interfaces
Doc is not supported internally, though external plugins can are in JSP.
be used to convert these files.

When comparing open source search engines, many aspects 3.1 Crawler
are taken into consideration. These aspects include: com-
A web crawler is a software that downloads documents from
pleteness of the solution in terms of components (example:
the web and stores them locally. The process of downloading
some libraries don’t have crawlers), the scalability of the so-
is sequential where the crawler will extract the outgoing links
lution, the extendibility of the search engine, the supported
from every downloaded document and schedule these links
file types, license restrictions, support of stemming, stop
to be fetched later according to the crawling policy.
words removal, fuzzy search, index language, character en-
coding, providing snippets for the results, ranking functions,
The Internet Archive’s crawler, named Heritrix [32], was
index size compared to the corpus size, and query response
chosen as a web crawler for YouSeer. Heritrix serves as good
time.
example of embedding any web crawler into a search engine
since it dumps the downloaded documents to the hard disk
Many researchers had done work on comparing the perfor-
in the Warc format, which in 2009 became the standard for
mance of multiple open source search engines. Middleton
archiving web content[1]. By default, Heritrix writes the
and Baeza-Yates compared 29 popular open source search
downloaded documents into compressed ARC files, where
engines in [31]. Their comparison considered many of the
each file aggregates thousands of files. Compressing and ag-
aspects mentioned before, along with precision and recall
gregating the files is essential to keeping the number of files
performance results on TREC [18] dataset. However, their
in the file system manageable, and sustaining lower access
analysis is more focused on the indexing section of the search
time.
engine without considering the crawling process at all. In
fact, many of the libraries that they compare are only index-
Heritrix expects the seed list of the crawl job to be entered
ing libraries, and not complete search engines (i.e. Lucene).
as a text file along with another file that defines the crawling
Another experiment on indexing with open source search
policy. Then the crawler will proceed by fetching the URLs
libraries was performed in 2009 on data from Twitter [39].
in the seed list and write them to ARC/Warc files. This
This experiment was conducted on small data which doesn’t
process can be assumed to be the standard workflow of any
test the scalability of the indexer. Similar to the study by
web crawler, thus the integration of Heritrix can be used as
Middleton and Baeza-Yates [31], [39] doesn ot take into con-
example on how to integrate almost any web crawler into a
sideration the crawling task of the search engine.
search engine.

Heritrix provides flexible and powerful crawling options that


make it ideal for multiple focused crawling jobs. These fea-
3. ARCHITECTURE AND IMPLEMENTA- tures include the ability to filter documents based on many
TION deciding rules such as regular expressions, file types, file
size, and override the policies per domain. The ability to
Our design must include the most important parts of a tune the parameters of connection delay, and control the
search engine, a crawler and the index engine. YouSeer’s ar- max wait time along with number of concurrent connections
chitecture is presented in Figure 1. While most of YouSeer’s are advantageous when crawling for a vertical search engine.
components can be substituted with equivalent open source Despite the lack of support for parallel crawling on multiple
components, we describe the architecture and the implemen- instances, Heritrix is continuously being used at the Internet
tation using the components we deploy, without loss of gen- Archive to crawl and archive the web which can be argued
erality of the approach. to be the largest crawl ever to be conducted using an open

source crawler. While teaching a search engine class, stu- transition is made possible because all the different distri-
dents have preferred using heritrix over other open source butions of the index (cores, standalone, distributed) provide
crawlers such as Nutch because Heritrix provides an easy to the same RESTful API, and the ingestion module along with
use web interface to run crawling jobs, and for the richness the query interface only care about the server URL without
of the features and the detailed control of the parameters knowing the specific implementation of the server.
which the students can specify. This helped the students
grasp the challenges of crawling the web, while at the same 3.3 Database
time gave them the chance to monitor how the crawl job is
progressing and what parameters they need to change. A search engine occasionally needs to store some information
in a database. Such information can be for reporting pur-
Besides the web crawler, YouSeer implements its own local poses, or needed for performing certain operations. YouSeer
hard drive harvester. This allows it to function as a desktop uses a database for three reasons: (1) keep track of success-
search engine as well. The crawler runs in a breadth-first fully processed Warc/Arc files in order to avoid processing
manner, starting at a certain directory or network location them again, (2) for storing metadata about cached docu-
iterating over the the files and folders inside that folder. This ments, and (3) to log errors during ingestion. YouSeer uses
would complement our assumption in the framework that MySQL as DBMS server, however SQLite was proved to be
crawlers should produce Warc files, as the local file harvester suitable for small to medium level datasets. YouSeer inter-
would enable YouSeer to index documents mirrored onto the acts with the database through JDBC, hence it can adopt
file system by different crawlers that do not produce Warc any DBMS that has a JDBC driver.
files.
3.4 Extractor
3.2 Indexer Search engines need to handle files in multiple formats rang-
ing from simple html pages to files with rich content like
YouSeer adopts Apache Solr for indexing, which provides a Word and PDF along with audio and video. Apache TIKA
REST-like API to access the underlying Lucene index. Deal- [19] empowers YouSeer with the ability to extract metadata
ing with an indexing interface with a RESTful API [26] over and textual content from various file formats such as PDF,
HTTP gives a layer of abstraction to the underlying index- WORD, Power Point, Excel Sheets, MP3, ZIP, and multi-
ing engine, and provides YouSeer with the ability to employ ple image formats. Tika is currently a standalone Apache
any indexing engine as long as it provides a REST API [26]. project that supports standard interface for converting and
Such an API may be built as a wrapper on top of the ex- extracting metadata from popular document formats. While
isting non-web API. Thus, the indexing engine in YouSeer YouSeer ships coupled with Tika, it’s still fairly straightfor-
is just a URL with the operations that the index provides. ward to replace it with other converters as needed.
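As an illustration of this Tika-based extraction, a minimal call might look like the sketch below; the surrounding class is hypothetical, while AutoDetectParser, BodyContentHandler, and Metadata are standard Tika classes.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaTextExtractor {
    // Returns the plain text of a crawled document; Tika auto-detects the format
    // (PDF, Word, HTML, ...) and fills the Metadata object with fields such as the title.
    public static String extract(String path, Metadata metadata) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 removes the length limit
        InputStream stream = new FileInputStream(path);
        try {
            parser.parse(stream, handler, metadata);
        } finally {
            stream.close();
        }
        return handler.toString();
    }
}
```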
These operations are typically: index, search, delete, and
optimize. 3.5 Ingestion Module
Besides the native features of Lucene, Solr provides addi- The ingestion module is where the crawled documents get
tional features like faceted search, distributed search, and processed and are held to be indexed. The Warc/Arc files
index replication. All these features combined with the flex- are processed to extract the records containing individual
ibility to modify the ranking function makes a good case for documents and the corresponding metadata. Documents of
adopting Solr as indexer. In addition, Solr is reported to be predefined media types are passed to multiple extraction
able to index 3 billion documents [38]. modules such as PDFBox to extract their textual content
and metadata. The user specifies the mime types she is
YouSeer distribution deploys two instances of Solr, one for interested in indexing by editing a configuration file that
web documents and another one for files crawled from the lists the accepted mime types. The extracted information
desktop. The separation between the instances can be achieved is later processed and submitted to the index. This module
either by having two standalone Solr instances, or two cores also stores the document’s metadata into the database and
deployed on the same instance. Cores are methods for run- keeps track of where the cached copy is stored.
ning multiple indices with different configuration in the same
Solr instance. By default one core is configured to index The ingestion module is designed in a such a way that differ-
content from the web, and the other core is used to index ent extractors operate on the document, after that each ex-
documents on the file system. In the case of multi-core solr, tractor emits extracted fields, if any, to be sent to the index.
users maintain a single Solr instance, while having the abil- By default the Extractor class provides all the out of the box
ity to tune each index independently. This becomes a need extraction and population for the standard fields of the index
as field numbers for web content differs from file-system con- such as title, url, crawl date and others, while CustomEx-
tent. More importantly, the ranking of web documents may tractor is left for the end user to implement. CustomEx-
be far more complicated than ranking file-system content. tractor is called after Extractor giving the user the ability
Since YouSeer is only aware of the URL of the core (a core is to override the extracted fields, and extract new fields. This
treated just like a dedicated index), it can be easily modified approach makes it easy for the users to implement their own
to use a dedicated index instead of a core in case the number information extractors. For example, while building a search
of documents scales beyond what a single core can handle. engine for a newspaper website, the customer asked for pro-
Furthermore, Solr distributed search techniques can be used viding search capability based on the publication date. The
to replace a core when the number of documents grows be- publication date could be extracted from the URL as the
yond the capabilities of a single machine. This seamless newspaper formats its URL as follows:

www.example.com/YYYY/MM/DD/article.html

To achieve this, we implemented the CustomExtractor class


of the ingestion module so that it would extract the infor-
mation from the URL and append the extracted date to the
xml file which is to be sent to the indexer.
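A sketch of such an extractor is shown below. The CustomExtractor name comes from the framework, but its exact interface is not specified here, so the method signature and field name are assumptions. The returned fields would be appended to the document's XML before it is posted to the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumed shape of the user-supplied extractor: it receives the document URL
// (and, in the real system, the document itself) and emits extra index fields.
public class CustomExtractor {
    private static final Pattern DATE_IN_URL =
        Pattern.compile("/(\\d{4})/(\\d{2})/(\\d{2})/");   // .../YYYY/MM/DD/...

    public Map<String, String> extract(String url) {
        Map<String, String> fields = new HashMap<String, String>();
        Matcher m = DATE_IN_URL.matcher(url);
        if (m.find()) {
            // Stored in a sortable form so the index can rank or filter by it.
            fields.put("publication_date", m.group(1) + "-" + m.group(2) + "-" + m.group(3));
        }
        return fields;
    }
}
```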

3.6 Query Interface


YouSeer has two search interfaces basic and advanced that
provide access to the underlying Lucene index. The basic
interface is similar to most search engines where users enter
a simple query term before the relevant links are returned.
The advanced search provides search fields for each search-
able field in the index and allows the users to set the ranking
criterion. Query suggestions, aka auto-complete, are dis-
played for the user while inputting the query terms. These
Figure 2: Cache Architecture
suggestions are generated offline by extracting the word-level
unigram, bigram, and trigram of terms in the index. When
enough query logs are accumulated, they can be used for
Google document on the fly. The feature works on supported
query suggestions instead of the index terms.
formats only, like PDF, Doc, and PPT. This allows users
to quickly view rich media documents without the need to
Furthermore, queries are checked for misspellings using the
download them. Figure 2 shows the workflow of the caching
terms in the index instead of a third party dictionary. This
module.
would be suitable in the case of vertical search engines that
deal with special domains terminologies.
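The offline suggestion build described above could be approximated as in the following sketch, which counts word-level n-grams; in the real system the terms come from the index rather than from an in-memory document list, and the most frequent entries are served as auto-complete suggestions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Builds word-level unigrams, bigrams and trigrams with their frequencies.
public class SuggestionBuilder {
    public static Map<String, Integer> ngramCounts(List<String> documents) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String doc : documents) {
            String[] words = doc.toLowerCase().split("\\s+");
            for (int n = 1; n <= 3; n++) {
                for (int i = 0; i + n <= words.length; i++) {
                    StringBuilder gram = new StringBuilder(words[i]);
                    for (int j = 1; j < n; j++) gram.append(' ').append(words[i + j]);
                    String key = gram.toString();
                    Integer old = counts.get(key);
                    counts.put(key, old == null ? 1 : old + 1);
                }
            }
        }
        return counts;
    }
}
```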
4. WORKFLOW
Since the index is accessed through a REST web service,
the query interface receives the query terms from the user In this section we present an overview of how the whole
and send an HTTP request to the index. The REST API system works. A typical job starts by scheduling a crawl
provides a level of abstraction for the interface to communi- task on Heritrix. First the seed URLs are provided and the
cate with multiple types of indices as long as they provide a rest of the parameters are defined. These parameters include
similar API. the max-hop, max file size limit, max downloaded files limit,
and other crawl politeness and request-delay values. The
Along with the query interface, YouSeer provides an admin crawler proceeds by fetching the seed URLs, extracting their
interface from which users can launch new ingesting jobs and outgoing links, and scheduling these links for an in breadth-
track the progress of previously started jobs. first crawl. As part of the parameter specification, the user
chooses the format by which the crawled results are written
3.7 Documents Caching into. The default is Arc, but other file formats such as Warc
or simply mirroring the fetched documents on a hard drive
Accessing older versions of some documents, or being able are available. Converting the Arc files into Warc format
to view them while their original host is down, is consid- can be accomplished through command line tool. Should
ered an advantage for a search engine. The caching module the user keep the format as Arc, the downloaded documents
keeps track of the different versions that have been crawled are combined and then written to a single compressed ARC
and indexed of a document. As documents are stored into file [32], which is in this case limited to 100MB. Along with
Ward/Arc files, the relative location of the containing Warc every document, Heritrix stores a metadata record in the
file is stored in the index along with the documents fields. compressed file.
In addition to the surrogate file name, the index would con-
tain the offset of the file within the Warc file. The offset is The ingestion module, which is the middleware between the
needed because Warc and Arc files can only be read sequen- crawler and the index, waits for the ARC/WARC files to be
tially. The Warc/Arc files are mounted on virtual directory written and then iterates on all the documents within the
on the web server, therefore they can be accessed over the ARC file processing them sequentially. The ingestion pro-
network allowing them to be located on a different location cess does not necessarily wait for the crawler to terminate,
than the server or the crawler. rather it keeps polling for new files to be written so it can
process them. The middleware extracts the textual content
When the user requests a cached version for an indexed doc- from the HTML pages and the corresponding metadata cre-
ument, the caching module locates the containing Arc/Warc ated by Heitrix. For rich media formats such as Word, PDF,
file and seeks to the beginning of the document’s record read- Power Point, YouSeer converts the document into text us-
ing it and returning content to the user. If the requested file ing Apache TIKA. The output of the middleware is an XML
is not an HTML document, the module can convert DOC, file containing the fields extracted from the documents. The
PDF, PPT, XLS format and other formats into HTML. URL of each document serves as the document ID within
the index.
YouSeer caching module provides integration with Google
Docs preview, so that cached documents can be viewed as a Each ingestion plug-in contributes to building this XML file

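Because Heritrix typically writes each record as its own compressed member, the cached-copy lookup described above can seek directly to the stored offset and decompress a single record. The sketch below assumes per-record gzip compression and a hypothetical caller that supplies the offset taken from the index.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.util.zip.GZIPInputStream;

public class CachedRecordReader {
    // Reads one record from a WARC/ARC file given the byte offset stored in the index.
    public static byte[] readRecord(String warcPath, long offset) throws Exception {
        RandomAccessFile file = new RandomAccessFile(warcPath, "r");
        try {
            file.seek(offset);                                   // jump to the record's gzip member
            InputStream raw = Channels.newInputStream(file.getChannel());
            GZIPInputStream gzip = new GZIPInputStream(raw);
            ByteArrayOutputStream record = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            // A production reader would stop after the record's declared Content-Length;
            // this sketch simply decompresses until the stream ends.
            while ((read = gzip.read(buffer)) != -1) {
                record.write(buffer, 0, read);
            }
            return record.toByteArray();
        } finally {
            file.close();
        }
    }
}
```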
by appending its result as an XML tag. The URL of the
ARC file, and the offset of the document within the ARC
file are appended to the XML file to expedite retrieval.

The resulting XML file from the processing is posted to the
index. After processing all the documents within a single
ARC file, the middleware commits the changes to the index
and marks the ARC file as processed. While indexing, the
word-level n-grams are extracted and added to the query
suggestion module.

Table 1: Comparison of different parameters for ingesting and indexing OpenCourseWare content

Parameter          YouSeer     Nutch
# docs             100,000     42806
Size               15 GB       838 MB
CPU                0.81%       0.25%
Memory             37.44%      14.37%
Time in minutes    15          3.35
URL / Second       119         199
MB / Sec           16          3.8
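For reference, posting the resulting XML to a Solr-style index over HTTP, as described in the workflow above, can be as small as the sketch below; the host, core name, and field names are assumptions, while the /update path and the <add><doc> message follow Solr's standard XML update interface. The same request works against a standalone instance or a core, which is the abstraction the framework relies on.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPoster {
    public static void main(String[] args) throws Exception {
        // Minimal Solr XML "add" message; the document URL doubles as the unique id.
        String doc =
            "<add><doc>" +
            "<field name=\"id\">http://www.example.com/2012/08/16/article.html</field>" +
            "<field name=\"title\">Example article</field>" +
            "<field name=\"content\">extracted text goes here</field>" +
            "</doc></add>";

        // Assumed host and core name; /update and the XML format are Solr's standard interface.
        URL update = new URL("http://localhost:8983/solr/web/update?commit=true");
        HttpURLConnection conn = (HttpURLConnection) update.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(doc.getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```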
5. EXPERIMENTS
We perform a number of experiments to measure the perfor- These experiments show how the YouSeer framework can in-
mance of our proposed framework. The experiments entail gest a larger amount of data per second. And since Heritrix
crawling the web by focusing on a set of seed URLs then crawl jobs can run faster, plugging in different components
processing the crawled documents in the ingestion module of search engines seems to yield faster turn around time and
before they are indexed. larger processing power supporting our idea of utilizing dif-
ferent open source components rather than building all the
In the first experiment, we aim at creating a search engine pieces of a search engine.
for the OpenCourseWare (OCW)1 courses. We compile a
seed list of 50 English speaking universities and crawl the In another experiment, we crawled for 20 million documents
seeds with Heritrix 1.14.4. We set a limit of 100,000 to the with 50 threads, this job took less than 40 wall clock hours.
maximum number of files that can be downloaded. The One million documents of multiple formats (pdf, html, ppt,
job finished after reaching 100,000 documents in 4 hours doc, etc.) were indexed in less than 3 hours. These experi-
and 17 minutes running 50 threads. We used an out of the ments were conducted on a Dell server with 8 processors, 4
box configuration for Heritrix, and only modified the max cores each and 32 GB RAM, running linux.
number of documents to be fetched. The size of the data
crawled by Heritrix was 15 GB compressed into ARC files.
For comparison, we use Nutch 1.5 to crawl the same seed list.
6. CONCLUSION AND FUTURE WORK
Similar to Heritrix, we used 50 threads for crawling and keep We described the architecture of YouSeer, a complete open
the rest of the configurations to their default values. We source search engine. The approach used for building YouSeer
limited the hops to 5 and specify the topN value at 20,000. can be extended to support constructing powerful search
topN controls the maximum number of pages that can be tools by leveraging other open source components in such
downloaded at each level of the crawl. This should limit the a way that maximizes usability and minimizes redundancy.
entire crawl to roughly 100,000 pages. The job terminated YouSeer is a natural fit for vertical search engines and the
after 9 hours, and downloaded 42806 documents. The size of enterprise search domain. It also serves as a pedagogical tool
the entire crawl files was 838 MB, including segments, linkdb, for information retrieval and search engine classes. The ease
and crawldb. We guess that Nutch prioritized crawling small of use and flexibility of modification makes it adoptable for
HTML documents over PDF and PPT files. research experiments. Our experiments show that YouSeer
can be more effective than other complete open source search
For both YouSeer and Nutch, we used Apache Solr 3.6 as an engines in certain scenarios.
indexing engine. We start by running Nutch solr indexer,
and monitor the process using Sysstat [27], which is a pop- We enumerated the list of open source libraries that the sys-
ular tool for monitoring system resources and utilization on tem uses and introduced a middleware to coordinate these
Linux. As Nutch does not allow specifying the number of modules. The current version of YouSeer is hosted on Source-
threads for ingestion, unlike YouSeer, we started the inges- Forge and a virtual appliance box is available for download
tion command and monitored the threads long with mem- to eliminate the installation overhead.
ory and CPU usage through Sysstat. We recorded 16 active
threads that ran under the Nutch process during ingestion. The In the future, we plan to introduce modules to parallelize the
entire ingestions and indexing process took 3.35 minutes, processing and take advantage of the MapReduce paradigm.
that is 199 URLs/second and around 3.8 MBs/second. On We also look forward to investigating security models that
the other hand, since YouSeer allows controlling the number would protect the data from being accessed by unauthorized
of ingestion threads, we used the same number of threads as users. Currently we rely solely on the web server and the
reported by Sysstat. YouSeer middleware took 15 minutes operating system to provide security mechanisms.
to process the 100,000 documents. That is 111 URL/second
and around 16 MBs/second. Table 1 summarizes the results 7. ACKNOWLEDGMENTS
for OCW search engine experiment. The CPU and memory
usage represent the max usage as captured by Sysstat. The We acknowledge partial funding from NSF and the Informa-
machine on which the experiments were run is a Dell worksta- tion Technology Services at the Pennsylvania State Univer-
tion with 2 dual core processors and 4 GB of memory. The sity. We also thank Pradeep Teregowda, Mike Halm, and
CPU usage is normalized by the total number of CPUs. Kevin Kalupson for their contribution.
1
http://www.ocwconsortium.org/

8. REFERENCES M. Kimpton. Introduction to heritrix. In 4th
[1] http://www.iso.org/iso/news.htm?refid=Ref1255. International Web Archiving Workshop, 2004.
[2] Apache hadoop. http://hadoop.apache.org/. [33] P. Ogilvie, , P. Ogilvie, and J. Callan. Experiments
[3] Apache lucene. http://lucene.apache.org/. using the lemur toolkit. In In Proceedings of the Tenth
[4] Apache solr. http://lucene.apache.org/solr/. Text Retrieval Conference (TREC-10, pages 103–108,
[5] Heritrix. http://crawler.archive.org/. 2002.
[6] Hounder on google code. [34] C. Olston and M. Najork. Web Crawling. Foundations
http://code.google.com/p/hounder/. and Trends in Information Retrieval, 4(3):175–246,
[7] ht://dig. http://www.htdig.org/. 2010.
[8] Indri homepage. [35] I. Ounis, G. Amati, V. Plachouras, B. He,
http://www.lemurproject.org/indri.php. C. Macdonald, and D. Johnson. Terrier information
retrieval platform. Advances in Information Retrieval,
[9] Labrador homepage.
pages 517–519, 2005.
http://www.dcs.gla.ac.uk/~craigm/labrador/.
[36] I. Ounis, G. Amati, V. Plachouras, B. He,
[10] Lemure homepage. http://www.lemurproject.org.
C. Macdonald, and C. Lioma. Terrier: A High
[11] Mg4j homepage. http://mg4j.dsi.unimi.it/. Performance and Scalable Information Retrieval
[12] mnogosearch homepage. Platform. In Proceedings of ACM SIGIR’06 Workshop
http://www.mnogosearch.org. on Open Source Information Retrieval (OSIR 2006),
[13] Nutchwax. http://archive-access.sourceforge. 2006.
net/projects/nutch/. [37] A. Patterson. Why writing your own search engine is
[14] Omega homepage. hard. Queue, 2(2):48, 2004.
http://xapian.org/docs/omega/overview.html. [38] J. Rutherglen. Scaling solr to 3 billion documents.
[15] Sphinx homepage. http://www.sphinxsearch.com. http://2010.lucene-eurocon.org/slides/
[16] Swish-e homepage. http://swish-e.org/. Scaling-Solr-to-3Billion-Documents_
[17] Terrier homepage. http://terrier.org/. Jason-Rutherglen.pdf. Apache Lucene EuroCon
[18] Text retrieval conference (trec). 2010.
http://trec.nist.gov/. [39] V. Singh. A comparison of open source search engines.
[19] Tika homepage. http://tika.apache.org/. http://zooie.wordpress.com/
[20] Webglimpse homepage. http://webglimpse.net. 2009/07/06/a-comparison-of-open- source-search-
[21] Xapian homepage. http://xapian.org. engines-and-indexing-twitter/, 7 2009.
[22] Zettair homepage. [40] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft.
http://www.seg.rmit.edu.au/zettair. Indri: a language-model based search engine for
[23] P. Boldi and S. Vigna. MG4J at TREC 2005. In E. M. complex queries. Technical report, University of
Voorhees and L. P. Buckland, editors, The Fourteenth Massachusetts Amherst, 2005.
Text REtrieval Conference (TREC 2005) Proceedings,
number SP 500-266 in Special Publications. NIST,
2005. http://mg4j.dsi.unimi.it/.
[24] M. Cafarella and D. Cutting. Building nutch: Open
source search. Queue, 2(2):61, 2004.
[25] J. Dean and S. Ghemawat. MapReduce: Simplified
data processing on large clusters. Communications of
the ACM, 51(1):107–113, 2008.
[26] R. Fielding. Architectural styles and the design of
network-based software architectures. PhD thesis,
University of California, 2000.
[27] S. Godard. Sysstat: utilities for linux.
http://sebastien.godard.pagesperso-orange.fr/.
[28] M. Henzinger, R. Motwani, and C. Silverstein.
Challenges in web search engines. In ACM SIGIR
Forum, volume 36, pages 11–22. ACM, 2002.
[29] R. Khare, D. Cutting, K. Sitaker, and A. Rifkin.
Nutch: A flexible and scalable open-source web search
engine. Oregon State University, 2004.
[30] U. Manber and S. Wu. GLIMPSE: A tool to search
through entire file systems. In Usenix Winter 1994
Technical Conference, pages 23–32, 1994.
[31] C. Middleton and R. Baeza-Yates. A comparison of
open source search engines. In grid-computing,
volume 1, page 1.
[32] G. Mohr, M. Stack, I. Rnitovic, D. Avery, and

Towards an Efficient and Effective Search Engine

Andrew Trotman Xiang-Fei Jia Matt Crane


Department of Computer Department of Computer Department of Computer
Science Science Science
University of Otago University of Otago University of Otago
Dunedin, New Zealand Dunedin, New Zealand Dunedin, New Zealand
[email protected] [email protected] [email protected]

ABSTRACT much work is involved. Our indexer is multi-threaded


Building an efficient and effective search engine requires with a unique pipeline methodology. We also imple-
both science and engineering. In this paper, we discuss the mented a memory management subsystem in the in-
ATIRE search engine developed in our research lab, and dexer for fast memory allocation. The indexer also
both the engineering decisions and research questions that supports the merging of multiple indexes into a single
have motivated building ATIRE. index.
• What is the most efficient structure for the in-
Categories and Subject Descriptors verted index? The full structure of the inverted in-
dex is rarely discussed in the literature, previous dis-
H.3.1 [Information Storage and Retrieval]: Content
cussions have only discussed the techniques used for
Analysis and Indexing – Indexing methods; H.3.3 [Information
index representation [34]. In this paper, we discuss
Storage and Retrieval]: Information Search and Retrieval
how we engineered our index structure.
– Search process
• How to search efficiently without sacrificing ef-
General Terms fectiveness? We have been working on the optimisa-
tion of the term-at-a-time approach for query evalua-
Algorithms, Performance
tion and for future work this will be used as a baseline
for comparing various pruning algorithms; and com-
Keywords paring between term-at-a-time and document-at-at-a-
Indexing, Storage, Efficiency, Pruning, Procrastination time processing.
• Does term proximity work? We question whether
1. INTRODUCTION term proximity and phrase searching are effective un-
Information retrieval has been a hot research topic for decad- der current evaluation methodologies.
es due to the need to quickly and accurately answer users’
• Other research questions? There are a number
queries across very large document collections, for example
of other research questions we intend to address in
the web. Building such an efficient and effective search en-
future work: generalisation of our fusion of ranking
gine involves not only science but also engineering. Science
functions such as BM25 and PageRank; an exploration
provides a range of algorithms for fast searching and better
into the juxtaposition between diversity and relevance
ranking, and engineering is required so that systems can be
feedback; and fully distributed indexing and searching.
tuned to their optimal performance.
There are a number of existing search engines, both pro-
prietary and open-source, for example, Google, MG and 2. FAST INDEXING
Apache Lucene. However, we have built a new search en- The experiments and results shown were conducted on a col-
gine called ATIRE from the ground up, to ensure we have a lection of standard collections from both INEX and TREC
fast robust baseline in order to compare new information re- forums, as described in Table 1. The experiments, with the
trieval technologies; and to conduct state-of-the-art informa- exception of ClueWeb09 collections, were conducted on a
tion retrieval research questions. The ATIRE search engine dual quad-core Intel Xeon E5410 2.3GHz, DDR2 PC5300
is a cross-platform search engine — running on Windows, 9GB main memory, Seagate 7200RPM 500GB hard drive,
Linux and Mac OSX — written in C/C++, with tradition- and running Linux with kernel version 2.6.30. The ClueWeb-
ally non-interoperable sections hand-coded to avoid the use 09 collection experiments were performed on an quad cpu
of a third-party abstraction layer. AMD Opteron 6276 2.3GHz 16-core, 512GB PC12800 main
The questions we want to address are: memory, 6x 600GB 10000 RPM hard drives, and running
Linux with kernel version 2.6.32.
• How to build a fast indexer: It is very challenging In order to produce an index quickly, the indexer in ATIRE
to build a fast indexer due to the complexity of how uses several optimisations and a unique pipeline procedure
based on the producer/consumer model.
The copyright of this article remains with the authors.
SIGIR 2012 Workshop on Open Source Information Retrieval. The main optimisation that ATIRE uses when indexing
August 16, 2012, Portland, Oregon, USA. is the use of an internal memory management system that

Collection                  Collection Size   Index Size   Index %   Documents   Unique Words   Total Words
Wall Street Journal [14] 517MB 64MB 12.4% 173,252 229,514 84,881,717
WT10g [4] 10GB 837MB 8.4% 1,692,096 5,512,114 1,348,119,626
2009 INEX Wikipedia [29] 50.7GB 1.6GB 3.2% 2,666,190 11,874,077 2,341,271,195
WT100g/VLC2 [16] 100GB 7.1GB 7.1% 20,616,457 25,250,355 12,690,145,498
.gov2 [9] 400GB 12GB 3% 25,205,179 40,641,599 32,573,784,848
ClueWeb09 Category B 1.5TB 32GB 2% 50,220,423 96,298,556 71,319,689,402
ClueWeb09 Category A 12.5TB 503,903,810
(excl. 70% spam) 3.8TB 76GB 2% 150,954,279 127,651,335 189,731,940,667

Table 1: Summary of Collections Used

Memory Manager    Indexing Time (mm:ss)
System            14:18
ATIRE             10:37

Table 2: Indexing times for INEX 2009 Wikipedia collection across four .tar.gz files with different memory managers

Number of Input Files (.tar.gz format)    Indexing Time (mm:ss)
1                                         19:35
2                                         11:47
4                                         10:37

Table 3: Indexing times for INEX 2009 Wikipedia collection with varying number of input files

Input Format            Indexing Time (mm:ss)
.tar                    10:35
.tar.gz                 10:37
.tar.lzo                10:09
.tar.bz2                19:10
File count               5:40
Extracted line count     6:54
Individual files        64:30

Table 4: Indexing times for INEX 2009 Wikipedia collection under various compression schemes across four files

requests large blocks of memory from the system and divides it up as necessary. The overhead of this manager can be measured by compiling without it and using the system memory management instead. This optimisation alone reduces the time taken to index the 2009 INEX Wikipedia collection by one-third, as shown in Table 2. These times are taken from a single run, with disk caches flushed between runs, but are indicative of typical performance.
The indexing pipeline that ATIRE uses internally is unique among open-source search engines. The pipeline consists of a group of parallel producer/consumer inspired objects that either deal with streams of data, or with file-like objects that are
created from these streams. Each step in the pipeline is
focused on only performing one operation on the passed-
through data, minimising the amount of inspection per- of files increases, and our ability to index these files in paral-
formed at each step. lel increases with it, the total time to index is decreased.
These different stages in the pipeline can be combined This leads us to believe that our indexer is approaching
quickly and efficiently to allow new types of content to be the point where indexing is bound by decompression time.
indexed. For instance, an existing object that un-tars, and Indexing times for the INEX 2009 Wikipedia collection when
an object that un-gzips can be combined to allow the index- split across four files are shown in Table 4, with the time to
ing of .tar.gz files. index the individual extracted files shown for comparison,
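To make the pipeline composition concrete, the following sketch (not the actual ATIRE source; the class names, the byte_stream interface and the use of zlib are assumptions made for illustration) chains a file source and a gunzip filter behind a common read() interface; an un-tar stage would wrap the gunzip stream in exactly the same way to expose per-document file-like objects.

#include <zlib.h>
#include <cstdio>

// Common interface for every stage in the pipeline: read up to
// 'bytes' processed bytes into 'buffer', returning the count produced.
class byte_stream {
public:
    virtual ~byte_stream() {}
    virtual long read(unsigned char *buffer, long bytes) = 0;
};

// Source stage: reads raw bytes from a file on disk.
class file_stream : public byte_stream {
    FILE *fp;
public:
    explicit file_stream(const char *filename) { fp = fopen(filename, "rb"); }
    ~file_stream() { if (fp) fclose(fp); }
    long read(unsigned char *buffer, long bytes) override {
        return fp ? (long)fread(buffer, 1, (size_t)bytes, fp) : 0;
    }
};

// Filter stage: gunzips whatever the wrapped stream produces.
class gunzip_stream : public byte_stream {
    byte_stream &source;
    z_stream strm;
    unsigned char inbuf[1 << 15];
public:
    explicit gunzip_stream(byte_stream &src) : source(src), strm() {
        inflateInit2(&strm, 16 + MAX_WBITS);    // 16 + MAX_WBITS: expect a gzip header
    }
    ~gunzip_stream() { inflateEnd(&strm); }
    long read(unsigned char *buffer, long bytes) override {
        strm.next_out = buffer;
        strm.avail_out = (uInt)bytes;
        while (strm.avail_out > 0) {
            if (strm.avail_in == 0) {           // refill the compressed input buffer
                long got = source.read(inbuf, (long)sizeof inbuf);
                if (got <= 0) break;
                strm.next_in = inbuf;
                strm.avail_in = (uInt)got;
            }
            int status = inflate(&strm, Z_NO_FLUSH);
            if (status == Z_STREAM_END || status < 0) break;
        }
        return bytes - (long)strm.avail_out;    // bytes actually produced
    }
};

int main(void) {
    // Composition: file -> gunzip; an un-tar stage would wrap 'unzipped'
    // in the same way to yield per-document file-like objects.
    file_stream   raw("collection.tar.gz");
    gunzip_stream unzipped(raw);

    unsigned char block[1 << 16];
    long total = 0, got;
    while ((got = unzipped.read(block, (long)sizeof block)) > 0)
        total += got;
    printf("decompressed %ld bytes\n", total);
    return 0;
}

Because every stage implements only read(), a new input format is supported by adding one small class and re-ordering the chain.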
Objects in the pipeline are allowed to perform secondary as a clearly input bound operation (probably by the ability
functions, for instance, compressing the original document to open and close files). We are pleased to notice that we
and including it within the index (for post-processing such have already crossed the point at which we take less than
as focused retrieval and snippet generation). The input twice the time as simply counting the number of files within
pipeline allows the indexer to filter out documents, such the tar file. We also show the total time taken to count the
as those identified as spam, and a best-effort attempt to number of lines in extracted files as a target to aim for. We
clean incoming data to negate any pre-processing of docu- have not yet tuned the number of threads our indexer uses
ment collections that might otherwise be necessary. This against the number of cores in the machine.
is motivated by our underlying philosophy that the indexer The ATIRE search engine defines a word to be a sequence
should be able to index any standard test collection out of of characters or numbers, where a character or number is de-
the box without any pre-processing. termined by the unicode specification. We currently assume
At the end of the pipeline each document is indexed sep- that input is in UTF-8 format, and can process encoding er-
arately and folded into the overall index. This, combined rors that may be encountered such as missing continuation
with the indexing pipeline, allows documents to be indexed bytes. The input is decomposed, normalised and lower-cased
completely in parallel on a single computer. following the unicode specifications. ATIRE supports CJK,
The effect that parallel indexing has on the indexing time and includes chinese segmentation, and has been used in ex-
is shown in Table 3. The times shown are from a single run, periments at NTCIR. Currently ATIRE does not support
with disk caches flushed between experiments, but are typi- entities such as &aacute; and ignores any processing direc-
cal times experienced. This table shows that as the number tives contained within the document, except for comments.

[Figure 1: The overall structure of the index file — the header string "ATIRE Search Engine Index File\n\0\0", an optional compressed collection, the compressed postings lists, the vocabulary (term root and term blocks holding term prefixes and suffixes), and a footer.]

[Figure 2: The structure of a postings list — for each impact value, a difference-encoded list of docids terminated with a 0, with the whole list stored under the chosen compression scheme.]

[Figure 3: The structure of a vocabulary leaf — per term: collection frequency, document frequency, position on disk, impacted length, postings length, max impact and suffix position (with the byte width of each field shown), followed by the null-terminated term suffixes.]

ATIRE performs all indexing in memory on a single ma-


chine, although distributed indexing is being investigated. search engine, these values are scaled to 1–255, so that they
To work around this limitation, ATIRE also includes a tool may be stored in one byte.
to combine previously generated indexes with minimal mem- Storing the postings lists in this impact ordered format
ory overhead. By processing each indexed term separately can be thought of as a form of compression, as fewer inte-
the merge tool only requires enough memory to contain the gers need be stored. In the worst case where every impact
merged postings list, and to maintain the first level of the value is used, then in an impact ordered format, D + 510
dictionary structure described below. This allows the in- integers need to be stored (due to scaling, quantisation and
dexing of collections that otherwise could not be indexed on list termination), as opposed to 2D, where D is the number
commodity hardware, for example the ClueWeb09 Category of documents that contain the term.
A collection. We refer to each list of docids for each impact value as a
quantum. Each quantum is stored as a difference encoded
list, and the entire impact ordered list is then compressed.
3. INDEX STRUCTURE The ATIRE search engine is capable of compressing postings
The ATIRE search engine generates a single index file that lists using different compression schemes (carryover 12, elias
consists of multiple distinct sections. By restricting the in- delta, elias gamma, golomb, none, relative 10, sigma, simple
dex to a single file we minimise the likelihood of a user not 9, and variable byte) to minimise the disk space taken by the
having all parts of the index at search time. index. By default, however, ATIRE uses variable byte com-
The first few bytes of the index file contain the string pression. A diagrammatic representation of this structure is
ATIRE Search Engine Index File\n\0\0, so that the file shown in Figure 2.
type can be identified by a person using the command head The third section of the index contains the vocabulary
-n 1. A diagrammatic overview of the index structure is structure that holds all the terms that have been indexed.
shown in Figure 1. By default, the ATIRE search engine uses the embedfixed al-
The first section of the index file is optional, and contains gorithm [18] to store vocabulary terms since this algorithm
the compressed original documents in the collection. This provides a good trade-off between storage space and lookup
feature allows for, among other things, focused retrieval and time. The embedfixed algorithm stores the vocabulary in a
snippet generation. The location of each compressed docu- two level B-tree structure. The first level of which contains
ment is stored in a special term inside the index. the unique first four character prefixes of terms in the vo-
The second section of the index file contains the postings cabulary. Each leaf node in the second layer contains the
lists for each of the terms. Traditionally postings lists are suffixes of those terms that share the common prefix of the
stored as a sequence of ⟨docid, term frequency⟩ pairs, or- parent node. In essence, it is a form of front-encoding that
dered by docid. In the ATIRE search engine we instead sort can be searched efficiently.
on term frequency first [26, 27], then for each term frequency As well as storing terms in the vocabulary, a number of
we store the docids in a difference encoded list, terminated variables are also required for each term; they are collection
with a 0. This ordering on term frequencies first is referred and document frequencies used for ranking purposes, loca-
to as impact ordered. tion of the postings list stored on disk and the list length
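As an illustration of the impact-ordered layout (a sketch under assumptions, not ATIRE's code: the raw term frequency is used directly as the impact value, and a common variable-byte convention is assumed), the function below groups a term's ⟨docid, tf⟩ postings by decreasing term frequency, difference-encodes the docids within each group, terminates each group with a 0, and variable-byte encodes the result.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Append one integer in variable-byte form: 7 bits per byte,
// high bit set on the final byte of each value.
static void vbyte_encode(std::vector<uint8_t> &out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back((uint8_t)(value & 0x7F));
        value >>= 7;
    }
    out.push_back((uint8_t)(value | 0x80));
}

// Build an impact-ordered list: for each term frequency (highest first)
// emit <tf, d-gap, d-gap, ..., 0>, then variable-byte encode everything.
static std::vector<uint8_t>
impact_order(const std::vector<std::pair<uint32_t, uint32_t>> &postings) {   // (docid, tf)
    // Group docids by tf, largest tf first.
    std::map<uint32_t, std::vector<uint32_t>, std::greater<uint32_t>> quanta;
    for (const auto &p : postings)
        quanta[p.second].push_back(p.first);

    std::vector<uint8_t> encoded;
    for (auto &q : quanta) {
        vbyte_encode(encoded, q.first);               // the impact value (here: raw tf)
        uint32_t previous = 0;
        for (uint32_t docid : q.second) {             // docids remain in docid order
            vbyte_encode(encoded, docid - previous);  // difference encode within the quantum
            previous = docid;
        }
        vbyte_encode(encoded, 0);                     // 0 terminates this quantum
    }
    return encoded;
}

int main(void) {
    // Postings for one term, in docid order: (docid, tf).
    std::vector<std::pair<uint32_t, uint32_t>> postings =
        {{3, 1}, {7, 4}, {12, 4}, {90, 1}, {260, 2}};
    std::vector<uint8_t> list = impact_order(postings);
    printf("impact-ordered list occupies %zu bytes\n", list.size());
    return 0;
}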
The ATIRE search engine supports the use of precom- used for retrieval of the postings from disk. These variables
puted quantised impact scores, where instead of storing the are stored in the leaf nodes of the vocabulary B-tree.
term frequency values we instead precompute the RSV for Aside from these variables associated with each term, ex-
each term with respect to each document after indexing is tra variables are introduced for each term: the postings length
complete [1]. holds the compressed length of the postings list; the im-
In order to better compress these numbers, they are quan- pacted length variable stores the number of integers in the
tised into integers using the Uniform method [1], which pre- decompressed postings list. These values allow our decom-
serves the original distribution of numbers, so no additional pression routines to take the form “decompress n integers
decoding is required at query time [23, 1]. In the ATIRE from this pointer”, and by identifying the longest decom-

pressed postings lists (which is stored in the file footer), allocate a single buffer for decompression purposes at search time.
The suffixes for each term inside the leaf node are stored as a null-terminated set of strings at the end of the leaf node block. For this reason the suffix position variable identifies where in this block of suffixes the suffix for this individual term begins. The local max impact holds the maximum impact value for the term, and is used for early termination and pruning of query evaluation [26, 27]. Figure 3 shows a diagrammatic layout of these variables and the number of bytes assigned to them.

Collection                           Index Time     Search Time    MAP
                                     (hh:mm:ss)     Per Query
ClueWeb09 Cat. B                     4:10:20        11.9s          0.1216
ClueWeb09 Cat. A (excl. 70% spam)    20:30:38       30.7s          0.1028

Table 5: Comparison of timings for indexing and searching across the ClueWeb09 collection
As a further space saving we can store the postings lists query term at a time.
directly in the vocabulary structure for terms that occur ei- There are advantages and disadvantages to the two ap-
ther once or twice in the collection. This can be done by proaches; (1) term-at-a-time requires an array of interme-
re-purposing some of the variables in the vocabulary leaves, diate accumulators (one for each document) to hold the
much like a union in the C programming language. How accumulated results between the evaluation of each term,
to process these lists can be determined at run time by ex- while document-at-a-time only needs to hold the top n doc-
amining the document frequency for the term. It not only uments (where n is the number of documents to return).
saves the storage space for the postings, but also eliminates Turtle & Flood [33] state that document-at-a-time is more
the extra storage needed for the impact header and postings cost efficient than term-at-a-time based on the assumption
list header. that the intermediate accumulators are stored on disk. They
The ATIRE search engine has the option of loading in- state that the performance of the two methods would be
dexes completely into memory at search time. In the case equivalent if the accumulators could be stored in memory.
that the index is not loaded completely into memory, the (2) Document-at-a-time requires a random scan of post-
vocabulary root is loaded into memory, and during query ings lists for all the query terms in order to fully evaluate
evaluation is binary searched. The relevant term leaf is a document. This scan takes time especially if all post-
then loaded into memory and binary searched to find the ings lists cannot be held in memory. Skipping [21] and
term details, then the document frequency is checked. This blocking [22] were introduced to allow pseudo-random ac-
technique is a form of pre-fetching [20], and saves an extra cess into postings lists. However, there is an extra overhead
disk seek and read. Further details of this method are to be to build skipping and blocking, and the index size increases.
published at a later date. Broder et al. [6] addressed this random scan problem by
Lastly, the index file contains a footer that contains vari- introducing a new document-at-a-time query processing al-
ables that describe the index, and are used to minimise gorithm called WAND which can smartly skip some unnec-
the number of memory allocations needed when performing essary postings for fast scanning. Ding & Suel [12] further
query evaluations, such as the length of the longest postings extended the WAND algorithm and introduced Block-Max
list. These variables are designed to allow the ATIRE search WAND which can further skip more unnecessary postings.
engine to perform no dynamic memory allocation at search The skipping criteria for both of the algorithms are based
time. Certain variables that are associated with the index on the runtime calculation of current thresholds of the max-
that are known at indexing time, such as whether the im- imum impacts for all query terms. (3) Postings lists for
pact values are pre-calculated RSV scores, are stored within document-at-a-time must be longer because the postings
the index itself as special terms. lists are sorted on doc id and are not impact ordered. (4)
As shown in Table 1, the ATIRE search engine is capa- Intuitively, document-at-a-time is more suitable for conjunc-
ble of producing compact indexes that are a fraction of the tive search while term-at-a-time for disjunctive search.
size of the original collection, the rate of which depends Most criticisms aimed at term-at-a-time approach are to-
largely on the ratio between indexable and non-indexable wards the requirement of the intermediate accumulators and
content. The ClueWeb09 Category A index was constructed the need to sort the accumulators to return the top docu-
with spam filtering set to discard the 70% spammiest ments. However, we believe that it is more difficult to man-
documents, as suggested by Cormack et al. [10], with the age the memory for all postings lists and efficiently random
number of documents included in the index shown in brack- scan the postings lists for document-at-a-time. We are not
ets in Table 1. concluding that term-at-a-time is better than document-at-
Table 5 shows some comparisons for indexing and search- a-time, or vice versa. Instead we have built a baseline using
ing times across the ClueWeb09 Category A and B collec- the term-at-a-time approach in the ATIRE search engine
tions. Each index was constructed without quantisation and and will use this baseline to compare and investigate the
searching was performed using a single thread, with none of document-at-a-time approach in future work.
the optimisations discussed below, across queries 101–150. The rest of this section discusses how the issues associated
with the term-at-a-time approach are addressed in ATIRE
for query evaluation.
4. QUERY EVALUATION
There are two main query evaluation methods used in infor- 4.1 Ranking Functions
mation retrieval systems, document-at-a-time and term-at- By default, ATIRE uses a modified BM25 ranking function.
a-time processing. The document-at-a-time approach com- This variant does not result in negative IDF values 1 and is
pletely evaluates one document at a time before moving
1
to the next, while the term-at-a-time approach processes one We thank Shlomo Geva for this contribution.

defined as:

\[
RSV_d = \sum_{t \in q} \log\!\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1\left((1 - b) + b \times \frac{L_d}{L_{avg}}\right) + tf_{td}}
\]

is updated for the accumulator. Second (lines 4 to 12), if the number of the current top candidate documents is less than the required (result list < top k), it means the heap is not full. A new document (if old value = 0) can be simply
Here, N is the total number of documents, and dft and added to the heap and the corresponding bit is set. When
tftd are the number of documents containing the term t and the heap is full (result list = top k), it is required to build
the frequency of the term in document d, and Ld and Lavg the minimum heap on the heapk array. Third (lines 13 to
are the length of document d and the average length of all 14), if result list is no less than top k and the current doc-
documents. The empirical parameters k1 and b have been ument is marked as set (top bitstring[index]), it means the
set to 0.9 and 0.4 respectively by training on the INEX 2008 document which is already in the top gets updated. Updat-
Wikipedia collection. ing one of the top documents could violate the properties of
There are a number of other ranking functions supported the minimum heap. It is necessary to call min update() to
in ATIRE as well, for example: Bose-Einstein GL2, Diver- partially fix the heap. Last (lines 15 to 18), if result list is
gence from randomness, Terrier DPH and DFRee, Language no less than top k and the score is greater than the smallest
Models, and Pregen [30]. score in the heap (which is heapk[0]), it means the docu-
ment, which was not in the top, should now be inserted into
4.2 Pruning the top to replace the smallest score. The bit of the smallest
The processing (decompression and similarity ranking) of document in heap should be unset. The new document is
postings and subsequent sorting of accumulators can be com- inserted into the heap by calling min insert.
putationally expensive, especially when queries contain fre- Instead of repeatedly re-building the minimum heap for
quent terms. Processing of these frequent terms not only update and insertion operations, two special functions are
takes time, but also has little impact on the final ranking implemented for efficiency optimisation. Every time one of
results. Postings pruning is a method to eliminate unnec- the top candidate documents gets updated, the min update()
essary processing of postings and provide partial scores for function is called. It first linearly scans the heapk array to
top-k documents. Postings pruning can be done at either locate the right pointer and then partially traverses down
index time or query time. Pruning at index time reduces the subtree of the pointer for proper update of the mini-
the physical size of the index file [8, 25, 5]. However it is a mum heap. The linear scan is required because the mini-
lossy compression; pruned postings are not kept for access mum heap is not a binary search tree. Every time a new
at query time. document is going to be inserted into the minimum heap,
Pruning at query time does not modify the index, but the min insert() function is called. It first replaces the doc-
prunes postings at run-time during query evaluation. It al- ument with the smallest score and then partially traverses
lows different criteria at query time to be applied to keep down the tree for proper update of the minimum heap.
track of top k documents. A number of pruning methods
have been developed and proved to be efficient [7, 15, 24, Algorithm 2 Heapk Update
21, 32, 27, 1, 31, 17, 19]. ATIRE supports both pruning at Require: index ≥ 0 and score > 0
index time and at query time, and pruning at query time is 1: old value ← get the current value of acc[index]
discussed in this section. 2: acc[index] ← acc[index] + score
In ATIRE, the heapk pruning algorithm [17, 19] is used 3: new value ← get the current value of acc[index]
to keep track of the top-k documents. There are two stages 4: if result list < top k then
in the algorithm. The first stage is the initialisation stage, 5: if old value = 0 then
as shown in Algorithm 1. N is the number of documents 6: heapk[result list] ← address of acc[index]
in the collection. top k is the number of top documents 7: result list ← result list + 1
(specified as a command-line parameter) to be returned. 8: set the bit of top bitstring[index]
result list keeps track of the number of current top can- 9: end if
didate documents during evaluation. acc is the accumulator 10: if result list = top k then
array which holds the intermediate similarity scores for each 10: if result list = top k then
document. heapk is an array of pointers which will be used 12: end if
by the minimum heap to keep track of current top docu- 13: else if top bitstring[index] is set then
ments. top bitstring is an array of bits (one bit for each 14: min update() to update the heapk
document) to track if the document is marked as one of the 15: else if new value > the value of heapk[0] then
top candidate documents. 16: unset the bit of top bitstring[heapk[0]]
17: min insert(acc[index])
Algorithm 1 Heapk Initialisation 18: set the bit of top bitstring[index]
Require: N > 0 and lower k > 0 19: end if
1: N ← total documents in collection
2: top k = lower k The value of lower k can be specified from command-
3: result list = 0 line, used to tell the heapk pruning algorithm how many top
4: acc ← new array[N ] documents to keep track of.
5: heapk ← new array[N ] The performance of the heapk pruning algorithm was in-
6: top bitstring ← new array[N ] vestigated in INEX 2010 and the results showed that the
algorithm is not only CPU cost efficient but also effective.
The second stage is the update stage, shown in Algo- For details of the experiments and results, see our previous
rithm 2. There are four steps. First (lines 1 to 3), the score work [17].
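The following is a minimal sketch of the idea behind the heapk approach, assuming invented names and a simplified interface rather than the ATIRE implementation: an accumulator array, a bit vector marking the current top-k members, and a minimum heap of document ids keyed by accumulator score. Because a document's score only ever grows, min_update() needs only a linear scan to locate the entry followed by a downward sift, and min_insert() replaces the root.

#include <cstdio>
#include <utility>
#include <vector>

struct topk_tracker {
    std::vector<double> acc;       // one accumulator per document
    std::vector<bool>   in_top;    // one bit per document: currently in the top k?
    std::vector<int>    heap;      // document ids, minimum score at heap[0]
    size_t              k;

    topk_tracker(size_t documents, size_t top_k)
        : acc(documents, 0.0), in_top(documents, false), k(top_k) {}

    // Restore the min-heap property downwards from position 'pos'.
    void sift_down(size_t pos) {
        for (;;) {
            size_t left = 2 * pos + 1, right = left + 1, smallest = pos;
            if (left  < heap.size() && acc[heap[left]]  < acc[heap[smallest]]) smallest = left;
            if (right < heap.size() && acc[heap[right]] < acc[heap[smallest]]) smallest = right;
            if (smallest == pos) return;
            std::swap(heap[pos], heap[smallest]);
            pos = smallest;
        }
    }

    // A top-k member's score only ever increases, so find it and push it down.
    void min_update(int doc) {
        for (size_t i = 0; i < heap.size(); i++)
            if (heap[i] == doc) { sift_down(i); return; }
    }

    // Replace the current minimum with 'doc' and restore the heap.
    void min_insert(int doc) {
        in_top[heap[0]] = false;
        heap[0] = doc;
        in_top[doc] = true;
        sift_down(0);
    }

    // The update stage: add a partial score to one document's accumulator.
    void add(int doc, double score) {
        double old_value = acc[doc];
        acc[doc] += score;
        if (heap.size() < k) {                       // result list not yet full
            if (old_value == 0.0) {
                heap.push_back(doc);
                in_top[doc] = true;
                if (heap.size() == k)                // build the minimum heap once full
                    for (size_t i = heap.size() / 2; i-- > 0; ) sift_down(i);
            }
        } else if (in_top[doc]) {                    // already a top-k candidate: fix the heap
            min_update(doc);
        } else if (acc[doc] > acc[heap[0]]) {        // beats the current minimum: replace it
            min_insert(doc);
        }
    }
};

int main(void) {
    topk_tracker top(1000, 3);
    top.add(5, 2.0); top.add(9, 1.0); top.add(42, 3.0);   // heap fills up
    top.add(9, 0.5); top.add(7, 4.0);                     // one update, one replacement
    printf("current minimum of the top 3: doc %d (%.1f)\n",
           top.heap[0], top.acc[top.heap[0]]);
    return 0;
}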

4.3 Accumulator Initialisation Algorithm 4. First, the index of an accumulator is divided to
The term-at-a-time approach uses a number of accumula- locate the logical row of the accumulator. Second, the status
tors, usually as a static array, to hold the intermediate ac- of the row flag is checked and two outcomes can happen; (1)
cumulated results for each document. For large collections, If the flag has a value of 0, the associated accumulators in
there can be a large number of accumulators and it takes the row are initialised and the new value is then added to
time to initialise them. One way to avoid this problem is to the accumulator. (2) If the flag has a value of 1, the new
use few accumulators allocated using dynamic search struc- value can be simply added to the accumulator.
tures [24, 21]. However, dynamic structures require more
memory space for each accumulator. For example, a bal- Algorithm 3 Accumulator Initialisation
anced Red-Black tree structure [11] uses about 20 and 32 Require: width ≥ 2
bytes for each accumulator on 32- and 64-bit architectures 1: N ← total documents in collection
respectively. Compared with only 4 bytes required in a static 2: height ← (N/width) + 1
array, only 20% for 32-bit (12.5% for 64-bit) or less of the 3: init f lags ← new array[height]
total number of accumulators should be allocated, other- 4: initialise init f lags
wise the Red-Black tree structure uses more memory than 5: padding ← (width ∗ height) − N
a static array. 6: acc ← new array[N + padding]
For ATIRE, a new efficient accumulator method has been
developed. It not only keeps track of the top candidates
(using the heapk algorithm) but also updates the less im-
portant accumulators. This allows initially low scoring can- Algorithm 4 Accumulator Update
didates to be among the top ones at the final stage. The Require: doc id ≥ 0 and doc id < N
method uses two static arrays. One array is used to hold 1: row ← doc id/width
all accumulators (one for each document) and the other to 2: if init f lags[row] == 0 then
hold a number of flags. Every flag is associated with a par- 3: init f lags[row] ← 1
ticular subset of the accumulators, indicating the initialisa- 4: initialise the row of the accumulators in acc
tion status for that set of accumulators (either initialised 5: end if
or not). Essentially, we turn the one dimensional array of 6: acc[doc id] ← acc[doc id] + new rsv
accumulators into a logical two dimensional table as shown
in Figure 4. The dimension of the table is defined by height
In order to find the optimal solution for the width of the
and width, and the number of the flags is the same as the
table, a mathematical model for the algorithm was described
height of the table.
and a simulation was performed. For a detailed discussion of
the mathematical model, the experiments and results, please
see Jia et al. [19].

[Figure 4: The representation of the accumulators in a logical two dimensional table — the accumulator array viewed as height rows of width accumulators, with one initialisation flag (0 or 1) per row.]

4.4 Quantum At a Time
Instead of the traditional approaches of term-at-a-time and document-at-a-time, we propose a new query evaluation approach called quantum-at-a-time. Before the start of a query evaluation, all the quanta of the query terms are sorted on their impact values so that the highest impact quanta can be evaluated first and then the next highest, and so on until some of the remaining quanta cannot cause a change to the top-k documents.
The quantum-at-a-time approach is a mixture between term-at-a-time and document-at-a-time. This new approach is similar to score-at-a-time [2, 3] and Block-Max WAND [12]. The differences are that term ranks are used in score-at-a-time instead of the impact values to sort the quanta, and a block in Block-Max WAND can have postings with different impact values and Block-Max WAND is for document-at-a-
time processing.
The quantum-at-a-time approach is targeted for efficient
As shown in Algorithm 3, the width of the table has to and effective pruning of postings, and better parallel pro-
be a whole number (at least 2), and the height can be cal- cessing of postings lists on multi-core architectures. We will
culated dynamically by referencing the width and the size discuss this work in future publications once we have com-
of the document collection. Extra accumulators (shown as pleted it.
padding in the algorithm) are used to fill the gaps when the
number of accumulators is not evenly divisible by the width.
The allocation of the extra accumulators is so that we can 5. TERM PROXIMITY
perform block operations on whole rows. The number of ex- ATIRE does not currently support positional indexes. We
tra accumulators required is usually small (the worst case is have built several search engines in the past and our ex-
width − 1). periences suggest that positional indexes are not effective
The update operation for the accumulators is shown in under current academic IR evaluation methods that use a

binary relevance model. If the precision improvement of a that are related to different possible interpretations of the
positional index cannot yet be demonstrated in a recognized original query (continuing the above example: the computer
forum such as TREC or INEX then it is difficult to justify company, fruit, record company, etc.) so that each interpre-
having one. tation is given weight according to its likelihood.
Our informal reasoning for this is as follows. If the user One such method for diversification is to cluster the doc-
enters a two word phrase then for a document to contain that uments by topic, and then select documents from clusters
phrase it must also contain both words. For a document to that contain no previously selected documents. Currently
contain that phrase many times it must also contain both the ATIRE search engine does not explicitly diversify re-
those words many times. That is, a document that would sults lists, but this is an active research area for us.
rank highly for the phrase would also rank highly for both When presented as results of clustering documents, rel-
words not as a phrase – and typically they do. evance feedback and diversification are juxtaposed against
Further, examining the precision-at-1 (P@1) score for both each other. Each method uses opposing criteria to se-
approaches: if phrase searching is more effective than term lect documents, with diversification exploring cluster-space,
searching then a specific set of conditions must be met: (1) and relevance feedback exploiting it. However, both of these
the term search must not put a relevant document at po- ideas seem to improve the results.
sition 1, and (2) the phrase search must do so. Simply re-
placing one relevant document with another has no effect on 7. BUT WAIT THERE’S MORE!
precision; and nor does replacing a non-relevant document
In addition to all the above discussed features, the ATIRE
with another non-relevant document.
search engine also supports: stemming including Krovetz
The circumstances necessary for an improvement are hard
and Porter as well as soundex and metaphone; topsig; snip-
to meet; but we accept that they can be so. If both words
pet generation and focused retrieval.
in the phrase are seen frequently in a document, but never
The ATIRE search engine can natively read assessment
as a phrase, then phrase searching should increase precision
formats, evaluate queries against a large number of metrics,
– in this case the phrase acts as a noise filter. An example
and produce runs for evaluation forums.
of such a query is “The Who”. A second example is when
all the words are seen but not as the phrase. Again the
phrase acts as a filter. An example of this can be seen when 8. CONCLUSION AND FUTURE WORK
searching for the musician “Lisa Lisa” on the Apple iTunes In future work we aim to change the storage format of post-
Store. ings lists in order to allow quantum-at-a-time processing dis-
We believe these examples are pathological and can be cussed earlier in Section 4.4. In order to do this, we need to
handled by storing n-grams in the vocabulary and welcome identify where each quantum is stored within a postings list,
an evaluation forum running a phrase search track. Such an which will require the use of an impact header. This header
experiment was conducted at INEX 2009 but was inconclu- structure is in development and we aim to also include sup-
sive (“competitive, but not superior” [13]). port for incremental index updates.
With the ATIRE search engine, we have tackled optimi-
sation of the term-at-a-time processing approach in several
6. RELEVANCE FEEDBACK AND DIVER- areas; (1) The index structure has been optimised. An im-
SIFICATION pact header is created for each postings list for easy ma-
nipulation of those lists (sorted on impact values). (2) The
The ATIRE search engine currently supports the use of
heapk pruning algorithm is used to keep track of the top-k
pseudo-relevance feedback. Pseudo-relevance feedback makes
documents, thus eliminating the need to sort accumulators.
the assumption that the top n returned documents are rele-
(3) The cost of the accumulator initialisation has been min-
vant to the query and inspects those documents to identify
imised by using the logical two dimensional table.
new and relevant keywords.
In future work, we will use this baseline to compare with
In ATIRE we use the KL-divergence for terms inside these
the document-at-a-time and quantum-at-a-time approaches.
top documents to identify the terms which are more likely to
We will also continue our experiments in relevance feedback,
be used inside these top documents than would be expected
diversification, focused and snippet retrieval in INEX. We
by examining the entire collection. These identified terms
hope one of the evaluation forums will run term proximity
are then added to the original query according to Rocchio’s
in the near future.
algorithm [28]. Terms that were added to the query with
this method are given an equal weighting with, and may
duplicate, the original terms. The ATIRE search engine al- 9. REFERENCES
lows for other methods for term selection to be incorporated, [1] V. N. Anh, O. de Kretser, and A. Moffat.
although currently only KL-divergence is supported. Vector-space ranking with effective early termination.
Although we perform relevance feedback by identifying pages 35–42, 2001.
those terms that are used more frequently in relevant doc- [2] V. N. Anh and A. Moffat. Simplified similarity scoring
uments than one would expect, it can be thought of as a using term ranks. In Proceedings of the 28th annual
result of clustering the documents on topic. Relevance feed- international ACM SIGIR conference on Research and
back promotes those documents that belong to clusters that development in information retrieval, SIGIR ’05,
contain documents that have been identified as relevant. pages 226–233, New York, NY, USA, 2005. ACM.
When a search engine is presented with an ambiguous [3] V. N. Anh and A. Moffat. Pruned query evaluation
query, such as “apple”, then it can employ diversification using pre-computed impacts. In Proceedings of the
in order to help maximise the usefulness of the returned re- 29th annual international ACM SIGIR conference on
sults to the user. Diversification aims to select documents Research and development in information retrieval,

SIGIR ’06, pages 372–379, New York, NY, USA, 2006. fast text retrieval. ACM Trans. Inf. Syst.,
ACM. 14(4):349–379, 1996.
[4] P. Bailey, N. Craswell, and D. Hawking. Engineering a [22] A. Moffat, J. Zobel, and S. T. Klein. Improved
multi-purpose test collection for web retrieval inverted file processing for large text databases. pages
experiments. Inf. Process. Manage., 39(6):853–871, 162–171, 1995.
Nov. 2003. [23] A. Moffat, J. Zobel, and R. Sacks-Davis. Memory
[5] R. Blanco and A. Barreiro. Probabilistic static efficient ranking. Inf. Process. Manage.,
pruning of inverted files. ACM Trans. Inf. Syst., 30(6):733–744, 1994.
28(1):1–33, 2010. [24] A. Moffat, J. Zobel, and R. Sacks-Davis. Memory
[6] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and efficient ranking. Inf. Process. Manage.,
J. Zien. Efficient query evaluation using a two-level 30(6):733–744, 1994.
retrieval process. In Proceedings of the twelfth [25] E. S. d. Moura, C. F. d. Santos, B. D. s. d. Araujo,
international conference on Information and A. S. d. Silva, P. Calado, and M. A. Nascimento.
knowledge management, CIKM ’03, pages 426–434, Locality-based pruning methods for web search. ACM
New York, NY, USA, 2003. ACM. Trans. Inf. Syst., 26(2):1–28, 2008.
[7] C. Buckley and A. F. Lewit. Optimization of inverted [26] M. Persin. Document filtering for fast ranking. pages
vector searches. pages 97–110, 1985. 339–348, 1994.
[8] D. Carmel, D. Cohen, R. Fagin, E. Farchi, [27] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered
M. Herscovici, Y. S. Maarek, and A. Soffer. Static document retrieval with frequency-sorted indexes. J.
index pruning for information retrieval systems. pages Am. Soc. Inf. Sci., 47(10):749–764, 1996.
43–50, 2001. [28] J. Rocchio. Relevance feedback in information
[9] C. Clarke, N. Craswell, and I. Soboroff. Overview of retrieval. 1971.
the trec 2004 terabyte track. In Proceedings of TREC, [29] R. Schenkel, F. Suchanek, and G. Kasneci. YAWN: A
volume 2004, 2004. semantically annotated wikipedia xml corpus. March
[10] G. Cormack, M. Smucker, and C. Clarke. Efficient and 2007.
effective spam filtering and re-ranking for large web [30] N. Sherlock and A. Trotman. Efficient sorting of
datasets. Information retrieval, 14(5):441–465, 2011. search results by string attributes. In ADCS ’11, 2011.
[11] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. [31] A. Trotman, X.-F. Jia, and S. Geva. Fast and effective
Introduction to Algorithms. The MIT Press, 1990. [33] H. Turtle and J. Flood. Query evaluation: Strategies
[12] S. Ding and T. Suel. Faster top-k document retrieval Computer Science, pages 229–241. 2010.
using block-max indexes. In Proceedings of the 34th [32] Y. Tsegay, A. Turpin, and J. Zobel. Dynamic index
international ACM SIGIR conference on Research and pruning for effective caching. pages 987–990, 2007.
development in Information Retrieval, SIGIR ’11, [33] H. Turtle and J. Flood. Query evaluation: Strategies
pages 993–1002, New York, NY, USA, 2011. ACM. and optimizations. Information Processing &
[13] S. Geva, J. Kamps, M. Lethonen, R. Schenkel, Management, 31(6):831 – 850, 1995.
J. Thom, and A. Trotman. Overview of the inex 2009 [34] J. Zobel and A. Moffat. Inverted files for text search
ad hoc track. Focused Retrieval and Evaluation, pages engines. ACM Comput. Surv., 38(2):6, 2006.
4–25, 2010.
[14] D. Harman. Overview of the third text retrieval
conference (TREC-3), volume 500. Diane Pub Co,
1995.
[15] D. Harman and G. Candela. Retrieving records from a
gigabyte of text on a minicomputer using statistical
ranking. Journal of the American Society for
Information Science, 41:581–589, 1990.
[16] D. Hawking, N. Craswell, and P. Thistlewaite.
Overview of trec-7 very large collection track. NIST
SPECIAL PUBLICATION SP, pages 93–106, 1998.
[17] X.-F. Jia, D. Alexander, V. Wood, and A. Trotman.
University of otago at inex 2010. In INEX ’10:
Pre-Proceedings of the INEX. ACM, 2010.
[18] X.-F. Jia, A. Trotman, and J. Holdsworth. Fast search
engine vocabulary lookup. In ADCS ’11, 2011.
[19] X.-F. Jia, A. Trotman, and R. O’keefe. Efficient
accumulator initialisation. In ADCS ’10: Proceedings
of the Fifteenth Australasian Document Computing
Symposium, 2010.
[20] X.-F. Jia, A. Trotman, R. O’Keefe, and Z. Huang.
Application-specific disk I/O optimisation for a search
engine. In PDCAT ’08, pages 399–404, 2008.
[21] A. Moffat and J. Zobel. Self-indexing inverted files for

SMART: An open source framework for searching the
physical world

M-Dyaa Albakour, Craig Macdonald Aristodemos Pnevmatikakis


Iadh Ounis John Soldatos
University of Glasgow, UK Athens Information Technology, Greece
{dyaa, craigm, ounis}@dcs.gla.ac.uk {apne, jsol}@ait.gr

ABSTRACT information stemming from both social and sensor networks


User queries are becoming increasingly local where people can be combined. In fact, there is a mutual benefit from the
are interested in what their friends are up to or what is hap- convergence of both sensor networks and social networks.
pening in their local area. Sensors can assist in localised Social networks can benefit from the fact that human activ-
information retrieval by giving the search engine direct ac- ity and intent can be directly derived from sensors, which
cess to events happening in the local world. In this paper, obviates the needs for explicit user input. For example,
we describe an open source framework to search in real- Foursquare1 uses smart phone-based GPS and mapping ser-
time multimedia and social streams. The SMART frame- vices to enable users to track their friends on the social net-
work offers a platform to retrieve information from both work platforms. On the other hand, sensor networks could
the physical world and from people interactions on social start their cooperation in a social way (i.e. based on in-
media. Examples where this framework can be useful in- formation derived from social networks) [4]. For example,
clude “smart cities” where people can have information needs think of the query “I want a good restaurant in a place that
such as ‘what parts of the city have live music on and what is lively now”. To answer this query, we need a system that
do people think about those music events?’. We identify can process information about: (i) How lively a location is
the challenges of building such a framework and the mo- right now. Audiovisual crowd analysis can answer that by
tivations behind releasing it as open source software. The providing metadata from processing the signals of the con-
open architecture of the framework brings about possibili- nected sensors. These signals should be processed in real-
ties for extending it and deploying it in a wide variety of time to provide timely information about the status of the
novel applications. environment in various locations. (ii) How good the vari-
ous restaurants in different areas are. This needs data from
social networks (user-generated content by tagging or “lik-
Keywords ing” good places) and/or from the Linked Data cloud (e.g.
Social search, real-time search, sensor search, smart cities restaurant critics) [7].
In this paper, we present our vision for an open source
framework where information stemming from large-scale
1. INTRODUCTION inter-connected sensors and social network streams can be
The internet has grown in the last decade to connect a indexed in real-time to facilitate searching the physical world.
large number of sensing devices that can monitor the phys-
ical world such as cameras, microphone arrays, or light sen-
sors. The number of sensors connected to the internet is 2. THE SMART FRAMEWORK
orders of magnitude higher than the number of its users [8]. The The SMART2 (Search engine for MultimediA enviRon-
availability of such connected sensors opens opportunities to ment generated contenT) framework aims to provide an in-
collect in real-time the status of the physical world and pro- frastructure where multimedia sensing devices in the phys-
cess this information to develop novel applications in the ical world can be easily used to provide information about
areas of ‘smart’ cities, social networking, surveillance and the status of their environments and make it available in
security. This has triggered the development of tools and real-time for search in combination with information from
techniques for searching sensor data [5, 6]. However, these social networks. The name SMART acknowledges the vi-
methods are still largely based on the indexing and search- sion of “smart cities”.
ing of previously defined (and usually textual) metadata. The architecture of the SMART framework is illustrated
Indeed, while those methods exploit recent advances in sen- in Figure 1, where four layers are identified. At the lowest
sor ontologies [10] in order to decouple the queries from the level (physical) we have the sensing devices that provide the
low-level details of the underlying sensors, they cannot pro- physical world data. The edge node represents the software
vide effective search over arbitrary large and diverse sources layer that processes the raw sensor data to produce meta-
of multimedia data derived from the physical world. data about the environment, which is streamed in real-time
Moreover, with the emergence of social networks such as to the search engine using an appropriate representation
Twitter and Facebook, one can envisage situations where (e.g. RDF). Examples of processing algorithms can include
1
https://foursquare.com/
2
http://www.smartfp7.eu

[Figure 1: Architecture of the SMART framework — (command-line) applications issue queries to the search engine, which draws on SMART-generated collection files, external information from the Linked Data cloud, social networks, and metadata from Edge Nodes #1 … #N.]

[Figure 2: Edge node components — sensor drivers feed perceptual components #1 … #N; an Intelligent Fusion Manager, Social Networks Manager, Configuration Manager and Knowledge Base (KB) assemble the metadata that the edge node delivers to the search engine.]
crowd data analysis for video streams and speech recogni-
tion in audio streams. The search layer collects the streams Tweets in the local area of the sensors [1]). This com-
from the various edge nodes and indexes them in real-time ponent is empowered by a common unified model for the
using an efficient distributed index structure. It also em- metadata of the various feeds, which alleviates their het-
ploys an event detection and ranking retrieval model that erogeneity while also facilitating their management within
uses features identified in the sensor and social streams to the edge node. The approach is similar to that adopted
rank events that are relevant to the user queries. Queries can by the Pachube.com3 platform [3], yet SMART edge nodes
be directly specified or anticipated by the search layer using provide support for much richer metadata. The develop-
contextual information about the user, e.g. the user’s loca- ment and adoption of a common unified metadata model
tion or their social profile. Finally, the uppermost applica- for all SMART feeds ensures the openness and extensibility
tion/visualisation layer offers reusable APIs to develop ap- of the platform in terms of new sensors, perceptual com-
plications that can issue queries to the SMART search engine ponents and social networking feeds. The output of this
and process or visualise the results. component is in the form of continuous metadata streams,
In the next sections, we will discuss in detail the top three which will then need to be processed in real-time. By thresh-
software layers of Figure 1 and the challenges for building olding a single continuous metadata stream, low-level events
each layer. We then describe our vision of the open source are extracted. The moment the threshold is exceeded, the
SMART framework. low-level event is signalled as active, while the moment the
stream receives a smaller value, the low-level event is sig-
3. EDGE NODE LAYER nalled as inactive. The Intelligent Fusion Manager subse-
The edge node is introduced in this section. First, its aims quently undertakes the rule-driven combination of multiple
and challenges are discussed, followed by an example using low-level events (possibly stemming from different sensors or
crowd analysis for metadata extraction. perceptual components). This component integrates sophis-
ticated rule engines that facilitate high-level event recogni-
3.1 Infrastructure Aims tion on the basis of complex rules over multiple low-level
event streams. A rule engine that leverages event calculus
The edge node is the interface of SMART with the phys-
techniques [2] is being integrated. The reasoning component
ical world. Each edge node can cover sensors from a single
leads to a semantic description of the events (i.e. based on
geographic area, e.g. a city block or a public square in the
the RDF format), which is in line with work undertaken
city centre. At the edge node, the signal streams, either from
in the scope of semantic sensor networks [15]. The Linked
physical sensors (e.g. audio/visual streams or environmen-
Data component [7] collects data relevant to the edge node
tal measurements), or from social networks, are processed to
that are available as part of the Linked Data cloud. Hence,
extract events of interest. To achieve this, the design of the
the information available to the search engine is enriched.
edge node is influenced by: (a) state-of-the-art Internet-of-
For example, it can provide the means for identifying geo-
Things platforms, which typically filter, fuse, combine and
locations [13] associated with the target events.
reason over multiple data streams and (b) Linked Data tech-
The edge node stores all types of metadata (the continuous
nologies. The edge node’s architecture is shown in Figure 2.
metadata streams, the low-level and the high-level events)
3.2 Challenges and Research Aims in its Knowledge Base using an XML/JSON interface that
complies with the aforementioned unified data model. The
The data streams collection and processing component edge node also provides a web service API (RESTful API)
provides a uniform way for interfacing to virtually any type to deliver events to the search engine
of sensor and processing. These include perceptual com-
ponents (i.e. the algorithms that attempt to interpret the 3.3 Example Based on Crowd Analysis
streams’ meaning and extract context) running on physi- Crowds are a main source of events for SMART, especially
cal sensors’ feeds (such as crowd analysis running on video in a “smart city” setting where citizens are empowered with
feeds), as well as social networks filters implemented within
3
the Social Networks Manager (e.g. sentiment analysis of Pachube.com is now Cosm.com

[Figure 3: SMART continuous metadata and low-level crowd events as a function of time — crowd density (raw and averaged) plotted against time (mins) across Presentations 1–4; some typically processed frames are also shown.]

detected by the crowd analysis algorithms discussed in Section 3.2, to perform unusual event detection across multiple streams of metadata. The search layer should be capable of anticipating user queries depending on their context, e.g. their location or the time of day. The search layer also offers a web service API (RESTful API) to issue queries and view results as real-time mashups aggregated from the various
types of processed streams.
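To make the low-level event extraction of Section 3.2 concrete for the crowd-density stream of Figure 3, the sketch below thresholds a continuous metadata value and signals an event as active the moment the threshold is exceeded and inactive the moment the value drops back below it; the threshold, the class name and the sample values are invented for the example.

#include <cstdio>

// Turn a continuous metadata stream (e.g. crowd density in [0,1]) into
// low-level events: active when the threshold is exceeded, inactive
// as soon as the value drops back below it.
class low_level_event_detector {
    double threshold;
    bool   active;
public:
    explicit low_level_event_detector(double t) : threshold(t), active(false) {}

    void observe(double timestamp_mins, double value) {
        if (!active && value > threshold) {
            active = true;
            printf("%6.1f min: crowd event ACTIVE (density %.2f)\n", timestamp_mins, value);
        } else if (active && value < threshold) {
            active = false;
            printf("%6.1f min: crowd event INACTIVE (density %.2f)\n", timestamp_mins, value);
        }
    }
};

int main(void) {
    // Toy crowd-density samples, one per minute (cf. the peaks in Figure 3).
    const double density[] = {0.05, 0.08, 0.30, 0.42, 0.38, 0.12, 0.07, 0.33, 0.06};
    low_level_event_detector detector(0.25);   // 0.25 is an arbitrary example threshold
    for (int minute = 0; minute < 9; minute++)
        detector.observe(minute, density[minute]);
    return 0;
}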

4.2 Challenges and Research Aims
One of the main challenges for building the search layer is the efficient and scalable indexing of continuous metadata streams. The search layer is built using the open source Storm6 framework which provides a distributed processing
paradigm, similar to MapReduce, that can handle streams
of data in real-time. We use this architecture to distribute
knowledge about their environment. At a first level of anal- the workload of indexing the streams using Terrier across
ysis, we are interested in quantifying their density, which is multiple processing nodes in a cluster. The index is dis-
based on an adaptive foreground segmentation. This seg- tributed across various shards and an accumulator keeps
mentation is based on a variant of the Stauffer’s algorithm track of the global index statistics. Moreover, Terrier has
[14]. Each pixel in the image is modelled as a Gaussian mix- been enhanced to use real-time, in-memory indices, such
ture (GMM), which is updated using a spatio-temporally that as soon as an update from the edge node is received, it
adapted learning rate [12]. The GMM is then used to de- is indexed, and made available for search. A demonstrator
cide if the pixel belongs to the foreground by adaptively that uses this infrastructure to search Twitter in real-time is
thresholding the accumulated sorted weight of the Gaus- available online.7
sians. Therefore, the foreground mask is formed and is then Another main challenge in the search layer is developing
cleaned-up by shadow removal and a morphological clean- an event retrieval model that can rank ‘interesting’
up. The weighted density of foreground pixels gives the events based on a long-term pattern identification of
crowd density a continuous metadata stream. Using this metadata streams. The event retrieval model can make in-
stream, the low-level events related to crowd appearance ferences on interestingness, based on how unusual an event
or disappearance are easily obtained via thresholding. An is by comparing metadata features, such as the crowd level,
example is illustrated in Figure 3, which shows the crowd observed at a specific location in a specific time to global
density metric as a function of time. This has three large features observed in similar areas at similar times. For ex-
peaks, for which the corresponding processed frames indi- ample, a crowded square in a city on a Friday evening is less
cating the foreground blobs are given. Another frame is an interesting than a crowded narrow street on a Sunday morn-
example of minor activity. The extracted low level metadata ing as this will be reflected in the background statistics of the
indicating crowded intervals are also shown as labelled hor- model. Moreover, learning-to-rank retrieval approaches [9]
izontal lines. A sample video demonstrating how the crowd that use features from the sensor metadata (e.g. the crowd
analysis works is available online.4 level locally and globally) are applied to facilitate the rank-
ing of all events happening in different locations. In addition
4. SEARCH ENGINE LAYER to features extracted from the sensor metadata streams, tex-
tual evidence from the social networks can be associated to
The search engine layer is discussed in more detail in this
events. The event retrieval model can associate keywords
section, where we identify its aims and the main challenges
to those events. For example, people who tweet about live
of developing this layer.
music in a city’s main square may mention the band’s name
4.1 Infrastructure Aims or the song that is being performed.
The SMART search layer indexes in real-time streams of
updates from edge nodes and social networks. It is built 5. APPLICATION LAYER
using the Terrier5 open source search engine [11] with en- The uppermost layer of the SMART platform (see Fig-
hanced real-time indexing and a scalable distributed archi- ure 1) contains the software applications that can deliver
tecture to handle the large amount of streams. The SMART the real benefits of the framework to end users. The appli-
search layer offers an interface to services and end users to cation layer mainly supports developers who want to create
retrieve ‘interesting’ events and associated relevant posts in Web 2.0 services or smart phone applications that exploit the
the social networks for a given query. While an interest- framework capabilities. For example, the application layer
ing event is a subjective notion that likely depends on the includes open source web applications that offer user inter-
application, the search layer can make inferences on inter- faces to issue queries explicitly, or implicitly using the user
estingness, based on how unusual an event is, and learning context, to the search engine API and receive in real-time
from training examples of interesting events. In other words, up-to-date results (events). In addition, it includes open
the search engine layer uses the short-term event detection source mashups that use the search layer visualisation APIs
that is performed in the edge node, e.g. the low-level events
6
https://github.com/nathanmarz/storm/
4 7
http://www.youtube.com/watch?v=akkyRu68rqE http://www.smartfp7.eu/content/twitter-indexing-
5
http://terrier.org demo

to display, for example, newly-breaking events as real-time balloon pop-ups on a map.

6. OPEN SOURCE VISION
SMART is designed as an open source framework, extensible in terms of sensors, multimedia processing components and event retrieval models. As described in Section 4, the main components of the SMART search engine are built upon the existing Terrier open source information retrieval platform, allowing for the real-time indexing and retrieval of multiple and massive-scale sensor and social network streams. The SMART open source framework is designed to benefit from the power of the open source development philosophy, by enabling application developers and organisations to build new tailored services and products on top of the SMART open source infrastructure.
In particular, SMART will form an open source community for sustaining and evolving its components. It adopts a crowd-sourcing approach to the deployment of physical sensors, social networking feeds and associated repositories, which will become searchable through SMART. Thanks to an open specification for describing data streams, the open source framework makes it easy for prospective information providers (including sensor infrastructure providers) to connect and contribute edge nodes and data feeds (as described in Section 3) to the SMART search engine. Hence, SMART is designed to integrate a variety of community-based sensor feeds contributed by third parties such as smart cities, sensor deployers and individuals. For example, algorithms that analyse the level of water pollution can be implemented for the corresponding sensors and made available to the fishing industry. Likewise, the SMART open source infrastructure supports virtual sensor streams such as data feeds stemming from social networks (including filters over social networks such as Twitter). In this case, SMART allows social sensors (e.g. gender analysis or sentiment analysis filters on Twitter) to be used while developing applications for smart cities. Finally, SMART adopts the business-friendly MPL 2.0 (Mozilla Public License) in order to make it easier for service integrators to build custom search applications in response to the specific business requirements of their customers (e.g. surveillance applications). The open source application layer makes it easier for such services to be rapidly implemented. In this way, SMART intends to support both a public crowd-sourcing paradigm and a private enterprise-related one. The first release of SMART is planned for the end of 2012.

7. CONCLUSIONS
We introduced an open source unified framework that allows the real-time indexing and retrieval of sensor and social streams. The framework bridges the gap between social and sensor networks and brings them closer together. The framework is currently being developed as part of the EC co-funded project SMART. We presented the research challenges in implementing the various components of the framework. In particular, the main challenges reside in developing a uniform interface to the processing algorithms of the sensor streams, the effective real-time indexing of social and sensor metadata streams, and the development of efficient and effective event retrieval models. Releasing the framework as open source software and sustaining an open source community to support it is a strategic decision for a wider spread of this new technology and a wider participation from the industrial and research communities.

Acknowledgments
Part of this work has been carried out in the scope of the EC co-funded project SMART (FP7-287583). The authors acknowledge contributions from all partners of the project.

8. REFERENCES
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau. Sentiment Analysis of Twitter Data. In Proceedings of LSM'11, 2011.
[2] A. Artikis, M. Sergot, and G. Paliouras. Run-Time Composite Event Recognition. In Proceedings of DEBS'12, 2012.
[3] E. Borden. Pachube Internet of Things "Bill of Rights". http://blog.cosm.com/2011/03/pachube-internet-of-things-bill-of.html.
[4] J. G. Breslin, S. Decker, M. Hauswirth, G. Hynes, D. Le Phuoc, A. Passant, A. Polleres, C. Rabsch, and V. Reynolds. Integrating Social Networks and Sensor Networks. In Proceedings of W3C-FSN'09, 2009.
[5] J. Camp, J. Robinson, C. Steger, and E. Knightly. Measurement Driven Deployment of a Two-tier Urban Mesh Access Network. In Proceedings of MobiSys'06, 2006.
[6] D. Guinard and V. Trifa. Towards the Web of Things: Web Mashups for Embedded Devices. In Proceedings of WWW'09, 2009.
[7] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011.
[8] A. Jeffries. A Sensor In Every Chicken: Cisco Bets on the Internet of Things. ReadWriteWeb, 2009.
[9] T.-Y. Liu. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
[10] D. O'Byrne, R. Brennan, and D. O'Sullivan. Implementing the Draft W3C Semantic Sensor Network Ontology. In Proceedings of PERCOM Workshops, 2010.
[11] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of ACM SIGIR-OSIR'06, 2006.
[12] A. Pnevmatikakis and L. Polymenakos. Robust Estimation of Background for Fixed Cameras. In Proceedings of CIC'06, 2006.
[13] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A Core for a Web of Spatial Open Data. Semantic Web Journal, 3(4), 2012.
[14] C. Stauffer and W. E. L. Grimson. Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, 2000.
[15] K. Taylor. Semantic Sensor Networks: The W3C SSN-XG Ontology and How to Semantically Enable Real Time Sensor Feeds. In Proceedings of SemTech'11, 2011.

First Experiences with TIRA for
Reproducible Evaluation in Information Retrieval

Tim Gollub, Steven Burrows, Benno Stein


Bauhaus-Universität Weimar
99421 Weimar, Germany
<first name>.<last name>@uni-weimar.de

ABSTRACT
The verifiability and comparability of computational experiments is a major shortcoming in scientific publications, even at top conferences. In recent years, various services have emerged that try to address this problem by providing a global platform where researchers can upload programs along with experiment results. However, these platforms are not well accepted, partly due to their inherent top-down character: a single institution prescribes the formats and technologies to be used. We argue that a community-wide evaluation platform can evolve only from an ongoing bottom-up effort.
For the field of information retrieval we have been undertaking concrete steps to launch and foster this idea with TIRA [4]. Here, we present the concept and an implementation of a web-based experimentation environment that greatly simplifies maintenance and publishing of executable experiments for a research group. TIRA's system architecture retains researchers' full control over their research assets; moreover, no constraints with respect to data formats or programming technologies are prescribed. We see several reasons for researchers to publish their experiments as a web service with TIRA, namely, to simplify their experiment design and execution, to gain credibility, and to easily disseminate results.
This paper reports on experiences from developing TIRA towards our goal. Design goals are reviewed, existing evaluation platforms are analyzed, and the architecture of our current implementation is presented. In particular, we present insights from the first widespread use of TIRA at the PAN series of international plagiarism detection competitions in 2012. Altogether, our review is promising: the design decisions underlying TIRA are both powerful and flexible enough to cope with the widely varying programming preferences of the researchers.

Categories and Subject Descriptors: H.5.3 [Information Systems]: Information Interfaces and Presentation—Group and Organization Interfaces
Keywords: Open Evaluation, Experiment Management, Result Dissemination

1. MOTIVATION
John Ioannidis attracted considerable attention in 2005 with his essay "Why Most Published Research Findings Are False" [5]. Ioannidis argues that research findings published in papers are likely to be biased towards the approaches of the authors, commonly because of selective result reporting and unequal parameter tuning efforts. To improve upon this situation, he concludes that official evaluation initiatives are needed where researchers register their approaches for an objective assessment. In addition, the SWIRL 2012 meeting of 45 information retrieval researchers considered evaluation a "perennial issue in information retrieval", and noted that a "community evaluation service" is of specific interest [1]. With initiatives such as TREC1, CLEF2, and PAN3, the information retrieval research community has established evaluation campaigns with great success, with datasets of past campaigns being frequently used for current research.
We see two major limitations of these initiatives that we want to overcome with our open source evaluation platform TIRA (the "Testbed for Information Retrieval Algorithms"). First, it is obvious that the scale of these initiatives cannot address all interesting research questions that arise. To cover the bulk of remaining research questions, a community-wide evaluation campaign is needed that is supported by convenient open software. Any researcher must be empowered to easily set up and conduct an evaluation initiative for a specific task of interest. Second, the annual schedule of renowned evaluation initiatives is problematic. In this respect, Armstrong et al. [2] analyzed the performance results achieved outside the official TREC initiative on various TREC collections as published in SIGIR and CIKM papers from 1998 to 2008. The findings showed that the vast majority of these papers are not streamlined with the official TREC results, which in turn leads to a series of false conclusions in the papers and "improvements that don't add up". To avoid ignoring existing results, ongoing evaluation initiatives are needed that continuously integrate new results submitted over the Web.
With TIRA, we are developing an open source evaluation platform where we aim to overcome the limitations stated above [3, 4]. The decisive feature of TIRA is that the software can be downloaded by any research group to organize and conduct an evaluation initiative on their local computing infrastructure. For every experiment, TIRA provides a web service through which participants can submit their algorithms or results at any time. TIRA evaluates new submissions automatically by executing the experiment evaluation software provided by the evaluation organizers from the command line of the underlying operating system. All experiment results are stored and indexed in a database, which is queried by the web service to display the current results.
In the remaining sections of the paper, the design goals for TIRA are presented and compared to existing experiment platforms in Section 2, whereas in Section 3 we explain the system architecture of TIRA in detail. In Section 4, we give an experience report of our first significant deployment of TIRA at the PAN plagiarism detection competition, and we provide lessons learned and future recommendations. We then summarize our work in Section 5.

The copyright of this article remains with the authors.
SIGIR 2012 Workshop on Open Source Information Retrieval.
August 16, 2012, Portland, Oregon, USA.

1 http://trec.nist.gov
2 http://www.clef-initiative.eu
3 http://pan.webis.de

2. DESIGN GOALS AND RELATED WORK iment services that can index datasets with state-of-the-art natural
Our efforts to make the deployment of TIRA as simple and con- language processing technology have the potential to raise the com-
venient as possible led to a set of five design goals that we consider parability of retrieval model research to a higher level. For cluster-
as crucial for its widespread use. The design goals are based on ing and result diversification research, comparability is enhanced
the needs for local instantiation, web dissemination, platform in- by establishing static snapshots of the search results from major
dependence, result retrieval, and peer to peer collaboration. Our search engines regularly. The persistent storage of experiment re-
assessment of existing experimentation frameworks with respect to sults by the experimentation framework is key to achieve this goal.
these goals is depicted in Table 1, which shows that none of these Even if the public release of an experiment service is not desired,
systems fully comply. the framework is still useful if it assumes responsibility for manag-
ing the raw experiment results and making them available across a
Table 1: Assessment of existing experimentation frameworks with re- research team.
spect to our five proposed design goals. 5. Peer to Peer Collaboration. Consider a scenario where a con-
sortium of service providers become renowned gatekeepers for var-
Tool URL Domain 1 2 3 4 5
ious streams of research, and maintain the community-wide reposi-
evaluatIR 1 IR ✕ X X X ✕ tory of state-of-the-art algorithms, datasets, and experiment results
expDB 2 ML ✕ ✕ ✕ X ✕ on their web site. The gatekeepers drive the standardization of data
MLComp 3 ML ✕ X ✕ X ✕ formats and can, by utilizing the retrieval facility, stage competi-
myExperiment 4 any ✕ X X X ✕
5
tions in a semi-automated fashion. A mechanism for connecting the
NEMA IR ✕ X ✕ X ✕ local framework instances to a network of experimentation nodes
TunedIT 6 ML, DM X X ✕ X ✕
7
has to be provided to achieve this scenario. Note that currently none
Yahoo Pipes Web ✕ X ✕ ✕ ✕
of the experimentation platforms implements peer to peer collabo-
1 http://www.evaluatir.org/ 5 http://www.music-ir.org/
2 http://expdb.cs.kuleuven.be/expdb/ 6 http://www.tunedit.org/
3 http://www.mlcomp.org/ 7 http://pipes.yahoo.com/
3 http://www.mlcomp.org/ 7 http://pipes.yahoo.com/
4 http://www.myexperiment.org/ 3. SYSTEM ARCHITECTURE
The basic functionality of TIRA is to take a locally executable
program and turn it into a web service. To use TIRA for this pur-
1. Local Instantiation. In case data must be kept confidential,
pose, the software is first downloaded and instantiated on the lo-
the platform must be able to reside with the data, hence the plat-
cal computing infrastructure. System compatibility should not be-
form must be locally installable. Unlike centralized experiment
come an issue here, since we distribute TIRA as an executable Java
platforms like MLComp and myExperiment, local instantiation al-
JAR file.4 For the deployment of new programs, TIRA requires a
lows experiments on sensitive data to be published as a service from
program specification file in JSON format: the ProgramRecord, as
a local host. External researchers can then use the service for com-
shown in Figure 1. In its minimal form, the ProgramRecord com-
parison and evaluation of their own research hypotheses, whilst the
prises (1) a unique name for the program, (2) the generic structure
experiment provider is in full control of the experiment resources.
of the program execution command, and (3) the value range of each
2. Web Dissemination. URLs are definitive identifiers for digital
input parameter that affects the output of the program. An example
resources. If all runs of an experiment are accessible over a unique
of a generic program execution command and its respective input
URL, researchers can conveniently link the results in a paper with
parameter specification is given in Figure 2. In general, more com-
the experiment service used to produce them. Especially for stan-
plex commands are possible that concatenate multiple programs
dard pre-processing tasks or evaluations on private data, such a web
via UNIX-pipes or define parameter substitutions that produce non-
service can become a frequently cited resource. In addition, at-
terminals (further parameters).
tention can be attracted to one’s work through integration of the
Provided with the information in the ProgramRecord, TIRA in-
service into home pages and blog articles. To address the issue of
stantiates and updates all system components that are needed to
digital preservation, URLs should encode all information needed to
establish a web service for the new program. All system com-
recompute a resource, such as program and input parameter speci-
ponents are shown in Figure 1. The operating principle of TIRA
fications, in case stored data is lost.
can be described as two major processes: the front-end process
3. Platform Independence. The sophisticated and varying soft-
dealing with user interaction, and the back-end process dealing
ware and hardware requirements of information retrieval experi-
with program execution. As indicated in the component dia-
ments as well as individual coding preferences of software devel-
gram, the ProgramDatabase takes on a special role in TIRA’s sys-
opers render any development constraints imposed by the experi-
tem architecture, since it links the two processes together. The
mentation framework critical for its success. Ideally, software de-
ProgramDatabase is instantiated for each ProgramRecord individ-
velopers can deploy experiments as a service unconstrained by the
ually, it stores past and pending program runs, and it indexes the
utilized operating system, parallelization paradigm, programming
input parameters of the runs to provide basic retrieval functionality.
language, or data formats. Local instantiation is one key to real-
Note that besides the default local database, TIRA can also connect
ize this goal. Furthermore, the experimentation framework must
to a database on a foreign TIRA instance to accomplish peer-to-
operate as a layer strictly on top of the experiment software and
peer collaboration. The front-end and back-end processes are un-
should use, instead of close intra-process communication such as
affected by this distinction. In the remainder of this section, the
in TunedIT, standard inter-process communication on the POSIX
components of these two processes are described beginning with
level and the file system to exchange information. This way, any
the back-end process first, followed by the front-end process lastly.
running software can be deployed as a web service without internal
The TIRA back-end process involves the ProgramWrapper and
modifications.
ProgramScheduler system components. For each ProgramRecord,
4. Result Retrieval. Especially for computationally expensive
an individual ProgramWrapper is instantiated to query its asso-
retrieval tasks, the maintenance of a public result repository can
4
become a valuable asset of a research group. For example, exper- See http://tira.webis.de for latest TIRA release information.

[Figure 1 (diagram): an HTTP CLIENT interacts with the TIRA SERVER, which looks up and updates the PROGRAM DATABASE; per PROGRAM RECORD, a PROGRAM WRAPPER polls the database and registers runs with the PROGRAM SCHEDULER for execution. The front-end process is on the left, the back-end process on the right.]

Figure 1: Component diagram of TIRA. Towards the left, the front-end process dealing with the user-interaction is illustrated. To the right, the
back-end program execution process is shown. Requests are illustrated by arrows and imply a response from the requested component.

python myexp.py $param1 $param2 > result.txt

$param1 -> a | b | c
$param2 -> [0-9]+
Figure 2: BNF grammar for a Python program “myexp” with two input
parameters for execution in TIRA.
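To make this concrete, the sketch below shows how a ProgramRecord for the "myexp" program of Figure 2 might be assembled and written out as JSON. The field names used here are illustrative assumptions only; the paper does not prescribe a concrete schema.

import json

# Hypothetical ProgramRecord for the "myexp" program of Figure 2.
# Field names are assumptions for illustration; TIRA's actual schema may differ.
program_record = {
    "name": "myexp",                                            # (1) unique program name
    "command": "python myexp.py $param1 $param2 > result.txt",  # (2) generic execution command
    "parameters": {                                             # (3) value range per input parameter
        "param1": {"type": "enumeration", "values": ["a", "b", "c"]},
        "param2": {"type": "pattern", "regex": "[0-9]+"},
    },
}

with open("myexp.json", "w", encoding="utf-8") as f:
    json.dump(program_record, f, indent=2)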
ciated ProgramDatabase continuously for pending program runs.
Given that TIRA instances might be equipped with different re-
sources in a collaborative environment, the lookup request sent may
contain constraints with respect to accepted input parameter values.
When a matching program run is received, the ProgramWrapper
registers this at the ProgramScheduler for addition to an execution
queue. The ProgramScheduler keeps a pool of system threads,
which continuously take the next run in the queue and request
its execution. To start the program, the generic command in the
ProgramRecord is substituted with the run-specific values and is Figure 3: Screenshot of a TIRA web page for the PAN competition
called inside a run-specific working directory. During execution, 2012. On the web page, PAN participants specify a dataset and upload
the ProgramWrapper listens on the error output stream and up- their plagiarism detection results. On execute, TIRA runs an evalua-
dates the database with notifications and results, which then be- tion script and displays the performance assessment for the submission.
come available to the front-end process.
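TIRA itself is a Java application, so the following is only a rough Python paraphrase of the back-end behaviour just described; the db object with its fetch_pending_run and store_result helpers, and the substitute function, are assumed stand-ins rather than real TIRA APIs.

import subprocess

def poll_and_execute(db, record, substitute, workdir):
    # One iteration of a ProgramWrapper-style loop: fetch a pending run,
    # execute it, and report notifications and results back to the database.
    run = db.fetch_pending_run(record.name)        # assumed helper: next run marked 'pending'
    if run is None:
        return
    # Fill the $-parameters of the generic command with the run-specific values.
    command = substitute(record.command, run.parameters)
    # Execute inside a run-specific working directory, listening on the error stream.
    proc = subprocess.run(command, shell=True, cwd=workdir,
                          capture_output=True, text=True)
    db.store_result(run.id,                        # assumed helper: persist status and output
                    status="done" if proc.returncode == 0 else "failed",
                    notifications=proc.stderr)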
This gives TIRA users a convenient means to execute a series of
The TIRA front-end process involves the remaining TiraServer
runs with a single parameter specification. In case a combination
and HttpClient system components. The HttpClient is usually
of parameter values has not been seen before, a new program run is
a web browser controlled by a TIRA user, but also a TIRA in-
created with a pending status and stored in the database. Responsi-
stance may fill this role to communicate with other TIRA in-
bility for the pending run is handed to and executed by the back-end
stances. For each ProgramRecord, the HttpClient can access a process.
web page on the TiraServer via a program-specific URL (e.g.
http://<domain>/program/myexp). A screenshot of a TIRA web
page is given in Figure 3. The TIRA web page features the pro- 4. ANALYSIS OF TIRA AT PAN
gram input parameters as HTML form elements, and offers func- In this section we report on the first deployment of TIRA in an
tionality for retrieving program runs with specific parameter values official evaluation campaign. TIRA has been used as the train-
(Search) and for executing new runs (Execute). The result table at ing and evaluation platform for the “detailed comparison” task of
the bottom contains the current execution status, and the results of the 2012 international PAN plagiarism detection competition.5 The
all executed program runs are displayed. If the value range of an in- competition started with the release of training data in March 2012,
put parameter is specified in the ProgramRecord as an enumeration and officially ended after the evaluation of the participant submis-
(cf. $param1 in Figure 2), the input values are listed in a selection sions in July 2012. For TIRA, its successful deployment in the
box. Otherwise in the case of an intrinsic definition (cf. $param2 challenge constitutes an important milestone and was an excellent
in Figure 2), a text input field is given instead. As a third option, opportunity to analyze the software under realistic conditions.
TIRA allows submission files as input parameters, in which case a The participants of the PAN “detailed comparison” challenge
file upload element is shown to the user. were asked to develop software capable of solving the following
To retrieve specific program runs, the TIRA user can specify task: Given a suspicious document and a potential source document
a subset of the input parameters and submit the HTML form by pair, extract and record all plagiarized passages from the suspicious
clicking the Search button. The TiraServer looks up the database document and the corresponding source passages from the source
and returns a web page with the matching results. For retrieval re- document. Unlike the previous PAN competitions, the participants
quests, the form is submitted using the HTTP GET method, which of 2012 did not submit their detection results on an unlabeled test
means that all form values are encoded into the URL. This URL can set, but instead submitted their software. This strategy allowed a
thus be used for the dissemination of results as discussed in Sec- set of real plagiarism cases subject to non-disclosure to be incorpo-
tion 2. In case all input parameters are populated with valid values, rated into the test set to improve the authenticity of the evaluation.
the execution of the program can also be requested. Note that the In addition, the organizers could evaluate the runtime characteris-
TiraServer handles multiple values for parameters by generating an tics of the submitted approaches for the first time.
independent program run for each possible combination of values. 5
http://pan.webis.de/
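As a rough illustration of this behaviour, a multi-valued parameter specification can be expanded into one run per value combination, and each combination can be encoded into a shareable GET URL. The /program/<name> path follows the example given above, but the query-string layout is an assumption.

import itertools
from urllib.parse import urlencode

def expand_runs(domain, program, multi_params):
    # Yield one (run, URL) pair per combination of the supplied parameter values.
    names = sorted(multi_params)
    for combo in itertools.product(*(multi_params[n] for n in names)):
        run = dict(zip(names, combo))
        # Encoding the run into a GET URL makes the result linkable, e.g. from a paper.
        yield run, "http://{}/program/{}?{}".format(domain, program, urlencode(run))

for run, url in expand_runs("example.org", "myexp",
                            {"param1": ["a", "b", "c"], "param2": ["7", "42"]}):
    print(run, url)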

Two TIRA services were deployed to support the running of the tion software as a TIRA service on our computing infrastructure.
competition: (1) A service to compute performance scores on the For future evaluation initiatives, we aim to develop an automated
training data, and (2) A service for the evaluation of the software program deployment mechanism for TIRA: Participants download
submissions on the private test set. We now describe how TIRA has the evaluation resources for a competition and deploy them on a lo-
been used in each of these settings in the remainder of this section. cal TIRA instance. Once developing and testing on the local TIRA
instance is done, TIRA sends the final ProgramRecord and software
4.1 Training Phase Evaluation Service to the official TIRA evaluation instance, where it is automatically
For the training phase, the organizers released a dataset with deployed and evaluated.
ground truth to be used by the participants to train their approaches.
A TIRA service was provided to evaluate the performance of an 5. SUMMARY
approach using the training set. On the TIRA service web page, Creating fully reproducible and comparable experiments in in-
participants were able to upload their compressed detection results formation retrieval is highly desirable, and various researchers have
and receive the “PlagDet” performance score in return (cf. Fig- pointed out that advances in the state of the art in this field are dif-
ure 3), which combines aspects of precision, recall, and granular- ficult to account without such an achievement. A software service
ity. To compute PlagDet, the compressed submissions were ex- that meets this challenge and that is accepted within the research
tracted and evaluated with a Python implementation of PlagDet. community must provide features such as local instantiation, web
The generic execution command used by TIRA for the eval- dissemination, platform independence, result retrieval, and peer
uation was hence: “unzip -qq -o $det -d det && python to peer collaboration. The TIRA platform addresses these goals
perfmeasures.py -p $truth -d det > scores.txt”. The as a new web service to organize and operationalize specific pro-
parameters starting with a $-symbol were substituted according to grammable tasks runnable on the command line. Recently, TIRA
the data provided in the web page input fields similar to the exam- has been deployed “in the wild” for the PAN series of international
ple described in Section 3 and Figure 2. plagiarism detection competitions. Our preliminary findings are
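Concretely, for each uploaded submission the two $-parameters are replaced by the path of the uploaded archive and the path of the ground truth before the command is handed to the shell. A minimal sketch of this substitution (the file names are illustrative):

import subprocess

EVAL_TEMPLATE = ("unzip -qq -o $det -d det && "
                 "python perfmeasures.py -p $truth -d det > scores.txt")

def evaluate_submission(detections_zip, truth_path):
    # Materialise the generic evaluation command for one submission and run it.
    command = (EVAL_TEMPLATE
               .replace("$det", detections_zip)
               .replace("$truth", truth_path))
    return subprocess.run(command, shell=True).returncode

# e.g. evaluate_submission("participant-07-detections.zip", "pan12-train-truth/")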
For the organizers of PAN, the evaluation service provided some positive: even complex evaluations of software submissions can be
feedback about the progress of the participants. In the past com- easily managed, compared, and published. Based on this experi-
petitions, the organizers observed that the majority of participants ence we aim to further develop TIRA towards a convenient tool for
started working seriously only in the few days before the sub- the information retrieval community to conduct evaluation initia-
mission deadline. With the public evaluation service, we hoped tives.
to create an atmosphere where participants were motivated by the
recorded PlagDet scores to date acting as a leader board. One week
prior to the submission deadline, the evaluation service received 12
Acknowledgements
submissions from two of the eleven final participants. Three further The related work and architecture description (Sections 2 and 3)
participants started making submissions in the final week, resulting are reproduced from our paper “TIRA: Configuring, Executing, and
in 38 computed PlagDet scores altogether. The remaining six par- Disseminating Information Retrieval Experiments” [4].
ticipants did not use the training phase evaluation service, and may
have simply elected to evaluate their training results offline. Al- References
though the TIRA service was a useful tool for the participants, we
[1] J. Allan, W. B. Croft, A. Moffat, and M. Sanderson. Frontiers,
learned that further incentives for its usage must be provided to ef-
Challenges, and Opportunities for Information Retrieval: Re-
fectively foster the early tinkering within the competition.
port from SWIRL 2012 The Second Strategic Workshop on In-
4.2 Test Phase Evaluation Service formation Retrieval in Lorne. SIGIR Forum, 46(1):2–32, May
2012.
In the test phase, TIRA was used to organize and conduct the
evaluation of the submitted programs. In total, we received eleven [2] T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. Im-
plagiarism detection programs for evaluation on the hidden test set. provements that don’t add up: Ad-hoc Retrieval Results since
Coincidentally, eleven “external detection” result sets were submit- 1998. In D. W.-L. Cheung, I.-Y. Song, W. W. Chu, X. Hu, and
ted in 2011 [6], suggesting that the submission of software was an J. J. Lin, editors, Proceedings of the Eighteenth ACM Confer-
acceptable demand of the participants. The software received var- ence on Information and Knowledge Management, pages 601–
ied greatly with respect to its size, runtime performance, and pro- 610, Hong Kong, China, Nov. 2009.
gramming language used, and we received submissions for both [3] T. Gollub, B. Stein, and S. Burrows. Ousting Ivory Tower Re-
Windows and Linux operating systems. In this respect, the system search: Towards a Web Framework for Providing Experiments
independence of TIRA has been successfully demonstrated. We as a Service. In B. Hersh, J. Callan, Y. Maarek, and M. Sander-
managed to get all submitted software running, and with the excep- son, editors, Proceedings of the Thirty-Fifth International ACM
tion of one submission, the output files produced were valid. For Conference on Research and Development in Information Re-
each of the submissions, we created a ProgramRecord based on the trieval (to appear), Aug. 2012.
installation manual provided by the participants. Although the pro- [4] T. Gollub, B. Stein, S. Burrows, and D. Hoppe. TIRA: Con-
grams sometimes demanded inconvenient input specifications for figuring, Executing, and Disseminating Information Retrieval
processing the test data, the powerful parameter substitution mech- Experiments. In A. M. Tjoa, S. Liddle, K.-D. Schewe, and
anism of TIRA made the task achievable. To evaluate each sub- X. Zhou, editors, Proceedings of the Ninth International Work-
mission against the test set, we implemented an additional TIRA shop on Text-based Information Retrieval at DEXA (to appear),
service that sends an execution request for every document pair in Sept. 2012.
the test set to the TIRA service of the submission. Here, the web [5] J. P. A. Ioannidis. Why Most Published Research Findings Are
dissemination capability of TIRA is highly convenient. False. PLoS Medicine, 2(8):696–701, Aug. 2005.
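The driver service behind this setup is conceptually simple: for every suspicious/source document pair it issues one HTTP request to the web service wrapping the participant's software, with the pair encoded in the query string. The sketch below illustrates the idea; the URL layout and parameter names are assumptions rather than TIRA's documented interface.

from urllib.parse import urlencode
from urllib.request import urlopen

def drive_submission(service_url, document_pairs):
    # Issue one execution request per (suspicious, source) document pair
    # to the web service wrapping a participant's detector.
    for suspicious, source in document_pairs:
        query = urlencode({"suspicious": suspicious, "source": source})
        with urlopen(service_url + "?" + query) as response:  # each request triggers one run
            response.read()

# e.g. drive_submission("http://tira.example.org/program/detector07",
#                       [("suspicious-001.txt", "source-042.txt")])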
In the near future we plan to give the PAN 2012 participants the [6] M. Potthast. Technologies for Reusing Text from the Web. PhD
opportunity to opt-in for a public release of their plagiarism detec- Thesis, Bauhaus-Universität Weimar, Dec. 2011.

Design, implementation and experiment of a YeSQL Web
Crawler

Pierre Jourlin Romain Deveaud Eric Sanjuan-Ibekwe


Laboratoire d’Informatique Laboratoire d’Informatique Laboratoire d’Informatique
d’Avignon d’Avignon d’Avignon
Université d’Avignon et des Université d’Avignon et des Université d’Avignon et des
pays de Vaucluse pays de Vaucluse pays de Vaucluse
BP 1228, 84911 AVIGNON BP 1228, 84911 AVIGNON BP 1228, 84911 AVIGNON
CEDEX, France CEDEX, France CEDEX, France
[email protected] Romain.Deveaud@univ- [email protected]
avignon.fr
Jean-Marc Francony Françoise Papa
UMR PACTE UMR PACTE
Université Pierre Mendès Université Pierre Mendès
France - Grenoble 2 France - Grenoble 2
Grenoble, France Grenoble, France
[email protected] [email protected]

ABSTRACT
We describe a novel, "focusable", scalable, distributed web crawler based on GNU/Linux and PostgreSQL that we designed to be easily extendible and which we have released under a GNU public licence. We also report a first use case related to an analysis of Twitter streams about the French 2012 presidential elections and the URLs they contain.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Algorithms; Design; Experimentation

Keywords
Web Crawler; Web Robot; Web Spider; PostgreSQL; Twitter; Web; Social Networks

1. INTRODUCTION
Where scalability is concerned, Apache Nutch1 and Heritrix2 are probably the best-known and most accomplished open-source web crawlers. They are both sensible choices for Information Retrieval (IR) researchers who intend to build large web corpora. They can be configured to specific needs and can be extended and modified. However, the Java-language source code3 of these two software toolkits is rather large and complex: 29349 lines of source code for Apache Nutch (v1.4) and 107377 for Heritrix (v3.1.0). Another possible drawback from the researcher's perspective is that they both access the data using unconventional systems: Nutch relies on Hadoop and Heritrix relies on its own code for handling Internet Archive ARC files.
These systems belong to the "NoSQL" or "UnQL" approaches, supported by the assumption that the widely used SQL relational database standard is an inherent cause of scalability issues. However, this assumption is contested by several database experts. For instance, recent developments around the PostgreSQL project allow it to perform as well as, and sometimes outperform, some NoSQL databases [3]. This alternative approach has been named YesQL.
By taking advantage of the capabilities of a PostgreSQL server, we implemented our web crawler in a total of only 911 lines of C-language code and 200 lines of SQL and PL/pgSQL. At the time this article was written, and as far as we know, this is the only available web crawler that is based on PostgreSQL. The tests we performed have shown that instances of the crawler could process over 20 million URLs in a few days without being noticeably slowed by database operations. We thus believe this web crawler is well worth considering by IR researchers and programmers.

1 http://nutch.apache.org/
2 https://webarchive.jira.com/browse/HER
3 There are alternatives written in Python, e.g. Mechanize (36419 lines of code) and Scrapy (23096 lines of code).
4 https://github.com/jourlin/WebCrawler
The copyright of this article remains with the authors. SIGIR 2012 Workshop on Open Source Information Retrieval. August 16, 2012, Portland, Oregon, USA.

2. SOFTWARE DESCRIPTION
The source code repository is located at GitHub4 under a GNU public license. Everyone can therefore easily download an up-to-date version of the toolkit, provide feedback, or join the developer team. The crawling system can be briefly summarized as follows:

• Links and URLs' data are stored in a PostgreSQL5 database.
• The user can launch several crawler instances on several, possibly distant machines.
• Each instance of the crawler iteratively:
1. fetches a list of URLs to be explored by sending a simple SQL query to the database;
2. downloads the web pages;
3. extracts new hypertext links to possibly new URLs;
4. sends the new data back to the server.
(A sketch of this per-instance loop is given after Figure 3 below.)

Figure 1 shows the communication between the internet, the web crawler instances and the PostgreSQL server.
The choice of URLs to be fetched is made by one SQL query and two PL/pgSQL additive scoring functions: one scores the URL according to its content, the other scores the URL according to the textual context in which it is linked. The programmer can thus easily implement any focused crawling strategy by modifying a single SQL fetch query and two scoring functions. The user can write them in PL/pgSQL in order to take advantage, for instance, of PostgreSQL regular expressions. In order to achieve even better performance, they might also write them in C and take advantage of PostgreSQL's dynamically loadable objects capability. Figures 2 and 3 show a scoring function in PL/pgSQL that calculates a weighted count of keywords occurring in the URL itself (Figure 2) or in the anchor text that links to it (Figure 3).

[Figure 1: Web crawler organisation (diagram).]

CREATE OR REPLACE FUNCTION
ScoreURL(url url) RETURNS bigint AS
$$
DECLARE
score INT;
normurl TEXT;
BEGIN
normurl=normalize(CAST(url AS text));
IF CAST(url_top(url) AS TEXT) ='fr' THEN
score=1;
ELSE
score=0;
END IF;
IF substring(normurl, 'keyword1') IS NOT NULL THEN
score=score+2;
END IF;
IF substring(normurl, 'keyword2') IS NOT NULL THEN
score=score+1;
END IF;
RETURN score;
END;
$$ LANGUAGE plpgsql;

Figure 2: A web crawler strategy written in PL/pgSQL: scoring URLs.

CREATE OR REPLACE FUNCTION
ScoreLink(context text) RETURNS int AS
$$
DECLARE
score INT;
normcontext TEXT;
BEGIN
normcontext=normalize(context);
score=0;
IF (substring(normcontext, 'keyword1') IS NOT NULL) THEN
score = score +1;
END IF;
IF (substring(normcontext, 'keyword2') IS NOT NULL) THEN
score = score +1;
END IF;
RETURN score;
END;
$$ LANGUAGE plpgsql;

Figure 3: A web crawler strategy written in PL/pgSQL: scoring links.
(Figure 3).

Each crawler instance is only responsible for downloading


and processing web pages. The downloading stage is per-
5
http://www.postgresql.org/

formed by the very mature GNU/Wget utility6 . The database Depth # crawled % URLs % URLs
URLs covered (a) covered (b)
system is responsible for the coordination of multiple crawlers
0 2 0.00 0.00
(thanks to SQL transactions), uniqueness of stored URLs 1 34 0.08 1.00
and links (thanks to SQL constraints), crawling strategy 2 1026 0.73 4.00
(thanks to PL/pgSQL or C functions), etc. Insertions into 3 8543 1.84 8.00
a single SQL view triggers insertions into the more complex 4 56883 3.06 12.00
internal table structure. 5 368247 7.33 27.00
6 2756671 15.28 40.00

3. USE CASE: COVERAGE OF "TWEETED" Table 1: Tweeted URLs’ coverage. (a): for all 4777
URLS tweeted URLs ; (b): for the top 100 most frequently
3.1 Context tweeted URLs. “Depth” is the minimum number of hy-
Recent open free network visualisation tools have made eas- perlinks that one has to follow to reach an URL from the
ier the qualitative analysis of large social networks[1]. Based initial set.
on these tools, scientists in humanities can visualize large re-
lational data which lead to new hypothesis that will require
further network crawling and data extraction. We show an
example of such interaction between humanities and com-
26638 of those tweets contained a shortened URL (28.4%)
puter scientists made possible by our YeSQL crawler.
from a set of 10447 unique shortened URL corresponding to
4777 unique effective URLs.
Political scientists have formulated the hypothesis that for
the 2012 French presidential elections, candidates’ commu-
This filtering produced a homogeneous corpus based on a
nication departments accepted Twitter as a target media
usage logic and identical annunciation rules. The reference
and integrated it to their communication system.
to candidates’ addresses produces a multi-voiced discourse
folded up on the proper space of Twitter. Each “tweeted”
Their strategy was to better control their communication
URL is functioning as an interface with the outside of this
and to improve the dissemination of political messages they
space and brings back external information from the media
convey, in order to influence public opinion. What was at
space. Their identification is important as a marker of dis-
stake ? The saturation and the meshing of the media sphere,
course evolution and also for its anchorage in the media and
with coherent messages whatever the channel of dissemina-
political topicality .
tion they choose.
Independently from this collection, we started a web crawler
The empowerment of their communication during the cam-
instance that was allowed to download 20 pages in parallel,
paign was linked to their capacity :
from february 20th at 00:00am to february 26th at 10:55pm.
It was initiated on 32 initial URLs from newspapers’ polit-
• to consolidate their network of opinion leaders thanks ical pages and candidates’ web sites and collected over 2.7
to Twitter, millions of URLs. In the following tables, we call ”depth”
the minimum number of links needed to navigate from an
• to be more reactive and to communicate “just in time” initial URL (depth=0) to a crawled URL.
if unexpected events occur,
Table 1 shows the proportion of tweeted effective URLs that
• to strengthen the efficiency of their activists network.
were crawled during this period. The fourth column shows
that most frequently tweeted URLs are more likely to be
As a consequence, the relationships between their different covered by the crawl. These results show that most popular
communication devices has to be analysed. URLs have a significant probability to be directly retrieved
by the crawler after millions of URLs have been crawled.
3.2 Experiment Figure 4 shows the proportion of tweeted URLs found in the
In order to evaluate this hypothesis, we conducted a cap-
crawling per tweeted frequency (number of times that the
ture of Twitter’s messages and a parallel though indepen-
URL was tweeted). This gives an estimation of the crawling
dent web crawl of candidate web sites and newspaper’s po-
coverage with regard to URL’s visibility.
litical pages. We then attempted to compare the two data
sources. Twitter’s markers (e.g. ’#’ and ’@’) facilitates the
Table 2 shows that the similar problem of tweeted domains
production of statistics on a given collection. Regarding the
instead of tweeted URLs is substantially easier. Indeed, the
web, drawing statistics require a very well structured crawl,
coverage is noticeably higher when only the URL’s domains
with good identification of identical URL and page contents.
are considered. In particular, 100 most tweeted domains are
The YeSQL web crawler proved to be well suited to this task.
almost totally (97.73%) covered by the web crawl.
By filtering tweets from candidates, to candidates or men-
More generally, we can observe that high “domain” coverage
tioning a candidate (e.g. @fhollande, @bayrou, @melan-
figures are obtained for relatively low “depth” levels. This
chon2012, @SARKOZY 2012, etc.), we recorded 93592 tweets
suggests that the most popular URLs originates from sites
from february 6th at 00:00am to february 13th 2012 at 00:00am.
that are the nearest neighbours of the 32 initial newspapers’
6
http://www.gnu.org/software/wget/ political pages and candidates’ web sites.

[Figure 4: Crawler coverage per tweeted URL frequency (plot).]

Depth   # crawled domains   % domains covered (a)   % domains covered (b)
0       1                   0.00                    0.00
1       31                  2.50                    18.18
2       95                  4.43                    29.55
3       312                 11.93                   50.00
4       1596                27.73                   81.82
5       8137                49.66                   95.45
6       45992               72.50                   97.73

Table 2: Tweeted URLs' domain coverage. (a): for all 4777 tweeted URLs; (b): for the top 100 most frequently tweeted URLs. "Depth" is the minimum number of hyperlinks that one has to follow to reach a URL from the initial set.

This is not surprising considering that the web of political blogs is stable over periods of months [2]. Moreover, all the main French newspapers offer a blog service to their readers. The readers' contributions to these websites allow them to capture most of the queries on the web dealing with politics.

Results in Table 2 also allow us to expect much better coverage of URLs by simply launching more crawler instances, on a single machine or on multiple machines.

4. CONCLUSION
The web crawler we presented does not offer all the functionality of older and more ambitious projects such as Nutch and Heritrix. However, we have shown that recent PostgreSQL features concerning data structures, triggers and procedural languages make it possible to develop powerful web mining tools that can deal with highly redundant data as well as with less frequent signals. We illustrated this with a scalable crawler that can explore web networks at a fine-grained level. In particular, this crawler can help in comparing the web to social networks such as Twitter.

In this particular configuration and for this domain, current events around the French electoral campaign irrigate the two information spaces, the web and Twitter. The practice of "tweeting" URLs has become usual in the context of modern approaches to information reporting and monitoring.

As we entered this field of investigation by studying the political "actors", we saw that a significant part of the original information is produced, published and tweeted by these actors.

We could also question the existence of significant reporting practices outside the control of the political apparatus' dissemination strategies. If our results are confirmed by finer-grained analysis, we will be able to reconsider the self-organising hypothesis that people tend to associate with social networks.

5. REFERENCES
[1] W. W. Cohen and S. Gosling, editors. Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM 2010, Washington, DC, USA, May 23-26, 2010. The AAAI Press, 2010.
[2] J.-P. Cointet and C. Roth. Local networks, local topics: Structural and semantic proximity in blogspace. In Cohen and Gosling [1].
[3] G. M. Roy. Perspectives on NoSQL. In PGcon 2010: PostgreSQL Conference for Users and Developers, Ottawa, Canada, 2012.

From Puppy to Maturity: Experiences in Developing Terrier

Craig Macdonald, Richard McCreadie,


Rodrygo L.T. Santos, and Iadh Ounis
[email protected]
School of Computing Science
University of Glasgow
G12 8QQ, Glasgow, UK

ABSTRACT to provide state-of-the-art efficient indexing and effective re-


The Terrier information retrieval (IR) platform, maintained trieval mechanisms. For example, due to its modularity,
by the University of Glasgow, has been open sourced since Terrier allows various ways of changing the ranking of doc-
2004. Open source IR platforms are vital to the research uments, providing a huge variety of weighting models, in-
community, as they provide state-of-the-art baselines and cluding Okapi BM25 [17], language modelling [10, 23] and a
structures, thereby alleviating the need to ‘reinvent the wheel’. vast number of models from the Divergence from Random-
Moreover, the open source nature of Terrier is critical, since ness framework [2]. It also includes field-based document
it enables researchers to build their own unique research on weighting models for tackling more structured documents,
top of it rather than treating it as a black box. In this such as BM25F [22] or PL2F [13]. As a result, Terrier is
position paper, we describe our experiences in developing well known within IR evaluation forums such as TREC and
the Terrier platform for the community. Furthermore, we FIRE as the basis of many effective systems.
discuss the vision for Terrier over the next few years and Over the last decade, the Terrier platform has been devel-
provide a roadmap for the future. oped and enhanced by a wide range of academics world-wide.
Indeed, contributions made to Terrier have been driven by
the desire to explore new research directions, and will remain
Categories and Subject Descriptors so in the future. For instance, Terrier is currently being ex-
H.3.3 [Information Storage and Retrieval]: Information tended to tackle the new challenges of real-time search tasks,
Search and Retrieval e.g. incremental indexing, live search and reproducibility of
experimentation in such dynamic search scenarios. However,
open source platforms require significant investments in time
1. INTRODUCTION and manpower to develop and maintain. While such invest-
The origins of the Terrier open source information re- ments may not directly lead to research outcomes, they are
trieval (IR) platform1 within the University of Glasgow can of utmost importance for the research community. For this
be traced back to 2001, when it was first created in Java reason, in this paper, we detail our experiences in devel-
to provide a common basis for research students to use for oping Terrier and provide a roadmap for future releases of
their PhD research. Since then, the platform has grown to the platform. In this way, the contributions of this position
be a scalable and mature open source platform released un- paper are two-fold: we describe the growth of the Terrier
der the Mozilla Public License (MPL)2 , aimed at researchers platform along the past decade and its latest developments,
and practitioners, permitting the rapid and effective research followed by a roadmap for the next generation of the open
and development of information retrieval technologies. source search platform.
Two of the key goals of Terrier are to be flexible and ex- This paper is structured as follows. Section 2 discusses the
tensible, such that the platform can act as a corner-stone philosophy that guides the development of Terrier. Section 3
upon which both the academic community and practition- describes recent developments of the platform, whereas Sec-
ers can build. To this end, Terrier follows a modular design, tion 4 provides a roadmap for future developments. Finally,
whereby different components of the indexing and retrieval Section 5 provides our concluding remarks.
process can be customised. For instance, when working with
Twitter, it can be advantageous to use a dedicated tokeniser 2. BUILDING FOR SUCCESS
that removes URLs and mentions from the text. By combin- Our overriding belief behind Terrier is that an information
ing a modular design with an open source nature, Terrier can retrieval (IR) system ‘should just work... out-of-the-box.’
be adapted for a potentially unlimited number of use-cases. Many users will come to Terrier, and their first experience
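Terrier is written in Java; purely as a language-neutral illustration of what such a dedicated tweet tokeniser does, the following few lines of Python strip URLs and @-mentions before splitting the remaining text into terms.

import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def tokenise_tweet(text):
    # Drop URLs and @-mentions, then split on non-word characters.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return [tok.lower() for tok in re.split(r"\W+", text) if tok]

print(tokenise_tweet("@sigir Terrier 4 is out! http://terrier.org #opensource"))
# ['terrier', '4', 'is', 'out', 'opensource']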
However, it is also important to provide as much func- using the platform is key—we want to ensure that the crucial
tionality out-of-the-box as possible. Indeed, Terrier strives first experience facilitates the aim that they have for using
1 the platform. In particular, we identify four dimensions that
http://terrier.org are key to the user experience, namely:
2
http://www.mozilla.org/MPL/
• Effectiveness: The platform should be effective by de-
fault. Moreover, it should provide easy access to the
Copyright is held by the author/owner(s).
SIGIR 2012 Workshop on Open Source Information Retrieval. accepted state-of-the-art techniques, permitting exper-
ACM August 16, 2012, Portland, Oregon, USA. iments to be conducted with minimal development cost.

• Efficiency: Scientific experiments in information retrieval have advantages over other scientific fields, in that experimental ground truths (i.e. relevance assessments) are generally reusable. While the efficiency of an IR research platform is not paramount, we believe that the system should be generally efficient so as to facilitate fast experiment iterations.
• Scalability: The size of IR test corpora has grown 500-fold since the first open source release of Terrier, as illustrated in Figure 1. Regardless of the scale of hardware resources available to a researcher, we believe that the IR system should be able to index and retrieve without challenging configuration.

[Figure 1: Evolution of the size of TREC corpora: number of documents (log scale) per year, 1992-2010, from TREC Disks 1&2 and 4&5 through WT2G, WT10G, GOV, GOV2, Blogs06, Blogs08 and ClueWeb09.]
ing [9], which was released as part of Terrier 2.0. The idea
• Adaptability: It must be possible to adapt the system
behind single-pass indexing is that a central index structure
to new requirements, whether this encompasses new
can be built a document-at-a-time in a ‘single-pass’ over the
retrieval strategies, new corpora, or different experi-
collection. Moreover, this index can be created under tight
mental paradigms.
memory constraints by compressing the inverted files and
These dimensions (which we collectively denote EESA) periodically writing partial posting-lists to disk, facilitating
underlie the decisions behind the development of the Terrier indexing on a single machine. Single-pass indexing is a fast
platform, and the plans for its future. In particular, with and efficient means to index smaller (by today’s standards)
Terrier, we aim to ensure that indexing and retrieval can collections, spanning 100 million documents or less.
be effectively performed by making efficient use of whatever However, for larger collections such as the TREC ClueWeb-
resource is available, be it a single machine or a large cluster. 09 corpus that is comprised of 1.2 billion documents,3 single-
An important choice in Terrier has centred around the pass indexing is a slow process requiring weeks of processor
effectiveness/efficiency/adaptability tradeoff. In particular, time to complete on a single machine. MapReduce is a pro-
IR researchers may not know a priori which particular weight- gramming paradigm for the processing of large amounts of
ing model they intend to use during retrieval time. Indeed, data by distributing work tasks over multiple processing ma-
some retrieval approaches decide on the choice of weighting chines [6]. The central concept underpinning MapReduce is
model on a per-query basis (e.g. [8]). For this reason, we that many data-intensive tasks are based around performing
do not believe that efficient retrieval approaches (e.g. score- a map operation with a simple function over each ‘record’ in
at-a-time [3]) that tie the index to a particular retrieval ap- a large dataset. Terrier 2.2 became the first open source IR
proach are suitable for an experimental IR platform, even if platform to implement large-scale parallelised indexing us-
they can produce marked efficiency improvements. ing MapReduce [14] and its Java implementation Hadoop.4
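The MapReduce idea can be illustrated independently of Hadoop: a map function emits per-document postings and a reduce function gathers them into posting lists. The sketch below is a single-process Python simulation of this concept, not Terrier's Hadoop-based indexer.

from collections import defaultdict

def map_document(doc_id, text):
    # Map step: emit (term, (doc_id, term_frequency)) pairs for one document.
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    return [(term, (doc_id, tf)) for term, tf in counts.items()]

def reduce_postings(term, values):
    # Reduce step: collect all (doc_id, tf) pairs for a term into one posting list.
    return term, sorted(values)

docs = {1: "open source information retrieval", 2: "open source search platform"}
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, posting in map_document(doc_id, text):
        grouped[term].append(posting)
index = dict(reduce_postings(t, v) for t, v in grouped.items())
print(index["open"])   # [(1, 1), (2, 1)]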
On the other hand, there are software engineering chal- This has enabled Terrier to scale its indexing process to ef-
lenges in maintaining and improving a large platform such ficiently tackle collections spanning billions of documents or
as Terrier. For instance, since Terrier 3.0 we have been fol- more, given a suitable cluster of machines.
lowing a testing regime that ensures that changes do not im- Furthermore, it is not just the scaling of existing IR pro-
pact on correctness (unit tests) and effectiveness (end-to-end cesses that is of interest. To facilitate new search tasks,
tests). In the following, we identify some specific improve- the indexing process needs to be flexible and extensible. In-
ments made to the Terrier over the last few years and how deed, this is especially true as the focus of IR continues to
these are related to the EESA dimensions, before going on move from traditional web documents to more specialised
to identify a roadmap for further developments of Terrier. domains, such as the search of social media and other user-
generated content sources [21]. To this end, the open source
and modular nature of Terrier allows for the addition of in-
3. RECENT IMPROVEMENTS dexing capability for new collections with specialised struc-
In this section, we highlight recent improvements in Ter- tured documents, in addition to custom tokenisation, stop-
rier, covering both its indexing and retrieval architectures. word removal and stemming. For example, we released a
custom package for Terrier 3.0 to support the processing
3.1 Indexing of JSON tweets outside the normal release cycle.5 More-
Since the inception of Terrier, the ability to quickly pro- over, the entire indexing process can be extended to tackle
duce compressed index structures representing collections of entirely new search tasks. For instance, the ImageTerrier6
documents has been critical. However, over the last decade, project enhanced Terrier with image retrieval functionality.
the scale of the document collections of interest, the speci- Recent developments have focused on enhancing the de-
fications of commodity hardware used for indexing and the ployment and demonstration capabilities of Terrier. From
search tasks that are being investigated have dramatically an indexing perspective, this involves efficiently storing doc-
changed. For instance, Figure 1 illustrates how the size of ument metadata for later display to the user. Terrier 3.5
IR test collections has grown over a 17 year period. These supports automated metadata extraction and snippet gen-
changes have and continue to introduce new indexing chal-
3
lenges that have driven the continual enhancement of Ter- http://boston.lti.cs.cmu.edu/Data/clueweb09
4
rier’s indexing processes. http://hadoop.apache.org
5
The basis for much of the indexing functionality within http://ir.dcs.gla.ac.uk/wiki/Terrier/Tweets11
6
Terrier stems from the development of single-pass index- http://www.imageterrier.org

of query-dependent features with a single pass over the in-
verted index at retrieval time. The latter is enabled by an
improved matching mechanism that keeps track of the actual
postings for documents that might end up among the top re-
trieved. Another valuable source of evidence, which conveys
how a document is described by the rest of the Web, is the
anchor text of the incoming hyperlinks to this document.
Integrating anchor-text extraction to the indexing pipeline
of Terrier will provide a unified solution for leveraging this
rich evidence as an additional feature for effective retrieval.

Non-global configuration (adaptability).


An important direction for improving the adaptability of
Terrier is to enable multiple instances of its indexing and
retrieval pipelines to run concurrently. A crucial develop-
ment in this direction is a configuration system that admits
Figure 2: Terrier user interface for Twitter search.
non-global, instance-specific setups. For instance, a typical
search scenario that may require multiple instances of a re-
eration that can be saved within a specialist meta index trieval process is search result diversification. In particular,
structure. This functionality can be paired with the Terrier effective diversification can be attained by ensuring that the
front-end search interface to facilitate custom search appli- produced ranking covers multiple aspects of an ambiguous
cations, such as Twitter search, as illustrated in Figure 2. query, represented as query reformulations [18, 19].
3.2 Retrieval
Dynamic pruning (scalability and adaptability).
When retrieving from large corpora in constrained mem-
Dynamic pruning strategies, such as WAND [4], can in-
ory situations, it is possible that the system does not have
crease efficiency by omitting the scoring of documents that
sufficient memory available to decompress the entire post-
can be guaranteed not to make the top-k retrieved set—
ing list for common query terms. From Terrier 3.0, we re-
a feature known as safeness. These pruning strategies rely
designed the retrieval mechanism such that the posting list
on maintaining a threshold score that documents must over-
for each term is decompressed in a streaming fashion, using
come in order to be considered in the top-k documents. Each
an iterator design pattern. Moreover, this had an additional
term is associated with an upper bound stating the maxi-
benefit in permitting support for Document-at-a-Time re-
mal contribution of the weighting model to any document’s
trieval (DAAT) strategies. Indeed, DAAT retrieval strate-
relevance score. By comparing upper bounds on the scores
gies are advantageous in that the number of document score
of the terms that have not been scored to the threshold,
accumulators is markedly reduced compared to Term-at-a-
i.e. the current k-th document, the pruning strategy de-
Time (TAAT) scoring, ensuring an efficient retrieval process.
cides on to whether to skip the scoring of documents during
On the internationalisation front, we have been working
retrieval. In addition, in WAND, skipping is also supported
on extending Terrier to index and retrieve from East Asian
by underlying posting list iterators in order to reduce disk
corpora. In particular, in Terrier 3.5, we refactored the way
IO and further increase efficiency. However, traditionally in
documents are processed, such that text parsing and tokeni-
the literature, the upper bounds are pre-calculated at index-
sation are now fully separated operations. As a result of
ing time and stored in the index. As a result, the generated
this refactoring, Terrier now supports pluggable tokenisers
index is only suited for efficient retrieval using one partic-
for different languages, adding to the overall adaptability
ular weighting model, where all the returned search results
of the platform. In a first test of the new tokenisation ar-
of queries are ranked using a single weighting model. In-
chitecture, Terrier delivered out-of-the-box state-of-the-art
stead, in Terrier, we avoid tying up the generated index to a
retrieval performance on news and web corpora for both
particular weighting model, by deploying safe upper bound
Chinese and Japanese [20].
approximations for various retrieval models, which can be
calculated on-the-fly at query execution time [12]. By doing
4. ROADMAP FOR TERRIER so, Terrier supports the recent trend of deploying selective
In the following, we highlight new functionalities devel- retrieval approaches, where the retrieval strategy varies from
oped for Terrier, which we plan to release in future open a query to another [7, 19].
source versions. Each of these functionalities is key to im-
proving one or more dimensions of EESA within the Terrier Distributed and real-time search (scalability).
platform. In particular, the massive scale and heterogeneity While a MapReduce indexing process can efficiently index
of current corpora and the increasingly complex informa- the 1.2B documents of the ClueWeb09 corpus, retrieval from
tion needs of search users limits the effectiveness of tradi- a monolithic single index shard is not efficient. Document-
tional ranking approaches based on a single feature. Instead, partitioned distributed indexing [5] ensures that efficient re-
effective retrieval is increasingly moving towards machine- trieval can be attained regardless of the index size. Re-
learned ranking functions combining multiple features [11]. cently, search in real-time scenarios such as live search over
tweets have become popular [21]. Real-time search poses dif-
Feature-based retrieval (effectiveness and efficiency). ferent challenges to traditional retrospective retrieval tasks.
Terrier will support the extraction of query-independent In particular, documents are not contained within a static
features at indexing time, as well as the efficient extraction corpus, but rather arrive over time in a streaming fashion.

62
Moreover, both indexing and retrieval operations occur in 6. REFERENCES
parallel, hence index structures must always be searchable, [1] M.-D. Albakour, C. Macdonald, I. Ounis,
thread safe and up-to-date. Furthermore, from a research A. Pnevmatikakis, and J. Soldatos. SMART: An open
perspective, there is a need to support the ‘replaying’ of a source framework for searching the physical world. In Proc.
stream, facilitating reproducible experimentation. The de- of OSIR, 2012.
[2] G. Amati. Probability models for information retrieval
velopment of real-time search within Terrier is well advanced
based on Divergence From Randomness. PhD thesis,
and is targeted for merging into the next major open source University of Glasgow, 2003.
release. Terrier’s real-time infastructure is also being further [3] V. N. Anh and A. Moffat. Pruned query evaluation using
developed as part of the SMART EC project to enable low pre-computed impacts. In Proc. of SIGIR, pages 372–379,
latency distributed indexing and retrieval [1].7 2006.
[4] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and
Crowdsourcing for relevance assessment (effectiveness). J. Zien. Efficient query evaluation using a two-level
retrieval process. In Proc. of CIKM, pages 426–434, 2003.
Researchers rely on document relevance assessments for
[5] F. Cacheda, V. Plachouras, and I. Ounis. A case study of
queries to gauge the effectiveness of their systems. Eval- distributed information retrieval architectures to index one
uation forums such as TREC and CLEF play a key role terabyte of text. IP&M, 41(5):1141–1161, 2005.
by providing relevance assessments for many common tasks. [6] J. Dean and S. Ghemawat. MapReduce: Simplified data
However, these venues cannot cover all of the collections and processing on large clusters. In Proc. of OSDI, pages
tasks currently investigated in IR, resulting in the burden of 137–150, 2004.
relevance assessment generation falling upon individual re- [7] X. Geng, T.-Y. Liu, T. Qin, A. Arnold, H. Li, and H.-Y.
searchers. This is an important problem, as relevance assess- Shum. Query-dependent ranking using k-nearest neighbor.
In Proc. of SIGIR, pages 115–122, 2008.
ment generation is often a time-consuming, difficult and po-
[8] B. He and I. Ounis. A query-based pre-retrieval model
tentially costly process. For many IR-related tasks, crowd- selection approach to information retrieval. In Proc. of
sourcing has been shown to be a fast and cheap method RIAO, pages 706–719, 2004.
to generate relevance assessments in a semi-automatic man- [9] S. Heinz and J. Zobel. Efficient single-pass index
ner [16], by outsourcing to a large group of non-expert work- construction for text databases. J. Am. Soc. Inf. Sci.
ers. CrowdTerrier is a soon to be released open source ex- Technol., 54(8):713–729, 2003.
tension to Terrier that leverages crowdsourcing to provide [10] D. Hiemstra. Using Language Models for Information
researchers with an out-of-the-box tool to achieve fast and Retrieval. PhD thesis, University of Twente, 2001.
[11] T.-Y. Liu. Learning to rank for information retrieval.
cheap relevance assessment [15].
Found. Trends Inf. Retr., 3(3):225–331, 2009.
[12] C. Macdonald, I. Ounis, and N. Tonellotto. Upper-bound
Plugin Expansions (adaptability). approximations for dynamic pruning. ACM TOIS,
The growth of the Terrier platform into exciting new areas 29(4):1–28, 2011.
such as crowdsourcing entails increased functionality, but [13] C. Macdonald, V. Plachouras, B. He, C. Lioma, and
also platform complexity. To avoid software bloat, we are I. Ounis. University of Glasgow at WebCLEF 2005:
moving from a monolithic release structure, to a system of experiments in per-field normalisation and language specific
stemming. In Proc. of CLEF, pages 898–907, 2006.
periodic core releases and timely plugin expansions. This
[14] R. McCreadie, C. Macdonald, and I. Ounis. MapReduce
will enable Terrier to continue to grow and evolve to tackle indexing strategies: Studying scalability and efficiency.
future challenges in the dynamic field of IR in line with the IP&M, 2011.
interests of the community. [15] R. McCreadie, C. Macdonald, and I. Ounis. Crowdterrier:
Automatic crowdsourced relevance assessments with terrier.
5. CONCLUSIONS In Proc. of SIGIR, 2012.
[16] R. McCreadie, C. Macdonald, and I. Ounis. Identifying top
In this paper, we described the philosophy that has guided
news using crowdsourcing. Information Retrieval, 2012.
the development of the Terrier IR open source platform since [17] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu,
its first release in 2004. We described some recent develop- and M. Gatford. Okapi at TREC-3. In Proc. of TREC,
ments in the Terrier IR platform, as well as a comprehensive 1994.
roadmap for its forthcoming releases, intended to ensure that [18] R. L. T. Santos, C. Macdonald, and I. Ounis. Exploiting
the platform remains extensible and effective, while provid- query reformulations for web search result diversification.
ing a robust, modern and efficient grounding for building In Proc. of WWW, pages 881–890, 2010.
next generation search engine technologies. The last decade [19] R. L. T. Santos, C. Macdonald, and I. Ounis. Selectively
diversifying Web search results. In Proc. of CIKM, pages
has witnessed a dramatic shift in the scale and nature of ex-
1179–1188, 2010.
periments IR researchers are increasingly being required to [20] R. L. T. Santos, C. Macdonald, and I. Ounis. University of
conduct to test and evaluate new search technologies. Our Glasgow at the NTCIR-9 intent task. In Proc. of NTCIR,
vision for Terrier is to continue empowering researchers and 2011.
practitioners in IR with up-to-date, effective and scalable [21] I. Soboroff, D. McCullough, J. Lin, C. Macdonald, I. Ounis,
tools, allowing them to build and evaluate the next gener- and R. McCreadie. Evaluating real-time search over tweets.
ation IR applications. We hope that many will join us in In Proc. of ICWSM, 2012.
working together towards such an objective. [22] H. Zaragoza, N. Craswell, M. J. Taylor, S. Saria, and S. E.
Robertson. Microsoft Cambridge at TREC 13: Web and
Hard tracks. In Proc. of TREC, 2004.
Acknowledgements [23] C. Zhai and J. Lafferty. A study of smoothing methods for
The authors express their gratitude to all members of the language models applied to information retrieval. ACM
Terrier community who have contributed to the platform TOIS, 22(2):179–214, 2004.
over the last decade.
7
http://www.smartfp7.eu/content/
twitter-indexing-demo

63
Yet another comparison of Lucene and Indri performance

Howard Turtle Yatish Hegde Steven A. Rowe


Center for Natural Language Center for Natural Language Center for Natural Language
Processing Processing Processing
Syracuse University Syracuse University Syracuse University
Syracuse, NY 13078 Syracuse, NY 13078 Syracuse, NY 13078
[email protected] [email protected] [email protected]

ABSTRACT any change. We present the results of both sets of experi-


We present results that compare the performance of Lucene ments here.
and Indri at two points in time (2009 and 2012) using data
from TREC 6 through 8. We compare indexing throughput, 2. PRIOR WORK
index size, query evaluation throughput, and retrieval effec- Lin [3] compared the performance of Lucene and Indri
tiveness. We also examine the degree to which the results as part of a study of the impact of retrieval quality on the
produced by the two systems overlap with an eye toward performance of question answering systems and concluded
estimating the performance increase that might be expected that there was no significant difference between the ranking
by combining the results of the two systems. quality of the two systems. Lin used relatively long queries,
which may account for the performance similarity as Indri
Categories and Subject Descriptors performance is known to degrade with query length. This
study also uses early versions of Indri and Lucene.
H.3 [Information Storage and Retrieval]: Search Pro-
Middleton and Baeza-Yates [4] conducted an early survey
cess; H.3 [Information Storage and Retrieval]: Content
of the features of 17 search engines and conducted exten-
Analysis and Indexing
sive performance tests with 12 of those engines, including
Lucene and Indri. Their tests used TREC data (Disk 4,
General Terms WT10g) and reported indexing time, index size, query eval-
Open source search engines, information retrieval, perfor- uation time (one and two word queries), and retrieval effec-
mance evaluation, fusion of search results tiveness although not all engines participated in all of the
tests. The Middleton and Baeza-Yates study used early ver-
1. INTRODUCTION sions of Lucene (1.9.1) and Indri (2.4); both engines have
changed significantly since their study.
We have used a number of open source and proprietary Perea-Ortega et al [5] compare the ranking performance of
search engines to support research at the Center for Natural three retrieval systems (Lucene, Lemur, and Terrier) when
Language Processing often combining the results from mul- used in a Geographical Information Retrieval (GIR) system.
tiple engines to good effect [2]. The two engines we most They used the GeoCLEF 2007 data and ran both mono- and
often use are Lucene/Solr (http://lucene.apache.org/solr/) bilingual queries. They conclude that Lemur works best
and Indri [6]. for monolingual queries and that Terrier works better for
Lucene/Solr is attractive because it is a relatively full fea- bilingual queries.
ture package that makes it easy to field Web-based appli- Armstrong et al [1] compared the retrieval effectiveness of
cations. Indri is attractive because it offers better search five search engines, including Indri and Lucene (version 2.4),
results and because it offers a highly expressive query lan- using TREC data from 1994 to 2005. Queries were based
guage that allows very fine grained control of a search. The on title plus description fields, similar to the long query ex-
engines are also natural choices because we are familiar with periments described in Section 3. They found a somewhat
them. One author (Rowe) is a Lucene contributor and chair smaller difference between the two systems than reported
of the Lucene Project Management Committee. A second here – for TREC 6 and TREC 8 data they report that
author (Turtle) has contributed to the development of Indri Lucene’s Mean Average Precision (MAP) scores are 3% to
and Lemur. 4.5% lower than Indri whereas our experiments show MAP
In order to test the assertion that Indri produced bet- scores to be 5.6% lower. Differences in the Lucene version
ter rankings, to assess the likelihood that combining Lucene used and details of the experimental setup (e.g., stopwords,
and Indri results would improve overall performance, and stemmer) likely account for the difference.
to improve our understanding of the relative performance
of the two engines we ran a series of experiments in 2009
to compare performance on TREC data. Both engines have
3. EXPERIMENTS
evolved since 2009 so we reran the tests this year to evaluate Two sets of experiments were run to compare Indri with
Lucene/Solr performance at two points in time. The first
Copyright 2012 by the authors. set, originally run in October of 2009 but repeated on more
SIGIR 2012 Workshop on Open Source Information Retrieval, August 16,
2012, Portland, Oregon, USA modern hardware, compares the versions of Indri and Lucene
ACM X-XXXXX-XX-X/XX/XX. that were current at the time. The second set compares the

64
TREC TREC the two versions by about 24%. Indri indexing is faster than
Disk 4 Disk 5 Total Solr for both experiments but the difference is greatly re-
Number of documents 293,710 262,367 556,077 duced, Solr indexing was slower by a factor of 1.7 in 2009
Collection size (Mb) 1,194 945 1,344 but only by a factor of 1.2 in 2012.
Number of queries 150 150
4.2 Query evaluation
Table 1: Collection statistics Query evaluation times are shown in Table 3. There is lit-
tle difference in the query throughput of the two engines for
short queries and no change in performance between 2009
versions of Indri and Lucene that were current in June of and 2012 for short queries. For long queries there are sig-
2012. Out-of-the-box settings were used for both systems nificant differences. Indri is significantly slower than Lucene
with no tuning or special query formulation. for long queries. The increase in time for evaluating long vs
While our focus is on the performance of the Indri and short queries is between 20 and 25 for Indri (long queries are
Lucene search engines, the experiments are run using their roughly 7 times longer than short queries) and only a fac-
respective wrappers, Lemur and Solr. In 2009, the current tor of 2 for Lucene. Indri query evaluation for long queries
version of Lemur was 4.10 which used Indri version 2.10. slowed between 2009 and 2012.
The current version of Solr was 1.4 which used Lucene ver- Note that for these tests the entire index for each of the
sion 2.9.1. By June 2012, the Lemur software had been systems was cached by the operating system as the test ma-
repackaged so that the wrapper software and search engine chine was equipped with 16Gb of memory and the combined
were combined in a single distribution, Indri 5.3. In 2012, size of the two indexes is only 5.2 Gb. Each system was run
Solr and Lucene remained separate packages but the version once to prime the OS cache then run 3 to 5 times to gather
numbers had been aligned so the current version of Solr was timings. Comparing performance when the the collections
3.6 which used Lucene version 3.6. must be read from disk is for future work.
We collected performance information on indexing speed, Indri is much more processor intensive than Lucene. Dur-
index size, query evaluation times for two query sets, ranking ing the experiments Indri used essentially 100% of a CPU
performance for those queries (using trec eval), and overlap core whereas Lucene used roughly 50%. Indri generally used
between the results produced by the two systems. All ex- less memory – for the 2012 versions running long queries In-
periments were run on an Intel(R) Xeon(R) CPU E5335 @ dri used up to 45Mb of memory whereas Lucene used 150
2.00GHz (four cores) running Debian Squeeze v6.0.4 and to 300Mb.
Java 1.6 (Oracle). While the test system was a four core
system, all tests were single threaded. The test system is 4.3 Retrieval effectiveness
equipped with 16Gb of memory but both Indri and Solr Retrieval effectiveness results are shown in Tables 4 (2009)
were only given 1Gb. and 5 (2012). For short queries, Indri produces a signifi-
cantly better ranking. Performance as measured by MAP is
3.1 Data 44% less for Lucene. Using precision at fixed ranks of 10 and
We used a single data set consisting of TREC disks 4 and 20, Lucene performance is roughly 30% lower, using bpref
5 for both sets of experiments. The Porter stemmer and the Lucene is 26% lower. For long queries, the differences are
default Solr stop word list (35 words) was used for both the smaller but Indri still produces noticeably better rankings –
Indri and Lucene collections. Collection statistics are shown Solr is 16% lower using MAP and 14% lower using bpref.
in Table 1. The change in retrieval effectiveness between 2009 and
2012 is shown in Tables 6 (Indri) and 7 (Solr). For both sys-
3.2 Queries tems the change is small. Indri showed no change for short
Two sets of queries were generated from TREC topics 301- queries and mixed results for long queries (slight increase in
450 (TREC 6 through 8). The first query set (short queries) P10 and P20). Solr showed small improvements for short
consists of the text from the title element of the TREC top- queries and mixed results for long queries.
ics. The short queries average 2.6 words per query. The
second set (long queries) consists of the text from both the 4.4 Overlap
title and description elements with an average length of 18.7
words per query. The queries were completely unstructured
Short queries Long queries
and made no use of proximity or other special query lan-
guage features. Indri 5.3 Solr 3.6 Indri 5.3 Solr 3.6
P(in OL) 0.2076 0.4653
4. RESULTS P(+|in OL) 0.4824 0.4688
P(-|in OL) 0.4363 0.4546
4.1 Indexing P(?|in OL) 0.0813 0.0767
The results of the indexing experiment are shown in Ta- P(+) 0.3721 0.2587 0.4167 0.3813
ble 2. The index sizes remained the same for both systems P(-) 0.5112 0.5967 0.5127 0.5547
between 2009 and 2012. Both systems produce indexes that P(?) 0.1167 0.1447 0.0707 0.0640
are roughly twice the size of the source file. Indri produced P(+|not OL) 0.3577 0.2057 0.3596 0.3040
a more compact index; the Solr index is roughly 17% larger P(-|not OL) 0.5253 0.6450 0.5697 0.6429
than the Indri index. The indexing time results are quite P(?|not OL) 0.1170 0.1493 0.0707 0.0531
different. Indri 5.3 indexing time increased from Indri 4.1
by about 10% whereas Solr indexing time decreased between Table 8: Overlap in top 10 ranks

65
2009 2012
Lemur 4.1 Solr 1.4 Indri 5.3 Solr 3.6
Index size (gigabytes) 2.4 (1.8x) 2.8 (2.1x) 2.4 2.8
Indexing time (sec) 863 1,461 942 1,113
Throughput (Mb/sec) 1.6 0.9 1.4 1.2

Table 2: Indexing results

2009 2012
Lemur 4.1 Solr 1.4 Indri 5.3 Solr 3.6
Short queries (sec) 10 13 10 13
(sec/query) 0.07 0.09 0.07 0.09
Long queries (sec) 200 25 251 25
(sec/query) 1.33 0.16 1.67 0.16

Table 3: Query evaluation times

Short queries Long queries


Lemur 4.1 Solr 1.4 Change Lemuri 4.1 Solr 1.4 Change
MAP 0.1951 0.1092 −44.1 0.2235 0.1840 −17.7
Precision at 10 0.3713 0.2573 −30.7 0.4053 0.3827 −5.6
Precision at 20 0.3247 0.2173 −33.1 0.3500 0.3220 −8.0
bpref 0.2219 0.1645 −25.9 0.2449 0.2081 −15.0

Table 4: Indr vs. Solr retrieval effectiveness (2009)

Short queries Long queries


Indri 5.3 Solr 3.6 Change Indri 5.3 Solr 3.6 Change
MAP 0.1948 0.1098 −43.6 0.2224 0.1856 −16.1
Precision at 10 0.3707 0.2607 −29.7 0.4167 0.3813 −8.5
Precision at 20 0.3243 0.2207 −31.9 0.3590 0.3183 −11.3
bpref 0.2219 0.1645 −25.9 0.2433 0.2087 −14.2

Table 5: Indr vs. Solr retrieval effectiveness (2012)

Short queries Long queries


Lemur 4.1 Indri 5.3 Change Lemur 4.1 Indri 5.3 Change
MAP 0.1951 0.1948 −0.2 0.2235 0.2224 −0.5
Precision at 10 0.3713 0.3707 −0.2 0.4053 0.4167 +2.8
Precision at 20 0.3247 0.3243 −0.1 0.3500 0.3590 +2.6
bpref 0.2219 0.2219 0.0 0.2449 0.2433 −0.7

Table 6: Change in Indri retrieval effectiveness over time

Short queries Long queries


Solr 1.4 Solr 3.6 Change Solr 1.4 Solr 3.6 Change
MAP 0.1092 0.1098 +0.5 0.1840 0.1856 +0.9
Precision at 10 0.2573 0.2607 +1.3 0.3827 0.3813 −0.4
Precision at 20 0.2173 0.2207 +1.6 0.3220 0.3183 −1.1
bpref 0.1645 0.1645 0.0 0.2081 0.2087 +0.3

Table 7: Change in Solr retrieval effectiveness over time

The overlap between the results produced by both sys- bers are important for two reasons. First, effectiveness re-
tems is shown in Tables 8 and 9 (+ means document judged sults can be biased in favor of a system that has been used
relevant, − means document judged not relevant, ? means extensively in the TREC experiments if many of the docu-
document not judged, OL means in overlap). These num- ments retrieved by the other system have not been judged

66
Short queries Long queries in only one of the two rankings. For long queries, about
Indri 5.3 Solr 3.6 Indri 5.3 Solr 3.6 half of the documents retrieved appear in only one ranking.
The overlap results also show that even simple strategies
P(in OL) 0.2356 0.4973
for combining results can yield significant improvements in
P(+|in OL) 0.4042 0.4026
retrieval effectiveness.
P(-|in OL) 0.5211 0.5325
P(?|in OL) 0.0747 0.0649
6. REFERENCES
P(+) 0.3274 0.2207 0.3590 0.3183
P(-) 0.5562 0.5880 0.5763 0.6133 [1] T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel.
P(?) 0.1163 0.1913 0.0647 0.0683 Has adhoc retrieval improved since 1994? In
P(+|not OL) 0.3026 0.1515 0.3093 0.2199 Proceedings of the 32nd international ACM SIGIR
P(-|not OL) 0.5750 0.6334 0.6283 0.7119 conference on Research and development in information
P(?|not OL) 0.1224 0.2151 0.0624 0.0681 retrieval, SIGIR ’09, pages 692–693, New York, NY,
USA, 2009. ACM.
Table 9: Overlap in top 20 ranks [2] J. W. Keeling, E. E. Allen, S. A. Rowe, A. M. Turner,
J. A. Merrill, E. D. Liddy, and H. R. Turtle.
Development and evaluation of a prototype search
and will therefore be treated as not relevant. The results in engine to meet public health needs. In Proceedings of
Table 8 suggest that this is not a factor in these experiments the American Medical Informatics Association, pages
– the number of unjudged documents is small for both sys- 693–700, 2011.
tems with between 85% and 90% of all documents judged [3] J. Lin. The role of information retrieval in answering
for short queries and between 90% and 95% for long queries. complex questions. In Proceedings of the
Second, they provide an indication of how much improve- COLING/ACL on Main conference poster sessions,
ment might be achieved by combining the results of the two COLING-ACL ’06, pages 523–530, Stroudsburg, PA,
systems. If the overlap is large then the combined result USA, 2006. Association for Computational Linguistics.
can have little increase in recall so the primary source of [4] C. Middleton and R. Baeza-Yates. A comparison of
improvement is the reordering of the documents based on open source search engines. Technical report,
combined score. If the overlap is smaller then the probabil- Universitat Pompeu Fabra (Barcelona, Spain), Oct.
ity that a randomly selected document from the overlap is 2007. Available at:
relevant is an indication of how well a simple voting strat- http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf.
egy might work. For example, in Table 8 roughly 20% of the [5] J. Perea-Ortega, M. Garcia-Cumbreras,
documents retrieved by the two systems with short queries M. Garcia-Vega, and L. Urena-Lopez. Comparing
were the same. The probability that a document retrieved several textual information retrieval systems for the
by both systems is relevant is 0.4824 which is significantly geographical information retrieval task. In
higher than the probability of relevance achieved by either E. Kapetanios, V. Sugumaran, and M. Spiliopoulou,
system individually (0.3721 for Indri and 0.2587 for Solr). editors, Natural Language and Information Systems,
volume 5039 of Lecture Notes in Computer Science,
5. CONCLUSIONS pages 142–147. Springer Berlin / Heidelberg, 2008.
[6] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft.
The results presented here allow direct comparison of the
Indri: A language-model based search engine for
two search engines. It also allows comparison of the changes
complex queries. Technical Report IR-407, CIIR,
in the two engines between 2009 and 2012.
Department of Computer Science, University of
Index size for the two engines did not change between
Massachusetts Amherst, 2005.
2009 and 2012. Indri produces a somewhat smaller index
(1.8 times as large as the source collection) than Lucene
(2.1 times). In terms of indexing throughput, Indri de-
clined between between 2009 and 2012 (from 1.6Mb/sec
to 1.4Mb/sec) whereas Lucene performance improved (from
0.9Mb/sec to 1.2Mb/sec). In 2012, Indri still enjoyed a slight
advantage over Lucene (1.4Mb/sec vs 1.2Mb/sec).
In terms of query throughput there is little difference be-
tween the two engines for short queries but Indri is signifi-
cantly slower than Lucene for long queries.
In terms of retrieval effectiveness, Indri results are signifi-
cantly better than Lucene results especially for short queries.
Using precision at rank 20, Lucene rankings are roughly
30% worse for short queries and roughly 10% worse for long
queries. Retrieval effectiveness did not change significantly
for either engine between 2009 and 2012, at least for the
simple queries used in these experiments.
The overlap results show that the documents retrieved
by the two engines are significantly different, especially for
short queries. Using the top ten documents retrieved, for
short queries roughly 80% of all documents retrieved appear

67
Phonetic Matching in Japanese

Michiko Yasukawa∗ J. Shane Culpepper† Falk Scholer†


∗ †
Gunma University, Gunma, Japan RMIT University, Melbourne, Australia
[email protected] {shane.culpepper, falk.scholer}
@rmit.edu.au

ABSTRACT and different spellings. These words are categorized into the
This paper introduces a set of Japanese phonetic matching following two types.
functions for the open source relational database PostgreSQL. • Synonyms – words with the same (or, nearly the same)
Phonetic matching allows a search system to locate ap- meaning, different spellings and different pronuncia-
proximate strings according to the sound of a term. This tion.
sort of approximate string matching is often referred to as
fuzzy string matching in the open source community. This • Spelling variation – words with the same meaning,
approach to string matching has been well studied in English different spellings and the same (or, nearly the same)
and other European languages, and open source packages pronunciation.
for these languages are readily available. To our knowledge,
there is no such module for the Japanese language. In this Synonyms are words that are similar in a semantic sense.
paper, we present a set of string matching functions based Approaches such as Latent Semantic Indexing (LSI) [2] and
on the phonetic similarity for modern Japanese. We have word clustering[9] can be used to alleviate this problem.
prototyped the proposed functions as an open source tool Spelling variants are slightly more difficult to classify, and
in PostgreSQL, and evaluated these functions using the test present a difficult problem in ranked document retrieval
collection from the NTCIR-9 INTENT task. We report our systems depending on keyword queries.
findings based on the evaluation results. Recent efforts to improve the effectiveness of IR systems
have included web search result diversification [10]. In order
to increase search effectiveness in this task, an innovative
Categories and Subject Descriptors solution to spelling variants is needed. The goal of this paper
H.3.3 [Information Storage and Retrieval]: Information is take a first step towards providing an open source tool to
Search and Retrieval—query formulation, retrieval models, help with this problem in the Japanese language.
search process; I.7.1 [Document and Text Processing]: The rest of this paper is organized as follows: Section 2
Document and Text Editing—languages, spelling; I.7.3 presents string matching methods for English and Japanese;
[Document and Text Processing]: Text Processing— Section 3 describes our proposed phonetic matching ap-
index generation proach for Japanese; Section 4 reports our findings based
on a preliminary experimental study; and Section 5 presents
conclusions and future work.
General Terms
Open Source RDBMS, Approximate String Matching, Fuzzy 2. RELATED WORK
String Matching, Phonetic Matching, Japanese Information Phonetic matching is a type of approximate string match-
Retrieval ing. As with other approximate pattern matching methods,
edit distance[6] and character-based n-gram search[8] are
1. INTRODUCTION commonly used. The seminal approach to phonetic match-
One interesting but cumbersome problem in IR and NLP ing is the Soundex Indexing System[5]. It can accurately
research is mismatches between the spelling of words. A assign the same codeword to two different surnames that
simple question for this problem is: What if two words have sound the same, but are spelled differently, like SMITH
the same meaning but different spellings? This is the issue and SMYTH. The basic coding rules of Soundex are shown
explored in this paper. A related but more difficult problem in Figure 1. Both SMITH and SMYTH are encoded as
are homographs, or words with the same spelling. The latter S530 by Soundex. In English and other languages, revised
requires word sense disambiguation (WSD)[7], which is the versions and alternatives of Soundex have been proposed. A
task of identifying the meaning of words in a context. In set of these string matching functions, generally referred to
this paper, we focus only on words with the same meaning as “fuzzy string matching”, are deployed in an open source
programming language. These functions are assembled in
an open source relational database, PostgreSQL1 as well and
are applied to the objects in the database or combined with
The copyright of this article remains with the authors. other relational operations.
SIGIR 2012 Workshop on Open Source Information Retrieval. 1
August 16, 2012, Portland, Oregon, USA. http://www.postgresql.org/

68
Step-1 Retain the first symbol and drop all vowels.
Step-2 Replace consonants with the following code. Table 1: The Japanese Syllabary (Fifty Sounds).
Hiragana Symbol Katakana Symbol
b, f, p, v → 1 l → 4 A I U E O A I U E O
c, g, j, k, q, s, x, z → 2 m, n → 5 φ あ い う え お ア イ ウ エ オ 1
d, t → 3 r → 6 a i u e o a i u e o
K か き く け こ カ キ ク ケ コ 2
Step-3 Remove the duplication of code numbers. ka ki ku ke ko ka ki ku ke ko
Step-4 Continue until you get three code numbers. S さ し す せ そ サ シ ス セ ソ 3
If you run out symbols, fill in 0’s sa si su se so sa si su se so
until there are three code numbers. T た ち つ て と タ チ ツ テ ト 4
ta ti tu te to ta ti tu te to
Figure 1: The Soundex Indexing System. N な に ぬ ね の ナ ニ ヌ ネ ノ 5
na ni nu ne no na ni nu ne no
H は ひ ふ へ ほ ハ ヒ フ ヘ ホ 6
For English, phonetic matching for IR systems has been ha hi hu he ho ha hi hu he ho
studied extensively[1, 3, 11]. If Japanese documents are M ま み む め も マ ミ ム メ モ 7
transliterated into a Latin alphabet, English methods can ma mi mu me mo ma mi mu me mo
be applied. However, transliteration contains its inherent Y や ゆ よ ヤ ユ ヨ 8
problems. Some input characters are lost due to the lack ya yu yo ya yu yo
of correspondence between Japanese and Latin alphabets. R ら り る れ ろ ラ リ ル レ ロ 9
How to reduce the transliteration errors is an interesting ra ri ru re ro ra ri ru re ro
related problem, and will be considered in our future W わ ゐ ゑ を ワ ヰ ヱ ヲ 10
research. The issue of effective transliteration was explored wa wi we wo wa wi we wo
by Karimi et al.[4]. Our research interest in this paper is 1 2 3 4 5 1 2 3 4 5
to develop native matching functions for Japanese search
engines without transliteration.

3. OUR METHODOLOGY 3.2 Symbol groups in Japanese


In this section, we present our approach to symbol Table 2 shows symbol groups for our phonetic matching
grouping and phonetic matching in Japanese. functions. The symbol groups are assembled based on
the similarity of Japanese speech sounds. Each symbol
3.1 Japanese writing system and syllabary group is given a unique identifier (ID). In the table, text
in parentheses expresses a commentary on each group and
In the Japanese language, the number of phonetic sounds
the corresponding consonant, e.g., K for カ (ka) and S for サ
is relatively small, and simply expressed in the 5 × 10 grid in
(sa). Vowel symbols, such as ア (a) and イ (i), do not have
Table 1. It shows the Japanese syllabary, “五十音(Gojūon)”
corresponding consonants, and hence φ is in the parentheses.
meaning “Fifty Sounds” in English. In the table, both Hira-
Table 2 composed of 3 distinct parts: “Fifty Sounds,”
gana symbols (the rounded syllabic symbols) and Katakana
“Voiced Sounds,” and “Additional Sounds.” As a whole,
symbols (the angular syllabic symbols) are displayed. Vowel
the table covers all speech sounds in modern Japanese that
symbols, such as ア (a) and イ (i), do not have corresponding
are writable with Katakana symbols in UTF-8 character
consonants, and this is represented as φ in the leftmost
encoding. Different from English phonetic matching that
column. In the table, the gray cells are vacant because the
uses the Latin alphabet or Arabic numeral for input/output
symbols become lost over time. In modern Japanese, there
strings, our matching functions take Katakana symbols for
is no symbol for Y+I (yi), Y+E (ye), W+U (wu), and the
input strings and use Hiragana symbols for output strings.
pronunciation for them are the same as φ+I (i), φ+E (e),
In UTF-8 character encoding, Katakana symbols are from
φ+U (u), respectively. The symbols for W+I (wi) and W+E
ァ (E382A1) to ヶ (E383B6), and the number of Katakana
(we) are outdated, but can still be used in modern Japanese.
symbols is 86. Hiragana symbols are from ぁ (E38181) to
They are normally replaced with symbols for φ+I (i) and
ん (E38293), and the number of Hiragana symbols is 83.
φ+E (e) because of the similarity of the pronunciation. In
The three symbols, ヴ (E383B4), ヵ (E383B5), ヶ (E383B6)
addition to the symbols in the table, the Japanese writing
are special symbols. They are defined in the Katakana part
system takes an additional symbol, ン (n, a syllabic nasal),
only, and there is no corresponding Hiragana symbol for
symbols with a voiced/semi-voiced sound mark, such as ガ
these symbols.
(ga) or パ (pa), lower-case symbols, such as ッ (tsu, a double
In Table 2, Katakana symbols for “Fifty Sounds” (F-01
consonant), or ャ (ya, a contracted sound), and a diacritical
to F-11) are mostly the same as those in Table 1 with the
mark for a prolonged sound, such as ー (a macron). For
exception of Katakana symbols for wi, we and wo because
web queries and documents, classical spellings, such as the
the Katakana symbols, ヰ (wi), ヱ (we), and ヲ (wo) have the
usage of obsolete symbols, ヱ (we) or an uncommon usage of
same pronunciation as イ (i), エ (e), and オ (o), respectively,
lower-case symbols, e.g, ヶ (ke) substituting for ケ (ke), may
in modern Japanese. Hence, F-02 are incorporated into
appear in web queries and documents for stylistic reasons, or
the same group as F-01 in Table 2. Similarly, V-03 are
simple mistakes. Japanese phonetic matching must account
separated from V-04, and incorporated into V-02 because
for such anomalies.
the Katakana symbols, ヂ (di) and ヅ (du) in modern

69
Table 2: The Symbol Groups for Japanese Phonetic Matching.

ID Fifty Sounds [in] Code [out] ID Voiced Sounds [in] Code [out] ID Additional Sounds [in] Code [out]
F-01 アイウエオ (φ) → E38182 あ A-01 ァィゥェォ (lower-case, φ) → E38182 あ
F-02 ヰヱヲ (obs., φ) → E38182 あ A-02 ー (macron, φ) → E38182 あ
F-03 カキクケコ (K) → E3818B か V-01 ガギグゲゴ (G) → E3818C が A-03 ヵヶ (lower-case, K) → E3818B か
F-04 サシスセソ (S) → E38195 さ V-02 ザジズゼゾ (Z) → E38196 ざ
V-03 ヂヅ (obs., Z) → E38196 ざ
F-05 タチツテト (T) → E3819F た V-04 ダデド (D) → E381A0 だ A-04 ッ (lower-case, T) → E381A3 っ
F-06 ナニヌネノ (N) → E381AA な A-05 ン (syllabic nasal, N) → E38293 ん
F-07 ハヒフヘホ (H) → E381AF は V-05 バビブベボ (B) → E381B0 ば
V-06 ヴ (V) → E381B0 ば
V-07 パピプペポ (P) → E381B1 ぱ
F-08 マミムメモ (M) → E381BE ま
F-09 ヤユヨ (Y) → E38284 や A-06 ャュョ (lower-case, Y) → E38283 ゃ
F-10 ラリルレロ (R) → E38289 ら
F-11 ワ (W) → E3828F わ A-07 ヮ (lower-case, W) → E3828F わ

Japanese have the same pronunciation as ジ (zi) and ズ (zu), jppm3 In order to simplify encoded symbols, merge groups
respectively. in the same line (the same row) in Table 2. To be
Each Katakana symbol for “Voiced Sounds” (V-01 to V- more specific, incorporate V-01 into F-03, incorporate
07) is a symbol with a voiced/semi-voiced sound mark. In V-02 and V-03 into F-04, incorporate V-04 and A-04
the table, voiceless and voiced (e.g., K and G), voiced and into F-05, incorporate V-05, V-06 and V-07 into F-07,
semi-voiced (e.g., B and P), and Japanese voiced and foreign incorporate A-06 into F-09. The first symbol is always
voiced (e.g., B and V) are all distinguished and separated retained in this function.
into different symbol groups. “Additional Sounds” (A-01 to
A-07) cover the rest of Katakana symbols and the diacritical jppm4 In order to simplify output codewords, drop A-01,
mark, ー (a macron). A-02, A-04 and A-06 in Table 2. The first symbol is
still retained as is.
3.3 Japanese Phonetic Matching
In the same way as the Soundex Coding System in 3.4 Implementation
English, the matching function in Japanese also encodes Our aim is to provide an open source component of
symbols according to symbol groups. Essential steps in the Japanese phonetic matching. We prototyped the phonetic
matching function are described as follows: matching functions in Japanese as an extended version
Step-1 Encode all strings in text DB in advance. of the fuzzystrmatch module in the open source rela-
tional database, PostgreSQL. We call the new module
Step-2 Encode a query string on arrival. jpfuzzystrmatch and provide its source code under an open
source license2 . To help understanding, an example of the
Step-3 Output a matching set of encoded symbols.
each of the 4 functions (jppm[1,2,3,4]) and an example of
The greatest challenge in designing phonetic matching English phonetic encoding systems with a transliteration
functions is deciding how to group similar phonetic symbols. system are attached with the source code. The module
Our approach accomplishes this challenge empirically rather contains user-defined C-Language functions, and is compiled
than theoretically by using the actual speech sound in into dynamically loadable objects (shared libraries). It is
modern Japanese. Our first phonetic matching function distinguished from PostgreSQL internal functions, and can
(Japanese phonetic matching; jppm) is as follows: be loaded by the server on demand.
jppm1 Retain the first symbol, and encodes the rest as
encoded symbols according to Table 2. The encoded 4. EXPERIMENTS
symbols are a sequence of Hiragana symbols in UTF- Evaluation of phonetic matching is an important problem
8 character encoding. While it categorizes Japanese to consider. In contrast to ad hoc evaluation in IR systems,
speech sounds in a rigorous manner, the output code- there are no standard test collections for phonetic matching
words tend to be too verbose, and consequently cause functions. In our experiment, we adapted a standard
mismatches with similar strings. test collection for IR systems, the NTCIR-9 Japanese
In order to reduce the number of codewords used, we Intent task, for evaluating phonetic matching functions
derive three revised versions from jppm1. They are altered in Japanese. To build a text database to query, 84
from the initial version as follows: million index terms were extracted from 67 million Japanese
documents in the ClueWeb09-JA collection. For document
jppm2 In order to simplify output code symbols, drop all processing, we employed MeCab3 as a morphological an-
symbols if they are vowels (F-01 and F-02) or an
2
“Additional Sound” (A-01 to A-07) in Table 2. The http://www.daisy.cs.gunma-u.ac.jp/jpfsm/
3
first symbol is always retained in this function. http://mecab.sourceforge.net/

70
5. CONCLUSION
Table 3: Experimental Results (Average).
In this paper, we have presented a set of new Japanese
CodeLen #Results EditDist Pscore
phonetic matching functions. We do not attempt to address
jppm1 8.2 6.0 1.88 3.06
the bigger issue of “word sense disambiguation” (WSD) and
jppm2 5.3 131.2 3.73 24.00
synonyms, but rather present a simple approach to captur-
jppm3 8.2 9.2 2.06 4.29
ing similar Japanese terms for ad hoc queries. A phonetic
jppm4 6.3 25.5 2.74 7.19
matching function takes an input sequence of symbols and
converts it into a generic shortened sequence of symbols in
order to find as many fuzzy matches as possible. Some of the
generic codewords may gather too many matches, including
alyzer in Japanese, and Indri4 as an indexing module to correct ones (the same meaning) and wrong ones (different
extract index terms from documents. In order to analyze meaning). However, this approach can be used as a first
how well the new functions work, sufficiently long Katakana pass filter to find potentially relevant documents in large
words from topic words in the test collection were used as Japanese document collections. This subset of documents
query strings. Specifically, we used all Katakana words can then be further refined using relevance feedback or more
which consists of 7 or more Katakana symbols; there were computationally expensive ranking metrics in subsequent
10 such words in the test set. retrieval steps. In future work, we intend to investigate
Since the judgments in the test collection were not the broader issue of WSD and synonyms using phonetic
developed to carry out phonetic matching, how to evaluate matching, and explore other applications of these simple
the developed functions needs to be considered. Generally matching functions.
speaking, it is difficult for human assessors to judge if
two strings are phonetically similar. For example, Zobel
and Dart[11] report a high level of inconsistency among
6. ACKNOWLEDGMENTS
assessors in such a judgment proces. We therefore adopted The first author was supported by JSPS KAKENHI
an automatic evaluation approach. Grant-in-Aid for Young Scientists (B) 21700273. The second
In order to evaluate phonetic matching in an automatic author was supported by the Australian Research Council.
manner, we make the following assumptions.
7. REFERENCES
• If the discrepancy between two spellings is small, [1] Amir, A., Efrat, A., and Srinivasan, S. Advances
phonetic similarity is naturally large. in phonetic word spotting. In CIKM ’01 Proceedings
(2001), ACM, pp. 580–582.
• If many matching strings are returned, the shortened [2] Deerwester, S. C., Dumais, S. T., Landauer,
sequence of symbols is sufficiently generic. T. K., Furnas, G. W., and Harshman, R. A.
Indexing by latent semantic analysis. JASIS 41, 6
Based on the above assumptions, we define the following (1990), 391–407.
score, Pscore to measure the performance of phonetic match- [3] French, J. C., Powell, L., and Schulman, E.
ing functions. Applications of approximate word matching in
information retrieval. In CIKM ’97 Proceedings
#Results (1997), ACM, pp. 9–15.
Pscore = [4] Karimi, S., Scholer, F., and Turpin, A. Machine
1 + avg(EditDist)
transliteration survey. ACM Comput. Surv. 43, 3
Here, #Results is the number of strings returned by (2011), 17:1–17:46.
a phonetic matching function, and avg(EditDist) is the [5] Knuth, D. E. The Art of Computer Programming:
average edit distance between the query string and each Volume 3. Addison-Wesley, 1973.
returned string. Table 3 shows the experimental results.
[6] Navarro, G. A guided tour to approximate string
The average values are across all 10 test queries. In the
matching. ACM Comput. Surv. 33, 1 (2001), 31–88.
table, jppm2 has the highest Pscore . However, manual
[7] Navigli, R. Word sense disambiguation: A survey.
inspection of the results showed that this function returned
ACM Comput. Surv. 41, 2 (2009), 10:1–10:69.
many strings that are phonetically similar, but are not
spelling variants of the query strings. For example, jppm2 [8] Okazaki, N., and Tsujii, J. Simple and efficient
obtained “matui ryousuke” (a male’s name in Japanese) and algorithm for approximate dictionary matching. In
“mattress queen” (a product name) for the query “matoriyo- Coling Proceedings (2010), pp. 851–859.
sika” (“Matryoshka doll”). Since jppm[1,3] have different [9] Slonim, N., and Tishby, N. Document clustering
grouping features from jppm2, some obtained strings are using word clusters via the information bottleneck
different among these functions. The results of jppm4 are method. In SIGIR ’00 Proceedings (2000),
a subset of those of jppm2 because the grouping features pp. 208–215.
are common. Putting together obtained strings, phonetic [10] Song, R., Zhang, M., Sakai, T., Kato, M., Liu,
matching can be a viable solution to find potential spelling Y., Sugimoto, M., Wang, Q., and Orii, N.
variants. However, the obtained strings still need to be Overview of the NTCIR-9 INTENT task. In
filtered using further semantic analysis. In future work, we Proceedings of NTCIR-9 Workshop Meeting (2011),
intend to evaluate these functions in the context of a full IR pp. 82–105.
task. [11] Zobel, J., and Dart, P. W. Phonetic string
matching: Lessons from information retrieval. In
4
http://sourceforge.net/apps/trac/lemur/ SIGIR ’96 Proceedings (1996), pp. 166–172.

71
.
View publication stats

You might also like