Lemur Toolkit Tutorial
Introductions Paul Ogilvie Trevor Strohman
Installation Linux, OS/X: extract software/lemur-4.3.2.tar.gz, then ./configure --prefix=/install/path && make && make install Windows: run software/lemur-4.3.2-install.exe Documentation in windoc/index.html
Overview Background in Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee   break
Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
Language Modeling for IR Estimate a multinomial probability distribution from the text Smooth the distribution with one estimated from the entire collection: P(w|θ_D) = (1 − λ) P(w|D) + λ P(w|C)
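The linear smoothing formula above can be sketched in a few lines of Python; the toy counts and λ = 0.4 below are illustrative, not from the tutorial:

```python
from collections import Counter

def smoothed_prob(word, doc_counts, coll_counts, lam=0.4):
    """P(w|theta_D) = (1 - lambda) * P(w|D) + lambda * P(w|C)."""
    p_doc = doc_counts[word] / sum(doc_counts.values())
    p_coll = coll_counts[word] / sum(coll_counts.values())
    return (1 - lam) * p_doc + lam * p_coll

doc = Counter("my dog has fleas".split())
coll = Counter("my dog has fleas the cat has a dog".split())
p = smoothed_prob("dog", doc, coll)
```

Note that a word absent from the document (e.g. "cat") still gets a nonzero probability from the collection model, which is exactly why smoothing prevents zero probabilities.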
Query Likelihood Estimate the probability that the document generated the query terms: P(Q|θ_D) = Π_q P(q|θ_D)
Kullback-Leibler Divergence Estimate models for document and query and compare: KL(θ_Q || θ_D) = Σ_w P(w|θ_Q) log( P(w|θ_Q) / P(w|θ_D) )
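A minimal sketch of KL-divergence scoring in Python; the query and document models below are illustrative (a real document model would be smoothed so no word has zero probability):

```python
import math

def kl_divergence(p_query, p_doc):
    """KL(theta_Q || theta_D) = sum_w P(w|theta_Q) * log(P(w|theta_Q) / P(w|theta_D))."""
    return sum(pq * math.log(pq / p_doc[w])
               for w, pq in p_query.items() if pq > 0)

query_model = {"literacy": 0.5, "africa": 0.5}
doc_model = {"literacy": 0.3, "africa": 0.2, "rates": 0.5}
score = -kl_divergence(query_model, doc_model)  # rank by negative divergence: higher is better
```

Ranking by −KL against a maximum-likelihood query model is rank-equivalent to query likelihood, which is why both appear as ranking functions here.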
Inference Networks Language models used to estimate beliefs of representation nodes q1 q2 qn q3 I d1 d2 d3 di
Summary of Ranking Techniques Documents are modeled with simple multinomial probability distributions over vocabulary usage The distributions are smoothed with a collection model to prevent zero probabilities; this has an idf-like effect on ranking Documents are ranked through generative or distribution-similarity measures Inference networks allow structured queries – the beliefs estimated are related to generative probabilities
Other Techniques (Pseudo-) Relevance Feedback Relevance Models  [Lavrenko 2001] Markov Chains  [Lafferty and Zhai 2001] n -Grams  [Song and Croft 1999] Term Dependencies  [Gao et al 2004, Metzler and Croft 2005]
Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
Indexing Document Preparation Indexing Parameters Time and Space Requirements
Two Index Formats KeyFile: term positions, metadata, offline incremental indexing, InQuery query language. Indri: term positions, metadata, fields / annotations, online incremental indexing, InQuery and Indri query languages.
Indexing – Document Preparation Document Formats: The Lemur Toolkit can inherently deal with several different document format types without any modification: TREC Text, TREC Web, plain text, HTML, XML, PDF, Mbox, Microsoft Word (*), Microsoft PowerPoint (*). (*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.
Indexing – Document Preparation If your documents are not in a format that the Lemur Toolkit can inherently process: If necessary, extract the text from the document. Wrap the plaintext in TREC-style wrappers: <DOC> <DOCNO> document_id </DOCNO> <TEXT>   Index this document text. </TEXT> </DOC> –  or – For more advanced users, write your own parser to extend the Lemur Toolkit.
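Wrapping extracted plaintext in TREC-style tags is easy to script; this Python sketch (document IDs and text are illustrative) produces the format shown above:

```python
def wrap_trec(doc_id, text):
    """Wrap plain text in a TREC-style <DOC> block that IndriBuildIndex can index."""
    return ("<DOC>\n"
            f"<DOCNO> {doc_id} </DOCNO>\n"
            "<TEXT>\n"
            f"{text}\n"
            "</TEXT>\n"
            "</DOC>\n")

record = wrap_trec("DOC001", "Index this document text.")
```

Concatenating many such records into one large file is preferable to one file per document (see the efficiency notes later in this tutorial).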
Indexing - Parameters Basic usage to build an index: IndriBuildIndex <parameter_file> The parameter file includes options for: where to find your data files, where to place the index, how much memory to use, stopwords, stemming, fields, and many other parameters.
Indexing – Parameters Standard parameter file specification an XML document: <parameters>   <option></option>   <option></option> …   <option></option> </parameters>
Indexing – Parameters <corpus>  - where to find your source files and what type to expect <path> :  (required) the path to the source files (absolute or relative) <class> :  (optional) the document type to expect. If omitted, IndriBuildIndex will attempt to guess at the filetype based on the file’s extension. <parameters> <corpus> <path> /path/to/source/files </path> <class> trectext </class> </corpus> </parameters>
Indexing - Parameters The  <index>  parameter tells IndriBuildIndex where to create or incrementally add to the index If the index does not exist, it will create a new one If the index already exists, it will append new documents into it <parameters> <index> /path/to/the/index </index> </parameters>
Indexing - Parameters <memory>  - used to define a “soft-limit” of the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for gigabytes. <parameters> <memory> 256M </memory> </parameters>
Indexing - Parameters Stopwords can be defined within a  <stopper>  block with individual stopwords within enclosed in  <word>  tags. <parameters> <stopper> <word> first_word </word> <word> next_word </word> … <word> final_word </word> </stopper> </parameters>
Indexing – Parameters Term stemming can be used while indexing as well via the  <stemmer>  tag. Specify the stemmer type via the  <name>  tag within. Stemmers included with the Lemur Toolkit include the Krovetz Stemmer and the Porter Stemmer. <parameters> <stemmer> <name> krovetz </name> </stemmer> </parameters>
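Putting the pieces above together, a complete IndriBuildIndex parameter file might look like the following sketch; the paths and stopword list are placeholders, not values from the tutorial:

```xml
<parameters>
  <index>/path/to/the/index</index>
  <memory>256M</memory>
  <corpus>
    <path>/path/to/source/files</path>
    <class>trectext</class>
  </corpus>
  <stemmer><name>krovetz</name></stemmer>
  <stopper>
    <word>the</word>
    <word>of</word>
  </stopper>
</parameters>
```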
Indexing anchor text Run  harvestlinks  application on your data before indexing <inlink>path-to-links</inlink>  as a parameter to IndriBuildIndex to index
Retrieval Parameters Query Formatting Interpreting Results
Retrieval - Parameters Basic usage for retrieval: IndriRunQuery/RetEval <parameter_file> The parameter file includes options for: where to find the index, the query or queries, how much memory to use, formatting options, and many other parameters.
Retrieval - Parameters Just as with indexing: A well-formed XML document with options, wrapped by <parameters> tags: <parameters>   <options></options>   <options></options> …   <options></options> </parameters>
Retrieval - Parameters The  <index>  parameter tells IndriRunQuery/RetEval where to find the repository. <parameters> <index> /path/to/the/index </index> </parameters>
Retrieval - Parameters The  <query>  parameter specifies a query, in plain text or in the Indri query language <parameters> <query> <number>1</number> <text> this is the first query </text> </query> <query> <number>2</number> <text> another query to run </text> </query> </parameters>
Retrieval - Parameters A free-text query will be interpreted as using the #combine operator: “this is a query” is equivalent to “#combine( this is a query )” More on the Indri query language operators in the next section
Retrieval – Query Formatting TREC-style topics cannot be processed directly by IndriRunQuery/RetEval. Format the queries accordingly: format by hand, or write a script to extract the fields.
Retrieval - Parameters As with indexing, the  <memory>  parameter can be used to define a “soft-limit” of the amount of memory the retrieval system uses. Use K for kilobytes, M for megabytes, and G for gigabytes. <parameters> <memory> 256M </memory> </parameters>
Retrieval - Parameters As with indexing, stopwords can be defined within a  <stopper>  block with individual stopwords within enclosed in  <word>  tags. <parameters> <stopper> <word> first_word </word> <word> next_word </word> … <word> final_word </word> </stopper> </parameters>
Retrieval – Parameters To specify a maximum number of results to return, use the  <count>  tag: <parameters> <count> 50 </count> </parameters>
Retrieval - Parameters Result formatting options: IndriRunQuery/RetEval has built-in formatting specifications for TREC and INEX retrieval tasks
Retrieval – Parameters TREC – Formatting directives: <runID> : a string specifying the id for a query run, used in TREC scorable output. <trecFormat> :  true  to produce TREC scorable output, otherwise use  false  (default). <parameters> <runID> runName </runID> <trecFormat> true </trecFormat> </parameters>
Outputting INEX Result Format Must be wrapped in  <inex>  tags <participant-id> : specifies the participant-id attribute used in submissions. <task> : specifies the task attribute (default CO.Thorough). <query> : specifies the query attribute (default automatic). <topic-part> : specifies the topic-part attribute (default T). <description> : specifies the contents of the description tag. <parameters> <inex> <participant-id> LEMUR001 </participant-id> </inex> </parameters>
Retrieval – Interpreting Results The default output from  IndriRunQuery  will return a list of results, 1 result per line, with 4 columns: <score> : the score of the returned document. An Indri query will always return a negative value for a result. <docID> : the document ID <extent_begin> : the starting token number of the extent that was retrieved <extent_end> : the ending token number of the extent that was retrieved
Retrieval – Interpreting Results When executing  IndriRunQuery  with the default formatting options, the output will look something like: <score> <DocID> <extent_begin> <extent_end> -4.83646 AP890101-0001 0 485 -7.06236 AP890101-0015   0 385
Retrieval - Evaluation To use trec_eval:  format  IndriRunQuery  results with appropriate trec_eval formatting directives in the parameter file: <runID>runName</runID> <trecFormat>true</trecFormat> Resulting output will be in standard TREC format ready for evaluation: <queryID> Q0 <DocID> <rank> <score> <runID> 150 Q0 AP890101-0001 1 -4.83646 runName 150 Q0 AP890101-0015 2 -7.06236 runName
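The TREC run format shown above (queryID Q0 docID rank score runID) can also be produced from a raw result list with a short script; the function and names here are illustrative, not part of the toolkit:

```python
def trec_run_lines(query_id, results, run_id):
    """Format (doc_id, score) pairs, best first, as TREC-scorable lines."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [f"{query_id} Q0 {doc} {rank} {score} {run_id}"
            for rank, (doc, score) in enumerate(ranked, start=1)]

lines = trec_run_lines(150, [("AP890101-0015", -7.06236),
                             ("AP890101-0001", -4.83646)], "runName")
```

Note the sort is descending: Indri scores are log probabilities, so less-negative is better.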
Smoothing <rule> method:linear,collectionLambda:0.4,documentLambda:0.2 </rule> <rule> method:dirichlet,mu:1000 </rule> <rule> method:twostage,mu:1500,lambda:0.4 </rule>
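A smoothing rule slots into the retrieval parameter file alongside the other options; a minimal sketch (the index path and query text are placeholders):

```xml
<parameters>
  <index>/path/to/the/index</index>
  <count>50</count>
  <rule>method:dirichlet,mu:1000</rule>
  <query>
    <number>1</number>
    <text>literacy rates africa</text>
  </query>
</parameters>
```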
Use RetEval for TF.IDF First run ParseToFile to convert TREC-formatted topics into queries <parameters> <docFormat> format </docFormat> <outputFile> filename </outputFile> <stemmer> stemmername </stemmer> <stopwords> stopwordfile </stopwords> </parameters> ParseToFile paramfile queryfile http://www.lemurproject.org/lemur/parsing.html#parsetofile
Use RetEval for TF.IDF Then run RetEval <parameters> <index> index </index> <retModel> 0 </retModel> <!-- 0 for TF-IDF, 1 for Okapi, 2 for KL-divergence, 5 for cosine similarity --> <textQuery> queries.reteval </textQuery> <resultCount> 1000 </resultCount> <resultFile> tfidf.res </resultFile> </parameters> RetEval paramfile queryfile http://www.lemurproject.org/lemur/retrieval.html#RetEval
Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
Indri Query Language terms, field restriction / evaluation, numeric operators, combining beliefs, field / passage retrieval, filters, document priors http://www.lemurproject.org/lemur/IndriQueryLanguage.html
Term Operations (name – example – behavior):
term – dog – occurrences of dog (Indri will stem and stop)
“term” – “dog” – occurrences of dog (Indri will not stem or stop)
ordered window – #odN(blue car) – blue N words or less before car
unordered window – #udN(blue car) – blue within N words of car
synonym list – #syn(car automobile) – occurrences of car or automobile
weighted synonym – #wsyn(1.0 car 0.5 automobile) – like synonym, but only counts occurrences of automobile as 0.5 of an occurrence
any operator – #any:person – all occurrences of the person field
Field Restriction/Evaluation (name – example – behavior):
restriction – dog.title – counts only occurrences of dog in the title field
restriction – dog.title,header – counts occurrences of dog in title or header
evaluation – dog.(title) – builds belief b(dog) using the title language model
evaluation – dog.(title,header) – b(dog) estimated using a language model from the concatenation of all title and header fields
evaluation – #od1(trevor strohman).person(title) – builds a model from all title text for b(#od1(trevor strohman).person); only counts “trevor strohman” occurrences in person fields
Numeric Operators (name – example – behavior):
less – #less(year 2000) – occurrences of year field < 2000
greater – #greater(year 2000) – year field > 2000
between – #between(year 1990 2000) – 1990 < year field < 2000
equals – #equals(year 2000) – year field = 2000
Belief Operations (name – example – behavior):
combine – #combine(dog train) – 0.5 log( b(dog) ) + 0.5 log( b(train) )
weight, wand – #weight(1.0 dog 0.5 train) – 0.67 log( b(dog) ) + 0.33 log( b(train) )
wsum – #wsum(1.0 dog 0.5 dog.(title)) – log( 0.67 b(dog) + 0.33 b(dog.(title)) )
not – #not(dog) – log( 1 - b(dog) )
max – #max(dog train) – returns maximum of b(dog) and b(train)
or – #or(dog cat) – log( 1 - (1 - b(dog)) * (1 - b(cat)) )
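The #weight operator's normalized log-linear combination can be sketched in Python; the belief values below are illustrative, and the weight normalization matches the 0.67/0.33 example above:

```python
import math

def weight_op(weighted_beliefs):
    """#weight(w1 b1 w2 b2 ...): sum of normalized weights times log beliefs."""
    total = sum(w for w, _ in weighted_beliefs)
    return sum((w / total) * math.log(b) for w, b in weighted_beliefs)

# #weight(1.0 dog 0.5 train) -> (2/3) log b(dog) + (1/3) log b(train)
score = weight_op([(1.0, 0.2), (0.5, 0.1)])
```

Working in log space is what makes #combine a special case of #weight with equal weights.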
Field/Passage Retrieval (name – example – behavior):
field retrieval – #combine[title]( query ) – returns only title fields ranked according to #combine(query); beliefs are estimated on each title’s language model; may use any belief node
passage retrieval – #combine[passage200:100]( query ) – dynamically created passages of length 200, created every 100 words, are ranked by #combine(query)
More Field/Passage Retrieval .//field for ancestor, ./field for parent. Example: #combine[section]( bootstrap #combine[./title]( methodology )) ranks sections matching bootstrap where the section’s title also matches methodology
Filter Operations (name – example – behavior):
filter require – #filreq(elvis #combine(blue shoes)) – rank documents that contain elvis by #combine(blue shoes)
filter reject – #filrej(shopping #combine(blue shoes)) – rank documents that do not contain shopping by #combine(blue shoes)
Document Priors A RECENT prior built using the makeprior application: #combine(#prior(RECENT) global warming) A prior is treated as any belief during ranking; the RECENT prior could give higher scores to more recent documents
Ad Hoc Retrieval Query likelihood #combine( literacy rates africa ) Rank by P(Q|D) = Π_q P(q|D)
Query Expansion #weight( 0.75 #combine( literacy rates africa )    0.25 #combine( additional terms ))
Known Entity Search Mixture of multinomials #combine( #wsum( 0.5 bbc.(title) 0.3 bbc.(anchor) 0.2 bbc ) #wsum( 0.5 news.(title) 0.3 news.(anchor) 0.2 news ) ) P(q|D) = 0.5 P(q|title) + 0.3 P(q|anchor) + 0.2 P(q|document)
Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
Indexing Your Data PDF, Word documents, PowerPoint, HTML Use IndriBuildIndex to index your data directly TREC collection Use IndriBuildIndex or BuildIndex Large text corpus Many different options
Indexing Text Corpora Split data into one XML file per document Pro: Easiest option Pro: Use any language you like (Perl, Python) Con: Not very efficient For efficiency, large files are preferred Small files cause internal filesystem fragmentation Small files are harder to open and read efficiently
Indexing: Offset Annotation Tag data does not have to be in the file; add extra tag data using an offset annotation file. Format: docno type id name start length value parent. Example: DOC001 TAG 1 title 10 50 0 0 (“Add a title tag to DOC001 starting at byte 10 and continuing for 50 bytes”)
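Generating an offset annotation file is a one-liner per record; this Python sketch assumes tab-separated columns in the order docno type id name start length value parent (the field values are the example's):

```python
def annotation_line(docno, ann_id, name, start, length,
                    ann_type="TAG", value=0, parent=0):
    """One offset-annotation record: docno type id name start length value parent."""
    return f"{docno}\t{ann_type}\t{ann_id}\t{name}\t{start}\t{length}\t{value}\t{parent}"

line = annotation_line("DOC001", 1, "title", 10, 50)
```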
Indexing Text Corpora Format data in TREC format Pro: Almost as easy as individual XML docs Pro: Use any language you like Con: Not great for online applications Direct news feeds Data comes from a database
Indexing Text Corpora Write your own parser Pro: Fast Pro: Best flexibility, both in integration and in data interpretation Con: Hardest option Con: Smallest language choice (C++ or Java)
Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
ParsedDocument
struct ParsedDocument {
  const char* text;
  size_t textLength;
  indri::utility::greedy_vector<char*> terms;
  indri::utility::greedy_vector<indri::parse::TagExtent*> tags;
  indri::utility::greedy_vector<indri::parse::TermExtent> positions;
  indri::utility::greedy_vector<indri::parse::MetadataPair> metadata;
};
ParsedDocument: Text const char* text; size_t textLength; A null-terminated string of document text Text is compressed and stored in the index for later use (such as snippet generation)
ParsedDocument: Content const char* content; size_t contentLength; A string of document text This is a substring of text; this is used in case the whole text string is not the core document For instance, maybe the text string includes excess XML markup, but the content section is the primary text
ParsedDocument: Terms indri::utility::greedy_vector<char*> terms; document = “My dog has fleas.” terms = { “My”, “dog”, “has”, “fleas” } A list of terms in the document Order matters – word order will be used in term proximity operators A greedy_vector is effectively an STL vector with a different memory allocation policy
ParsedDocument: Terms indri::utility::greedy_vector<char*> terms; Term data will be normalized (downcased, some punctuation removed) later Stopping and stemming can be handled within the indexer Parser’s job is just tokenization
ParsedDocument: Tags indri::utility::greedy_vector<indri::parse::TagExtent*> tags; TagExtent: const char* name; unsigned int begin; unsigned int end; INT64 number; TagExtent *parent; greedy_vector<AttributeValuePair> attributes;
ParsedDocument: Tags name The name of the tag begin, end Word offsets (relative to content) of the beginning and end of the tag. My  <animal> dirty dog </animal>  has fleas. name = “animal”, begin = 2, end = 3
ParsedDocument: Tags number A numeric component of the tag (optional) sample document This document was written in  <year> 2006 </year> . sample query #between( year 2005 2007 )
ParsedDocument: Tags parent The logical parent of the tag <doc> <par> <sent> My dog still has fleas. </sent>   <sent> My cat does not have fleas. </sent> </par> </doc>
ParsedDocument: Tags attributes Attributes of the tag My  <a href=“index.html”> home page </a> . Note: Indri cannot index tag attributes.  They are used for conflation and extraction purposes only.
ParsedDocument: Metadata Metadata is text about a document that should be kept, but not indexed: TREC Document ID (WTX001-B01-00) Document URL Crawl date greedy_vector<indri::parse::MetadataPair> metadata
Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
Tag Conflation <ENAMEX TYPE=“ORGANIZATION”>   <ORGANIZATION> <ENAMEX TYPE=“PERSON”> <PERSON>
Indexing Fields Parameters: Name :  name of the XML tag, all lowercase Numeric :  whether this field can be retrieved using the numeric operators, like #between and #less Forward :  true if this field should be efficiently retrievable given the document number See QueryEnvironment::documentMetadata Backward :  true if this document should be retrievable given this field data See QueryEnvironment::documentsFromMetadata
Indexing Fields <parameters> <field> <name>title</name> <backward>true</backward> </field> <field> <name>gradelevel</name> <numeric>true</numeric> </field> </parameters>
Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
dumpindex dumpindex is a versatile and useful tool Use it to explore your data Use it to verify the contents of your index Use it to extract information from the index for use outside of Lemur
dumpindex Extracting the vocabulary % dumpindex ap89 v TOTAL 39192948 84678 the 2432559 84413 of 1063804 83389 to 1006760 82505 a 898999 82712 and 877433 82531 in 873291 82984 said 505578 76240 word  term_count doc_count
dumpindex Extracting a single term  % dumpindex ap89 tp ogilvie ogilvie ogilvie 8 39192948 (term, stem, count, total_count)  6056 1 1027 954  11982 1 619 377  15775 1 155 66  45513 3 519 216 275 289  55132 1 668 452  65595 1 514 315 (document, count, positions)
dumpindex Extracting a document % dumpindex ap89 dt 5 <DOCNO> AP890101-0005 </DOCNO> <FILEID>AP-NR-01-01-89 0113EST</FILEID> … <TEXT> The Associated Press reported erroneously on Dec. 29 that Sen. James Sasser, D-Tenn., wrote a letter to the chairman of the Federal Home Loan Bank Board, M. Danny Wall… </TEXT>
dumpindex Extracting a list of expression matches % dumpindex ap89 e “#1(my dog)” #1(my dog) #1(my dog) 0 0 8270 1 505 507 8270 1 709 711 16291 1 789 791 17596 1 672 674 35425 1 432 434 46265 1 777 779 51954 1 664 666 81574 1 532 534 document, weight, begin, end
Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
Introducing the API Lemur “Classic” API Many objects, highly customizable May want to use this when you want to change how the system works Support for clustering, distributed IR, summarization Indri API Two main objects Best for integrating search into larger applications Supports Indri query language, XML retrieval, “live” incremental indexing, and parallel retrieval
Indri: IndexEnvironment Most of the time, you will index documents with IndriBuildIndex Using this class is necessary if: you build your own parser, or you want to add documents to an index while queries are running Can be used from C++ or Java
Indri: IndexEnvironment Most important methods: addFile: adds a file of text to the index addString: adds a document (in a text string) to the index addParsedDocument: adds a ParsedDocument structure to the index setIndexedFields: tells the indexer which fields to store in the index
Indri: QueryEnvironment The core of the Indri API Includes methods for: Opening indexes and connecting to query servers Running queries Collecting collection statistics Retrieving document text Can be used from C++, Java, PHP or C#
QueryEnvironment: Opening Opening methods: addIndex: opens an index from the local disk addServer: opens a connection to an Indri daemon (IndriDaemon or indrid) Indri treats all open indexes as a single collection; query results will be identical to those you’d get by storing all documents in a single index
QueryEnvironment: Running Running queries: runQuery: runs an Indri query, returns a ranked list of results (can add a document set in order to restrict evaluation to a few documents) runAnnotatedQuery: returns a ranked list of results and a list of all document locations where the query matched something
QueryEnvironment: Retrieving Retrieving document text: documents: returns the full text of a set of documents documentMetadata: returns portions of the document (e.g. just document titles) documentsFromMetadata: returns documents that contain a certain bit of metadata (e.g. a URL) expressionList: an inverted list for a particular Indri query language expression
Lemur “Classic” API Primarily useful for retrieval operations Most indexing work in the toolkit has moved to the Indri API Indri indexes can be used with Lemur “Classic” retrieval applications Extensive documentation and tutorials on the website (more are coming)
Lemur Index Browsing The Lemur API gives access to the index data (e.g. inverted lists, collection statistics) IndexManager::openIndex Returns a pointer to an index object Detects what kind of index you wish to open, and returns the appropriate kind of index class docInfoList (inverted list), termInfoList (document vector), termCount, documentCount
Lemur Index Browsing Index::term term( char* s ) : convert term string to a number term( int id ) : convert term number to a string Index::document document( char* s ) : convert doc string to a number document( int id ) : convert doc number to a string
Lemur Index Browsing Index::termCount termCount() :  Total number of terms indexed termCount( int id ) : Total number of occurrences of term number  id . Index::documentCount docCount() :  Number of documents indexed docCount( int id ) : Number of documents that contain term number  id .
Lemur Index Browsing Index::docLength( int docID ) The length, in number of terms, of document number  docID . Index::docLengthAvg Average indexed document length Index::termCountUnique Size of the index vocabulary
Lemur: DocInfoList Index::docInfoList( int termID ) Returns an iterator to the inverted list for  termID . The list contains all documents that contain termID,  including the positions where  termID occurs.
Lemur: TermInfoList Index::termInfoList( int docID ) Returns an iterator to the direct list for  docID . The list contains term numbers for every term contained in document  docID,  and the number  of times each word occurs. (use termInfoListSeq to get word positions)
Lemur Retrieval (class name – description): TFIDFRetMethod – BM25; SimpleKLRetMethod – KL-Divergence; InQueryRetMethod – Simplified InQuery; CosSimRetMethod – Cosine; CORIRetMethod – CORI; OkapiRetMethod – Okapi; IndriRetMethod – Indri (wraps QueryEnvironment)
Lemur Retrieval RetMethodManager::runQuery query: text of the query index: pointer to a Lemur index modeltype: “cos”, “kl”, “indri”, etc. stopfile: filename of your stopword list stemtype: stemmer datadir: not currently used func: only used for Arabic stemmer
Lemur: Other tasks Clustering: ClusterDB Distributed IR: DistMergeMethod Language models: UnigramLM, DirichletUnigramLM, etc.
Getting Help http://www.lemurproject.org Central website, tutorials, documentation, news http://www.lemurproject.org/phorum Discussion board, developers read and respond to questions http://ciir.cs.umass.edu/~strohman/indri My own page of Indri tips README file in the code distribution
Concluding: In Review Paul About the toolkit About Language Modeling, IR methods Indexing a TREC collection Running TREC queries Interpreting query results
Concluding: In Review Trevor Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
Questions Ask us questions! What is the best way to do  x ? How do I get started with my particular task? Does the toolkit have the  x  feature? How can I modify the toolkit to do  x ? When do we get coffee?

ShapeBlue
 
PPTX
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
PDF
Arcee AI - building and working with small language models (06/25)
Julien SIMON
 
PDF
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
PDF
visibel.ai Company Profile – Real-Time AI Solution for CCTV
visibelaiproject
 
PDF
CIFDAQ Market Insight for 14th July 2025
CIFDAQ
 
PDF
Lecture A - AI Workflows for Banking.pdf
Dr. LAM Yat-fai (林日辉)
 
PDF
Shuen Mei Parth Sharma Boost Productivity, Innovation and Efficiency wit...
AWS Chicago
 
PPTX
Top Managed Service Providers in Los Angeles
Captain IT
 
How Current Advanced Cyber Threats Transform Business Operation
Eryk Budi Pratama
 
Extensions Framework (XaaS) - Enabling Orchestrate Anything
ShapeBlue
 
Bitcoin+ Escalando sin concesiones - Parte 1
Fernando Paredes García
 
Earn Agentblazer Status with Slack Community Patna.pptx
SanjeetMishra29
 
"Effect, Fiber & Schema: tactical and technical characteristics of Effect.ts"...
Fwdays
 
NewMind AI Journal - Weekly Chronicles - July'25 Week II
NewMind AI
 
TrustArc Webinar - Data Privacy Trends 2025: Mid-Year Insights & Program Stra...
TrustArc
 
Darren Mills The Migration Modernization Balancing Act: Navigating Risks and...
AWS Chicago
 
Machine Learning Benefits Across Industries
SynapseIndia
 
TYPES OF COMMUNICATION Presentation of ICT
JulieBinwag
 
CloudStack GPU Integration - Rohit Yadav
ShapeBlue
 
Simplifying End-to-End Apache CloudStack Deployment with a Web-Based Automati...
ShapeBlue
 
Building and Operating a Private Cloud with CloudStack and LINBIT CloudStack ...
ShapeBlue
 
Arcee AI - building and working with small language models (06/25)
Julien SIMON
 
Human-centred design in online workplace learning and relationship to engagem...
Tracy Tang
 
visibel.ai Company Profile – Real-Time AI Solution for CCTV
visibelaiproject
 
CIFDAQ Market Insight for 14th July 2025
CIFDAQ
 
Lecture A - AI Workflows for Banking.pdf
Dr. LAM Yat-fai (林日辉)
 
Shuen Mei Parth Sharma Boost Productivity, Innovation and Efficiency wit...
AWS Chicago
 
Top Managed Service Providers in Los Angeles
Captain IT
 

Lemur Tutorial at SIGIR 2006

  • 13. Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
  • 14. Indexing Document Preparation Indexing Parameters Time and Space Requirements
  • 15. Two Index Formats KeyFile Term Positions Metadata Offline Incremental InQuery Query Language Indri Term Positions Metadata Fields / Annotations Online Incremental InQuery and Indri Query Languages
  • 16. Indexing – Document Preparation Document Formats: The Lemur Toolkit can inherently deal with several different document format types without any modification: TREC Text, TREC Web, Plain Text, HTML, XML, PDF, Mbox, Microsoft Word (*), Microsoft PowerPoint (*) (*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.
  • 17. Indexing – Document Preparation If your documents are not in a format that the Lemur Toolkit can inherently process: If necessary, extract the text from the document. Wrap the plaintext in TREC-style wrappers: <DOC> <DOCNO> document_id </DOCNO> <TEXT> Index this document text. </TEXT> </DOC> – or – For more advanced users, write your own parser to extend the Lemur Toolkit.
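The wrapping step on slide 17 is easy to script in any language. A minimal Python sketch (the `document_id` and text values are placeholders, not real data):

```python
def wrap_trec(doc_id, text):
    """Wrap plain text in a TREC-style <DOC> block that IndriBuildIndex can read."""
    return ("<DOC>\n"
            f"<DOCNO> {doc_id} </DOCNO>\n"
            "<TEXT>\n"
            f"{text}\n"
            "</TEXT>\n"
            "</DOC>\n")

print(wrap_trec("document_id", "Index this document text."))
```

Concatenating many such blocks into one large file also avoids the small-file inefficiencies discussed later in the tutorial.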
  • 18. Indexing - Parameters Basic usage to build an index: IndriBuildIndex <parameter_file> Parameter file includes options for Where to find your data files Where to place the index How much memory to use Stopwords, stemming, fields Many other parameters.
  • 19. Indexing – Parameters The standard parameter file is an XML document: <parameters> <option></option> <option></option> … <option></option> </parameters>
  • 20. Indexing – Parameters <corpus> - where to find your source files and what type to expect <path> : (required) the path to the source files (absolute or relative) <class> : (optional) the document type to expect. If omitted, IndriBuildIndex will attempt to guess at the filetype based on the file’s extension. <parameters> <corpus> <path> /path/to/source/files </path> <class> trectext </class> </corpus> </parameters>
  • 21. Indexing - Parameters The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index If the index does not exist, it will create a new one If the index already exists, it will append new documents into the index. <parameters> <index> /path/to/the/index </index> </parameters>
  • 22. Indexing - Parameters <memory> - used to define a “soft-limit” of the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for gigabytes. <parameters> <memory> 256M </memory> </parameters>
  • 23. Indexing - Parameters Stopwords can be defined within a <stopper> block, with individual stopwords enclosed in <word> tags. <parameters> <stopper> <word> first_word </word> <word> next_word </word> … <word> final_word </word> </stopper> </parameters>
  • 24. Indexing – Parameters Term stemming can be used while indexing as well via the <stemmer> tag. Specify the stemmer type via the <name> tag within. Stemmers included with the Lemur Toolkit include the Krovetz Stemmer and the Porter Stemmer. <parameters> <stemmer> <name> krovetz </name> </stemmer> </parameters>
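Putting slides 18–24 together, a complete IndriBuildIndex parameter file could look like the sketch below (the paths and stopword list are illustrative only):

```xml
<parameters>
  <corpus>
    <path>/path/to/source/files</path>
    <class>trectext</class>
  </corpus>
  <index>/path/to/the/index</index>
  <memory>256M</memory>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
  <stopper>
    <word>the</word>
    <word>of</word>
  </stopper>
</parameters>
```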
  • 25. Indexing anchor text Run the harvestlinks application on your data before indexing Pass <inlink>path-to-links</inlink> as a parameter to IndriBuildIndex to index the anchor text
  • 26. Retrieval Parameters Query Formatting Interpreting Results
  • 27. Retrieval - Parameters Basic usage for retrieval: IndriRunQuery/RetEval <parameter_file> Parameter file includes options for Where to find the index The query or queries How much memory to use Formatting options Many other parameters.
  • 28. Retrieval - Parameters Just as with indexing: A well-formed XML document with options, wrapped by <parameters> tags: <parameters> <options></options> <options></options> … <options></options> </parameters>
  • 29. Retrieval - Parameters The <index> parameter tells IndriRunQuery/RetEval where to find the repository. <parameters> <index> /path/to/the/index </index> </parameters>
  • 30. Retrieval - Parameters The <query> parameter specifies a query plain text or using the Indri query language <parameters> <query> <number>1</number> <text> this is the first query </text> </query> <query> <number>2</number> <text> another query to run </text> </query> </parameters>
  • 31. Retrieval - Parameters A free-text query will be interpreted as using the #combine operator: “this is a query” is equivalent to “#combine( this is a query )” More on the Indri query language operators in the next section
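The rewrite described above is mechanical; a hypothetical helper that mimics it (this is a sketch of the behavior, not Indri's own code):

```python
def to_indri_query(text):
    """Wrap a free-text query in #combine, as IndriRunQuery does implicitly."""
    stripped = text.strip()
    # Queries that already start with an Indri operator are left untouched.
    if stripped.startswith("#"):
        return stripped
    return f"#combine( {stripped} )"

print(to_indri_query("this is a query"))  # #combine( this is a query )
```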
  • 32. Retrieval – Query Formatting TREC-style topics cannot be processed directly by IndriRunQuery/RetEval. Format the queries accordingly: Format by hand Write a script to extract the fields
  • 33. Retrieval - Parameters As with indexing, the <memory> parameter can be used to define a “soft-limit” of the amount of memory the retrieval system uses. Use K for kilobytes, M for megabytes, and G for gigabytes. <parameters> <memory> 256M </memory> </parameters>
  • 34. Retrieval - Parameters As with indexing, stopwords can be defined within a <stopper> block, with individual stopwords enclosed in <word> tags. <parameters> <stopper> <word> first_word </word> <word> next_word </word> … <word> final_word </word> </stopper> </parameters>
  • 35. Retrieval – Parameters To specify a maximum number of results to return, use the <count> tag: <parameters> <count> 50 </count> </parameters>
  • 36. Retrieval - Parameters Result formatting options: IndriRunQuery/RetEval has built-in formatting specifications for TREC and INEX retrieval tasks
  • 37. Retrieval – Parameters TREC – Formatting directives: <runID> : a string specifying the id for a query run, used in TREC scorable output. <trecFormat> : true to produce TREC scorable output, otherwise use false (default). <parameters> <runID> runName </runID> <trecFormat> true </trecFormat> </parameters>
  • 38. Outputting INEX Result Format Must be wrapped in <inex> tags <participant-id> : specifies the participant-id attribute used in submissions. <task> : specifies the task attribute (default CO.Thorough). <query> : specifies the query attribute (default automatic). <topic-part> : specifies the topic-part attribute (default T). <description> : specifies the contents of the description tag. <parameters> <inex> <participant-id> LEMUR001 </participant-id> </inex> </parameters>
  • 39. Retrieval – Interpreting Results The default output from IndriRunQuery will return a list of results, 1 result per line, with 4 columns: <score> : the score of the returned document. An Indri query will always return a negative value for a result. <docID> : the document ID <extent_begin> : the starting token number of the extent that was retrieved <extent_end> : the ending token number of the extent that was retrieved
  • 40. Retrieval – Interpreting Results When executing IndriRunQuery with the default formatting options, the output will look something like: <score> <DocID> <extent_begin> <extent_end> -4.83646 AP890101-0001 0 485 -7.06236 AP890101-0015 0 385
  • 41. Retrieval - Evaluation To use trec_eval: format IndriRunQuery results with appropriate trec_eval formatting directives in the parameter file: <runID>runName</runID> <trecFormat>true</trecFormat> Resulting output will be in standard TREC format ready for evaluation: <queryID> Q0 <DocID> <rank> <score> <runID> 150 Q0 AP890101-0001 1 -4.83646 runName 150 Q0 AP890101-0015 2 -7.06236 runName
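The TREC submission line shown above is simple enough to generate or sanity-check by hand; a sketch of the layout (values are taken from the slide's example):

```python
def trec_line(query_id, doc_id, rank, score, run_id):
    """Format one result as: <queryID> Q0 <DocID> <rank> <score> <runID>."""
    return f"{query_id} Q0 {doc_id} {rank} {score} {run_id}"

print(trec_line(150, "AP890101-0001", 1, -4.83646, "runName"))
```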
  • 42. Smoothing <rule> method:linear,collectionLambda:0.4,documentLambda:0.2 </rule> <rule> method:dirichlet,mu:1000 </rule> <rule> method:twostage,mu:1500,lambda:0.4 </rule>
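The earlier slides give the smoothed estimate P(w|θ_D) = (1−λ)P(w|D) + λP(w|C); the <rule> methods above correspond to different ways of setting that interpolation. A sketch of the linear (Jelinek-Mercer) and Dirichlet variants, using toy counts rather than real index statistics:

```python
def linear(count_w_d, doc_len, p_w_c, lam):
    """Jelinek-Mercer (linear) smoothing: (1 - lam) * P(w|D) + lam * P(w|C)."""
    return (1 - lam) * (count_w_d / doc_len) + lam * p_w_c

def dirichlet(count_w_d, doc_len, p_w_c, mu):
    """Dirichlet smoothing: (c(w;D) + mu * P(w|C)) / (|D| + mu)."""
    return (count_w_d + mu * p_w_c) / (doc_len + mu)

# Toy example: term occurs 2 times in a 100-word document, P(w|C) = 0.001
print(linear(2, 100, 0.001, 0.4))      # 0.6 * 0.02 + 0.4 * 0.001
print(dirichlet(2, 100, 0.001, 1000))  # (2 + 1000 * 0.001) / (100 + 1000)
```

Note how both keep P(w|θ_D) nonzero for unseen terms, which is the idf-like effect mentioned in the ranking summary.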
  • 43. Use RetEval for TF.IDF First run ParseToFile to convert document-formatted queries into a query file <parameters> <docFormat> format </docFormat> <outputFile> filename </outputFile> <stemmer> stemmername </stemmer> <stopwords> stopwordfile </stopwords> </parameters> ParseToFile paramfile queryfile http://www.lemurproject.org/lemur/parsing.html#parsetofile
  • 44. Use RetEval for TF.IDF Then run RetEval <parameters> <index> index </index> <retModel> 0 </retModel> // 0 for TF-IDF, 1 for Okapi, // 2 for KL-divergence, // 5 for cosine similarity <textQuery> queries.reteval </textQuery> <resultCount> 1000 </resultCount> <resultFile> tfidf.res </resultFile> </parameters> RetEval paramfile queryfile http://www.lemurproject.org/lemur/retrieval.html#RetEval
  • 45. Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
  • 46. Indri Query Language terms field restriction / evaluation numeric combining beliefs field / passage retrieval filters document priors http://www.lemurproject.org/lemur/IndriQueryLanguage.html
  • 47. Term Operations name example behavior term dog occurrences of dog (Indri will stem and stop) “ term” “ dog” occurrences of dog (Indri will not stem or stop) ordered window #od n (blue car) blue n words or less before car unordered window #ud n (blue car) blue within n words of car synonym list #syn(car automobile) occurrences of car or automobile weighted synonym #wsyn(1.0 car 0.5 automobile) like synonym, but only counts occurrences of automobile as 0.5 of an occurrence any operator #any:person all occurrences of the person field
  • 48. Field Restriction/Evaluation name example behavior restriction dog.title counts only occurrences of dog in title field dog.title,header counts occurrences of dog in title or header evaluation dog.(title) builds belief b (dog) using title language model dog.(title,header) b (dog) estimated using language model from concatenation of all title and header fields #od1(trevor strohman).person(title) builds a model from all title text for b (#od1(trevor strohman).person) - only counts “ trevor strohman ” occurrences in person fields
  • 49. Numeric Operators name example behavior less #less(year 2000) occurrences of year field < 2000 greater #greater(year 2000) year field > 2000 between #between(year 1990 2000) 1990 < year field < 2000 equals #equals(year 2000) year field = 2000
  • 50. Belief Operations name example behavior combine #combine(dog train) 0.5 log( b (dog) ) + 0.5 log( b (train) ) weight, wand #weight(1.0 dog 0.5 train) 0.67 log( b (dog) ) + 0.33 log( b (train) ) wsum #wsum(1.0 dog 0.5 dog.(title)) log( 0.67 b (dog) + 0.33 b (dog.(title)) ) not #not(dog) log( 1 - b (dog) ) max #max(dog train) returns maximum of b (dog) and b (train) or #or(dog cat) log(1 - (1 - b (dog) ) * (1 - b (cat) ))
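The log-space scores in the table can be reproduced directly; note (per the editor's note on slide 51) that #weight normalizes its weights to sum to 1. A sketch using hypothetical belief values:

```python
import math

def combine(beliefs):
    """#combine: mean of log beliefs, e.g. 0.5 log b(dog) + 0.5 log b(train)."""
    return sum(math.log(b) for b in beliefs) / len(beliefs)

def weight(weighted_beliefs):
    """#weight: weighted sum of log beliefs, weights normalized to sum to 1."""
    total = sum(w for w, _ in weighted_beliefs)
    return sum((w / total) * math.log(b) for w, b in weighted_beliefs)

# #weight(1.0 dog 0.5 train) with b(dog)=0.2, b(train)=0.1:
# weights normalize to 0.67 and 0.33, matching the table above
print(weight([(1.0, 0.2), (0.5, 0.1)]))
```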
  • 51. Field/Passage Retrieval name example behavior field retrieval #combine[title]( query ) return only title fields ranked according to #combine(query) - beliefs are estimated on each title ’s language model -may use any belief node passage retrieval #combine[passage200:100]( query ) dynamically created passages of length 200 created every 100 words are ranked by #combine(query)
  • 52. More Field/Passage Retrieval .//field for ancestor .\field for parent example behavior #combine[section]( bootstrap #combine[./title]( methodology )) Rank sections matching bootstrap where the section’s title also matches methodology
  • 53. Filter Operations name example behavior filter require #filreq(elvis #combine(blue shoes)) rank documents that contain elvis by #combine(blue shoes) filter reject #filrej(shopping #combine(blue shoes)) rank documents that do not contain shopping by #combine(blue shoes)
  • 54. Document Priors RECENT prior built using makeprior application name example behavior prior #combine(#prior(RECENT) global warming) treated as any belief during ranking RECENT prior could give higher scores to more recent documents
  • 55. Ad Hoc Retrieval Query likelihood #combine( literacy rates africa ) Rank by P(Q|D) = Π_q P(q|D)
  • 56. Query Expansion #weight( 0.75 #combine( literacy rates africa ) 0.25 #combine( additional terms ))
  • 57. Known Entity Search Mixture of multinomials #combine( #wsum( 0.5 bbc.(title) 0.3 bbc.(anchor) 0.2 bbc ) #wsum( 0.5 news.(title) 0.3 news.(anchor) 0.2 news ) ) P(q|D) = 0.5 P(q|title) + 0.3 P(q|anchor) + 0.2 P(q|news)
  • 58. Overview Background The Toolkit Language Modeling in Information Retrieval Basic application usage Building an index Running queries Evaluating results Indri query language Coffee break
  • 59. Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
  • 60. Indexing Your Data PDF, Word documents, PowerPoint, HTML Use IndriBuildIndex to index your data directly TREC collection Use IndriBuildIndex or BuildIndex Large text corpus Many different options
  • 61. Indexing Text Corpora Split data into one XML file per document Pro: Easiest option Pro: Use any language you like (Perl, Python) Con: Not very efficient For efficiency, large files are preferred Small files cause internal filesystem fragmentation Small files are harder to open and read efficiently
  • 62. Indexing: Offset Annotation Tag data does not have to be in the file Add extra tag data using an offset annotation file Format: docno type id name start length value parent Example: DOC001 TAG 1 title 10 50 0 0 “Add a title tag to DOC001 starting at byte 10 and continuing for 50 bytes”
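One record of the annotation file can be generated as below; the id, value, and parent columns default to 1, 0, and 0 here to match the slide's example, and the function name is hypothetical:

```python
def annotation_line(docno, name, start, length, ann_type="TAG",
                    ann_id=1, value=0, parent=0):
    """Emit one offset-annotation record:
    docno type id name start length value parent"""
    return f"{docno} {ann_type} {ann_id} {name} {start} {length} {value} {parent}"

print(annotation_line("DOC001", "title", 10, 50))
```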
  • 63. Indexing Text Corpora Format data in TREC format Pro: Almost as easy as individual XML docs Pro: Use any language you like Con: Not great for online applications Direct news feeds Data comes from a database
  • 64. Indexing Text Corpora Write your own parser Pro: Fast Pro: Best flexibility, both in integration and in data interpretation Con: Hardest option Con: Smallest language choice (C++ or Java)
  • 65. Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
  • 66. ParsedDocument struct ParsedDocument { const char* text; size_t textLength; const char* content; size_t contentLength; indri::utility::greedy_vector<char*> terms; indri::utility::greedy_vector<indri::parse::TagExtent*> tags; indri::utility::greedy_vector<indri::parse::TermExtent> positions; indri::utility::greedy_vector<indri::parse::MetadataPair> metadata; };
  • 67. ParsedDocument: Text const char* text; size_t textLength; A null-terminated string of document text Text is compressed and stored in the index for later use (such as snippet generation)
  • 68. ParsedDocument: Content const char* content; size_t contentLength; A string of document text This is a substring of text; this is used in case the whole text string is not the core document For instance, maybe the text string includes excess XML markup, but the content section is the primary text
  • 69. ParsedDocument: Terms indri::utility::greedy_vector<char*> terms; document = “My dog has fleas.” terms = { “My”, “dog”, “has”, “fleas” } A list of terms in the document Order matters – word order will be used in term proximity operators A greedy_vector is effectively an STL vector with a different memory allocation policy
  • 70. ParsedDocument: Terms indri::utility::greedy_vector<char*> terms; Term data will be normalized (downcased, some punctuation removed) later Stopping and stemming can be handled within the indexer Parser’s job is just tokenization
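The division of labor on this slide (the parser only tokenizes; normalization happens later in the indexer) can be illustrated with a toy tokenizer and normalizer. This is an illustrative sketch, not Indri's actual parsing code:

```python
import re

def tokenize(text):
    """Parser's job: split raw text into tokens, keeping original case."""
    return re.findall(r"[A-Za-z0-9.']+", text)

def normalize(tokens):
    """Indexer's later pass: downcase and strip punctuation from tokens."""
    return [re.sub(r"[^a-z0-9]", "", t.lower()) for t in tokens]

tokens = tokenize("My dog has fleas.")
print(tokens)             # ['My', 'dog', 'has', 'fleas.']
print(normalize(tokens))  # ['my', 'dog', 'has', 'fleas']
```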
  • 71. ParsedDocument: Tags indri::utility::greedy_vector<indri::parse::TagExtent*> tags; TagExtent: const char* name; unsigned int begin; unsigned int end; INT64 number; TagExtent *parent; greedy_vector<AttributeValuePair> attributes;
  • 72. ParsedDocument: Tags name The name of the tag begin, end Word offsets (relative to content) of the beginning and end of the tag. My <animal> dirty dog </animal> has fleas. name = “animal”, begin = 2, end = 3
  • 73. ParsedDocument: Tags number A numeric component of the tag (optional) sample document This document was written in <year> 2006 </year> . sample query #between( year 2005 2007 )
  • 74. ParsedDocument: Tags parent The logical parent of the tag <doc> <par> <sent> My dog still has fleas. </sent> <sent> My cat does not have fleas. </sent> </par> </doc>
  • 75. ParsedDocument: Tags attributes Attributes of the tag My <a href=“index.html”> home page </a> . Note: Indri cannot index tag attributes. They are used for conflation and extraction purposes only.
  • 77. ParsedDocument: Metadata Metadata is text about a document that should be kept, but not indexed: TREC Document ID (WTX001-B01-00) Document URL Crawl date greedy_vector<indri::parse::MetadataPair> metadata
  • 78. Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
  • 79. Tag Conflation <ENAMEX TYPE=“ORGANIZATION”> <ORGANIZATION> <ENAMEX TYPE=“PERSON”> <PERSON>
  • 80. Indexing Fields Parameters: Name : name of the XML tag, all lowercase Numeric : whether this field can be retrieved using the numeric operators, like #between and #less Forward : true if this field should be efficiently retrievable given the document number See QueryEnvironment::documentMetadata Backward : true if this document should be retrievable given this field data See QueryEnvironment::documentsFromMetadata
  • 81. Indexing Fields <parameters> <field> <name>title</name> <backward>true</backward> </field> <field> <name>gradelevel</name> <numeric>true</numeric> </field> </parameters>
  • 82. Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
  • 83. dumpindex dumpindex is a versatile and useful tool Use it to explore your data Use it to verify the contents of your index Use it to extract information from the index for use outside of Lemur
  • 84. dumpindex Extracting the vocabulary % dumpindex ap89 v TOTAL 39192948 84678 the 2432559 84413 of 1063804 83389 to 1006760 82505 a 898999 82712 and 877433 82531 in 873291 82984 said 505578 76240 word term_count doc_count
  • 85. dumpindex Extracting a single term % dumpindex ap89 tp ogilvie ogilvie ogilvie 8 39192948 6056 1 1027 954 11982 1 619 377 15775 1 155 66 45513 3 519 216 275 289 55132 1 668 452 65595 1 514 315 document, count, positions term, stem, count, total_count
  • 86. dumpindex Extracting a document % dumpindex ap89 dt 5 <DOCNO> AP890101-0005 </DOCNO> <FILEID>AP-NR-01-01-89 0113EST</FILEID> … <TEXT> The Associated Press reported erroneously on Dec. 29 that Sen. James Sasser, D-Tenn., wrote a letter to the chairman of the Federal Home Loan Bank Board, M. Danny Wall… </TEXT>
  • 87. dumpindex Extracting a list of expression matches % dumpindex ap89 e “#1(my dog)” #1(my dog) #1(my dog) 0 0 8270 1 505 507 8270 1 709 711 16291 1 789 791 17596 1 672 674 35425 1 432 434 46265 1 777 779 51954 1 664 666 81574 1 532 534 document, weight, begin, end
  • 88. Overview (part 2) Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
  • 89. Introducing the API Lemur “Classic” API Many objects, highly customizable May want to use this when you want to change how the system works Support for clustering, distributed IR, summarization Indri API Two main objects Best for integrating search into larger applications Supports Indri query language, XML retrieval, “live” incremental indexing, and parallel retrieval
  • 90. Indri: IndexEnvironment Most of the time, you will index documents with IndriBuildIndex Using this class is necessary if: you build your own parser, or you want to add documents to an index while queries are running Can be used from C++ or Java
  • 91. Indri: IndexEnvironment Most important methods: addFile: adds a file of text to the index addString: adds a document (in a text string) to the index addParsedDocument: adds a ParsedDocument structure to the index setIndexedFields: tells the indexer which fields to store in the index
  • 92. Indri: QueryEnvironment The core of the Indri API Includes methods for: Opening indexes and connecting to query servers Running queries Collecting collection statistics Retrieving document text Can be used from C++, Java, PHP or C#
  • 93. QueryEnvironment: Opening Opening methods: addIndex: opens an index from the local disk addServer: opens a connection to an Indri daemon (IndriDaemon or indrid) Indri treats all open indexes as a single collection Query results will be identical to those you’d get by storing all documents in a single index
  • 94. QueryEnvironment: Running Running queries: runQuery: runs an Indri query, returns a ranked list of results (can add a document set in order to restrict evaluation to a few documents) runAnnotatedQuery: returns a ranked list of results and a list of all document locations where the query matched something
  • 95. QueryEnvironment: Retrieving Retrieving document text: documents: returns the full text of a set of documents documentMetadata: returns portions of the document (e.g. just document titles) documentsFromMetadata: returns documents that contain a certain bit of metadata (e.g. a URL) expressionList: an inverted list for a particular Indri query language expression
  • 96. Lemur “Classic” API Primarily useful for retrieval operations Most indexing work in the toolkit has moved to the Indri API Indri indexes can be used with Lemur “Classic” retrieval applications Extensive documentation and tutorials on the website (more are coming)
  • 97. Lemur Index Browsing The Lemur API gives access to the index data (e.g. inverted lists, collection statistics) IndexManager::openIndex Returns a pointer to an index object Detects what kind of index you wish to open, and returns the appropriate kind of index class docInfoList (inverted list), termInfoList (document vector), termCount, documentCount
  • 98. Lemur Index Browsing Index::term term( char* s ) : convert term string to a number term( int id ) : convert term number to a string Index::document document( char* s ) : convert doc string to a number document( int id ) : convert doc number to a string
  • 99. Lemur Index Browsing Index::termCount termCount() : Total number of terms indexed termCount( int id ) : Total number of occurrences of term number id . Index::documentCount docCount() : Number of documents indexed docCount( int id ) : Number of documents that contain term number id .
  • 100. Lemur Index Browsing Index::docLength( int docID ) The length, in number of terms, of document number docID . Index::docLengthAvg Average indexed document length Index::termCountUnique Size of the index vocabulary
  • 102. Lemur: DocInfoList Index::docInfoList( int termID ) Returns an iterator to the inverted list for termID . The list contains all documents that contain termID, including the positions where termID occurs.
  • 103. Lemur: TermInfoList Index::termInfoList( int docID ) Returns an iterator to the direct list for docID . The list contains term numbers for every term contained in document docID, and the number of times each word occurs. (use termInfoListSeq to get word positions)
  • 104. Lemur Retrieval Class Name Description TFIDFRetMethod BM25 SimpleKLRetMethod KL-Divergence InQueryRetMethod Simplified InQuery CosSimRetMethod Cosine CORIRetMethod CORI OkapiRetMethod Okapi IndriRetMethod Indri (wraps QueryEnvironment)
  • 105. Lemur Retrieval RetMethodManager::runQuery query: text of the query index: pointer to a Lemur index modeltype: “cos”, “kl”, “indri”, etc. stopfile: filename of your stopword list stemtype: stemmer datadir: not currently used func: only used for Arabic stemmer
  • 106. Lemur: Other tasks Clustering: ClusterDB Distributed IR: DistMergeMethod Language models: UnigramLM, DirichletUnigramLM, etc.
  • 107. Getting Help http://www.lemurproject.org Central website, tutorials, documentation, news http://www.lemurproject.org/phorum Discussion board, developers read and respond to questions http://ciir.cs.umass.edu/~strohman/indri My own page of Indri tips README file in the code distribution
  • 108. Concluding: In Review Paul About the toolkit About Language Modeling, IR methods Indexing a TREC collection Running TREC queries Interpreting query results
  • 109. Concluding: In Review Trevor Indexing your own data Using ParsedDocument Indexing document fields Using dumpindex Using the Indri and classic Lemur APIs Getting help
  • 110. Questions Ask us questions! What is the best way to do x ? How do I get started with my particular task? Does the toolkit have the x feature? How can I modify the toolkit to do x ? When do we get coffee?

Editor's Notes

  • #15: More information can be found at: http://www.lemurproject.org/tutorials/begin_indexing-1.html
  • #16: Blue text indicates settings specific to Indri applications. Refer to online documentation for other Lemur application parameters.
  • #18: Could also work with plain text documents
  • #21: Put link into web page for document classes
  • #31: Run a simple query now from the command line
  • #38: Command line options –runID=runName
  • #42: Run canned queries now! Evaluate using trec_eval
  • #44: <parameters> <docFormat>web</docFormat> <outputFile>queries.reteval</outputFile> <stemmer>krovetz</stemmer> </parameters>
  • #45: <parameters> <index>index</index> <retModel>0</retModel> <textQuery>queries.reteval</textQuery> <resultCount>1000</resultCount> <resultFile>tfidf.res</resultFile> </parameters>
  • #51: Indri normalizes weights to sum to 1.
  • #53: Better example and mention field hierarchy at index time