Intro to Apache Lucene and Solr

Introduction to Open Source Search with Apache Lucene and Solr
Grant Ingersoll

The How Many Game
How many of you:
- Have taken a class in Information Retrieval (IR)?
- Are doing work/research in IR?
- Have heard of or are using Lucene?
- Have heard of or are using Solr?
- Are doing work on core IR algorithms such as compression techniques or scoring?
- Are doing UI/Application work/research as they relate to search?

Topics
- Brief Bio
- Search 101 (skip?)
- What is:
  - Apache Lucene
  - Apache Solr
- What can they do?
  - Features and functionality
  - Intangibles
- What’s new in Lucene and Solr?
- How can they help my research/work/____?

Brief Bio
- Apache Lucene/Solr Committer
- Apache Mahout co-founder
  - Scalable Machine Learning
- Co-founder of Lucid Imagination
  - http://www.lucidimagination.com
- Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy
- Co-Author of upcoming “Taming Text” (Manning Publications)
  - http://www.manning.com/ingersoll

Search 101
- Search tools are designed for dealing with fuzzy data/questions
  - Work well with structured and unstructured data
  - Perform well when dealing with large volumes of data
  - Many apps don’t need the limits that databases place on content
  - Search fits well alongside a DB too
- Given a user’s information need (query), find and, optionally, score content relevant to that need
  - Many different ways to solve this problem, each with tradeoffs
- What does “relevant” mean?

Search 101
- Relevance
  - Vector Space Model (VSM) for relevance
    - Common across many search engines
    - Apache Lucene is a highly optimized implementation of the VSM
- Indexing
  - Finds and maps terms and documents
    - Conceptually similar to a book index
  - At the heart of fast search/retrieve
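
The two ideas on this slide, an index that maps terms to documents and a vector-space relevance score, can be sketched in a few lines of Python. This is a toy illustration, not Lucene's implementation: the corpus, the function names, and the smoothed IDF formula are all assumptions made for the example, and query-length normalization is left out because it does not change the ranking.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; ids and text are made up for illustration.
DOCS = {
    1: "lucene is a search library",
    2: "solr is a search server built on lucene",
    3: "mahout is a machine learning library",
}

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, num_docs):
    """Smoothed inverse document frequency (an assumed variant): rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (1 + len(index.get(term, {}))))

def doc_norms(index, num_docs):
    """Euclidean length of each document's tf-idf vector."""
    squares = defaultdict(float)
    for term, postings in index.items():
        weight = idf(index, term, num_docs)
        for doc_id, tf in postings.items():
            squares[doc_id] += (tf * weight) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}

def search(index, norms, query, num_docs):
    """Rank documents by the dot product of query and document vectors,
    divided by the document vector length (cosine-style scoring)."""
    scores = defaultdict(float)
    for term in query.split():
        weight = idf(index, term, num_docs)
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += (tf * weight) * weight
    ranked = [(doc_id, score / norms[doc_id]) for doc_id, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

index = build_index(DOCS)
norms = doc_norms(index, len(DOCS))
results = search(index, norms, "lucene search", len(DOCS))
```

Only documents that appear in the postings for a query term are ever scored, which is why the index sits "at the heart of fast search/retrieve": the cost depends on the number of matching postings, not on the size of the collection.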

Apache Lucene in a Nutshell
- http://lucene.apache.org/java
- Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
- Fast and efficient scoring and indexing algorithms
- Lots of contributions to make common tasks easier:
  - Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
- Most widely deployed search library on the planet

Lucene Basics
- Content is modeled via Documents and Fields
  - Content can be text, integers, floats, dates, custom
  - Analysis can be employed to alter content before indexing
- Searches are supported through a wide range of Query options
  - Keyword
  - Terms
  - Phrases
  - Wildcards
  - Many, many more
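
To make the document/field model and the role of analysis concrete, here is a minimal Python sketch. It does not use Lucene's API; the stop-word list, the `make_document` helper, and the three matcher functions are hypothetical names chosen for illustration, and real analyzers do considerably more (stemming, position tracking, and so on).

```python
import re
from fnmatch import fnmatchcase

STOP_WORDS = {"a", "an", "the", "of", "and", "is"}  # hypothetical stop list

def analyze(text):
    """Toy analysis chain: lowercase, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def make_document(**fields):
    """A document is a set of named fields; each field stores its analyzed tokens."""
    return {name: analyze(str(value)) for name, value in fields.items()}

def term_match(doc, field, term):
    """Term query: the exact token occurs in the field."""
    return term in doc.get(field, [])

def phrase_match(doc, field, phrase):
    """Phrase query: the analyzed phrase occurs as consecutive tokens."""
    tokens, wanted = doc.get(field, []), analyze(phrase)
    n = len(wanted)
    return n > 0 and any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

def wildcard_match(doc, field, pattern):
    """Wildcard query: any token matches a glob pattern such as 'luc*'."""
    return any(fnmatchcase(token, pattern) for token in doc.get(field, []))

doc = make_document(
    title="The Art of Search",
    body="Lucene is a fast, Java-based search library",
)
```

Because the same `analyze` function runs on both the indexed text and the query phrase, "art of search" matches the title even though "of" was dropped at index time. Keeping index-time and query-time analysis consistent is the point the sketch is meant to show.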

Apache Solr in a Nutshell
- http://lucene.apache.org/solr
- Lucene-based Search Server + other features and functionality
- Access Lucene over HTTP:
  - Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
- Most programming tasks in Lucene are configuration tasks in Solr
- Faceting (guided navigation, filters, etc.)
- Replication and distributed search support
- Lucene Best Practices
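
Since Solr is reached over HTTP, a search is just a URL with query parameters. The sketch below only builds such a URL with the Python standard library and does not contact a server. The `q`, `wt`, `rows`, `facet`, and `facet.field` parameters are standard Solr request parameters; the base URL, the `/select` handler path for a default example install, the `cat` field, and the `build_select_url` helper are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr select URL asking for JSON output and optional field facets."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Assumed local example server and a hypothetical "cat" (category) field.
url = build_select_url("http://localhost:8983/solr", "ipod", facet_fields=["cat"])
```

Any client that can issue an HTTP GET and parse the response format can use the resulting URL, which is what makes the language list on the slide possible.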

A small sampling of Lucene/Solr-Powered Sites
- Buy.com

Quick Solr/Lucene Demo
- Pre-reqs: Apache Ant 1.7.x, Subversion (SVN)
- Command Line 1:
    svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
    cd solr-trunk/solr/
    ant example
    cd example
    java -Dsolr.clustering.enabled=true -jar start.jar
- Command Line 2:
    cd exampledocs; java -jar post.jar *.xml
- http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Other Features
- Data Import Handler
  - Database, Mail, RSS, etc.
- Rich document support via Apache Tika
  - PDF, MS Office, Images, etc.
- Replication for high query volume
- Distributed search for large indexes
  - Production systems with 1B+ documents
- Configurable Analysis chain and other extension points
  - Total control over tokenization, stemming, etc.

Intangibles
- Open Source
  - Flexible, non-restrictive license
    - Apache License v2 - non-viral
    - “Do what you want with the software, just don’t claim you wrote it”
- Large community willing to help
- Great place to learn about real world IR systems
- Many books and other documentation
  - Lucene in Action by Hatcher, McCandless and Gospodnetic

What’s New?
- https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
- https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
- Codecs
  - Pluggable index formats
  - Provide different index compression techniques
- Stats to enable alternate scoring approaches
  - BM25, Lang. Modeling, etc.
  - More work to be done here
- Faster
  - Java Strings are slow; convert to use byte arrays
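
As an illustration of why index-level statistics matter for alternate scoring, here is a sketch of the widely used Okapi BM25 term score in Python. It is not code from Lucene; the function name is hypothetical, the IDF form shown is one common variant, and `k1=1.2`, `b=0.75` are conventional defaults rather than values taken from the slides.

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 contribution of one query term to one document's score.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len / avg_doc_len: this document's length and the collection average
    """
    # One common non-negative IDF variant.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Term frequency saturates, and longer documents are penalized via b.
    length_factor = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_factor)

# Hypothetical statistics for a single term in two documents of different length.
short_doc = bm25_term_score(tf=2, df=10, num_docs=1000, doc_len=50, avg_doc_len=100)
long_doc = bm25_term_score(tf=2, df=10, num_docs=1000, doc_len=400, avg_doc_len=100)
```

Note that the formula needs the average document length across the collection, a statistic that a pure TF-IDF scorer never has to look up. That is the kind of stat the slide refers to.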

Other New Items
- Many new Analyzers (tokenizers, etc.)
  - Richer language support (Hindi, Indonesian, Arabic, …)
- Richer Geospatial (Local) Search capabilities
  - Score, filter, sort by distance
  - http://wiki.apache.org/solr/SpatialSearch
- Results Grouping
  - Group related results
  - http://wiki.apache.org/solr/FieldCollapsing
- More Faceting Capabilities
  - Pivot
  - New underlying algorithms
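
"Filter and sort by distance" can be pictured with a short Python sketch based on the haversine great-circle formula. This is a conceptual illustration only and says nothing about how Solr implements spatial search; the place list, coordinates, and the `nearby` helper are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(docs, lat, lon, max_km):
    """Filter documents to those within max_km, then sort nearest first."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d) for d in docs]
    hits = [(dist, d) for dist, d in hits if dist <= max_km]
    return sorted(hits, key=lambda hit: hit[0])

# Hypothetical documents with approximate city coordinates.
PLACES = [
    {"name": "los angeles", "lat": 34.05, "lon": -118.24},
    {"name": "new york", "lat": 40.71, "lon": -74.01},
    {"name": "syracuse", "lat": 43.05, "lon": -76.15},
]

hits = nearby(PLACES, lat=43.0, lon=-76.1, max_km=500)
```

Scoring by distance would follow the same pattern, with the computed distance folded into the relevance score instead of being used only as a sort key.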

Job Trends
- http://www.indeed.com

Other Things that Can Help
- Nutch
  - Crawling
  - http://nutch.apache.org
- Mahout
  - Machine learning (clustering, classification, others)
  - http://mahout.apache.org
- OpenNLP
  - Part of Speech, Parsers, Named Entity Recognition
  - http://incubator.apache.org/opennlp
- Open Relevance Project
  - Relevance Judgments
  - http://lucene.apache.org/openrelevance

Resources
- http://lucene.apache.org
- http://www.lucidimagination.com
- {java-user|solr-user}@lucene.apache.org
- @gsingers
- http://www.slideshare.net/gsingers
- grant@lucidimagination.com
