Introduction to Open Source Search with Apache Lucene and Solr
Grant Ingersoll
The How Many Game
How many of you:
  - Have taken a class in Information Retrieval (IR)?
  - Are doing work/research in IR?
  - Have heard of or are using Lucene?
  - Have heard of or are using Solr?
  - Are doing work on core IR algorithms such as compression techniques or scoring?
  - Are doing UI/Application work/research as they relate to search?
Topics
  - Brief Bio
  - Search 101 (skip?)
  - What is:
    - Apache Lucene
    - Apache Solr
  - What can they do?
    - Features and functionality
    - Intangibles
  - What’s new in Lucene and Solr?
  - How can they help my research/work/____?
Brief Bio
  - Apache Lucene/Solr Committer
  - Apache Mahout co-founder
    - Scalable Machine Learning
  - Co-founder of Lucid Imagination
    - http://www.lucidimagination.com
  - Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy
  - Co-Author of upcoming “Taming Text” (Manning Publications)
    - http://www.manning.com/ingersoll
Search 101
  - Search tools are designed for dealing with fuzzy data/questions
    - Work well with structured and unstructured data
    - Perform well when dealing with large volumes of data
    - Many apps don’t need the limits that databases place on content
    - Search fits well alongside a DB too
  - Given a user’s information need (query), find and, optionally, score content relevant to that need
    - Many different ways to solve this problem, each with tradeoffs
  - What does “relevant” mean?
Search 101
  - Relevance
    - Vector Space Model (VSM) for relevance
      - Common across many search engines
      - Apache Lucene is a highly optimized implementation of the VSM
  - Indexing
    - Finds and maps terms and documents
      - Conceptually similar to a book index
    - At the heart of fast search/retrieve
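To make the two ideas on this slide concrete, here is a minimal sketch of an inverted index (the "book index" that maps terms to documents) with Vector Space Model ranking on top of it. It is an illustration only and not Lucene's implementation: the toy corpus, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the texts and ids are made up for illustration.
DOCS = {
    1: "lucene is a search library",
    2: "solr is a search server built on lucene",
    3: "a book index maps terms to pages",
}


def tokenize(text):
    """Lowercase and split on whitespace (a real analyzer does much more)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}, much like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def idf(index, term, n_docs):
    """Smoothed inverse document frequency: rarer terms weigh more."""
    df = len(index.get(term, {}))
    return math.log((1 + n_docs) / (1 + df)) + 1.0


def doc_norm(index, docs, doc_id):
    """Euclidean length of a document's TF-IDF vector."""
    counts = Counter(tokenize(docs[doc_id]))
    return math.sqrt(sum((tf * idf(index, t, len(docs))) ** 2
                         for t, tf in counts.items()))


def search(index, docs, query):
    """Rank documents by cosine similarity between query and document vectors."""
    n_docs = len(docs)
    q_weights = {t: tf * idf(index, t, n_docs)
                 for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    # Only the postings of the query terms are visited; this is why
    # the inverted index is at the heart of fast retrieval.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_w * tf * idf(index, term, n_docs)
    ranked = [(dot / (q_norm * doc_norm(index, docs, doc_id)), doc_id)
              for doc_id, dot in dots.items()]
    return sorted(ranked, reverse=True)


INDEX = build_inverted_index(DOCS)
for score, doc_id in search(INDEX, DOCS, "lucene search"):
    print(f"doc {doc_id}: {score:.3f}")
```

Documents that share no term with the query never appear in the result list, because their postings are never touched.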
Apache Lucene in a Nutshell
  - http://lucene.apache.org/java
  - Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
  - Fast and efficient scoring and indexing algorithms
  - Lots of contributions to make common tasks easier:
    - Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
  - Most widely deployed search library on the planet
Lucene Basics
  - Content is modeled via Documents and Fields
    - Content can be text, integers, floats, dates, custom
    - Analysis can be employed to alter content before indexing
  - Searches are supported through a wide range of Query options
    - Keyword
    - Terms
    - Phrases
    - Wildcards
    - Many, many more
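The document/field model, the analysis step, and a few of the query types listed above can be sketched in a few dozen lines. This is a toy model, not the Lucene API: the `TinyIndex` class, the `analyze` function, the stopword list, and the crude plural stripping are all assumptions made for illustration, and positions are counted after stopword removal as a simplification.

```python
import fnmatch
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "and"}  # illustrative list only


def analyze(text):
    """Toy analysis chain: tokenize, lowercase, drop stopwords, strip a plural 's'."""
    terms = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        token = token.lower()
        if token in STOPWORDS:
            continue
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]  # crude stand-in for real stemming
        terms.append(token)
    return terms


class TinyIndex:
    """Documents are dicts of field name -> text; each field is indexed separately."""

    def __init__(self):
        # (field, term) -> {doc_id: [positions of the term in that field]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, fields):
        for field, value in fields.items():
            for pos, term in enumerate(analyze(value)):
                self.postings[(field, term)][doc_id].append(pos)

    def term(self, field, word):
        """Term query: documents whose field contains the analyzed word."""
        terms = analyze(word)
        if not terms:
            return set()
        return set(self.postings.get((field, terms[0]), {}))

    def phrase(self, field, text):
        """Phrase query: the analyzed terms must occur at consecutive positions."""
        terms = analyze(text)
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get((field, terms[0]), {}).items():
            for start in starts:
                if all(start + i in self.postings.get((field, t), {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits

    def wildcard(self, field, pattern):
        """Wildcard query: shell-style pattern matched against indexed terms."""
        hits = set()
        for (f, term), docs in self.postings.items():
            if f == field and fnmatch.fnmatchcase(term, pattern.lower()):
                hits.update(docs)
        return hits


idx = TinyIndex()
idx.add(1, {"title": "Lucene in Action", "body": "Searching and indexing the books"})
idx.add(2, {"title": "Taming Text", "body": "Action items for text search"})
print(idx.term("title", "action"), idx.phrase("title", "lucene in action"),
      idx.wildcard("body", "search*"))
```

Because the same `analyze` function runs at index time and at query time, a query for "Books" finds a document that contains "book"; keeping the two sides consistent is the main design point of an analysis chain.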
Apache Solr in a Nutshell
  - http://lucene.apache.org/solr
  - Lucene-based Search Server + other features and functionality
  - Access Lucene over HTTP:
    - Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
  - Most programming tasks in Lucene are configuration tasks in Solr
  - Faceting (guided navigation, filters, etc.)
  - Replication and distributed search support
  - Lucene Best Practices
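Since Solr is reached over HTTP, a search with faceting is just a URL with query parameters. The sketch below only builds such a URL and does not send it. The parameter names (`q`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr request parameters, while the base URL, the `/select` handler path, and the field names `cat` and `manu` are assumptions about a local example install; `solr_select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr search URL; one facet.field parameter is emitted per facet."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Assumed local server and example field names.
url = solr_select_url("http://localhost:8983/solr", "ipod",
                      facet_fields=["cat", "manu"])
print(url)
```

With a server running, the URL could be fetched with `urllib.request.urlopen(url)` and the response parsed with the `json` module; any other language with an HTTP client would work the same way.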
A small sampling of Lucene/Solr-Powered Sites
  - Buy.com
Features and Functionality
Quick Solr/Lucene Demo
  - Pre-reqs:
    - Apache Ant 1.7.x, Subversion (SVN)
  - Command Line 1:
      svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
      cd solr-trunk/solr/
      ant example
      cd example
      java -Dsolr.clustering.enabled=true -jar start.jar
  - Command Line 2:
      cd exampledocs; java -jar post.jar *.xml
  - http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Other Features
  - Data Import Handler
    - Database, Mail, RSS, etc.
  - Rich document support via Apache Tika
    - PDF, MS Office, Images, etc.
  - Replication for high query volume
  - Distributed search for large indexes
    - Production systems with 1B+ documents
  - Configurable Analysis chain and other extension points
    - Total control over tokenization, stemming, etc.
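The core idea behind distributed search can be sketched as a merge step: each shard searches its own slice of the index and returns its best hits, and a coordinator combines those lists into one global ranking. The snippet below shows only that merge and is not Solr's actual implementation; the shard contents, document ids, and the `merge_shard_results` name are made up for illustration.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, k):
    """Merge per-shard hit lists (each sorted by descending score) into a global top k."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


# Hypothetical (score, doc_id) hits from two shards, best first.
shard_a = [(0.92, "a-17"), (0.40, "a-03")]
shard_b = [(0.88, "b-21"), (0.75, "b-09"), (0.10, "b-44")]

top3 = merge_shard_results([shard_a, shard_b], k=3)
print(top3)
```

The merge is lazy, so the coordinator reads only as many hits from each shard as it needs to fill the top k. Scores are comparable across shards only if every shard scores with consistent statistics, which is one of the practical difficulties of distributed search.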
Intangibles
  - Open Source
    - Flexible, non-restrictive license
      - Apache License v2 – non-viral
      - “Do what you want with the software, just don’t claim you wrote it”
  - Large community willing to help
    - Great place to learn about real world IR systems
  - Many books and other documentation
    - Lucene in Action by Hatcher, McCandless and Gospodnetic
What’s New?
  - https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
  - https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
  - Codecs
    - Pluggable Index Formats
    - Provide different index compression techniques
    - Stats to enable alternate scoring approaches
      - BM25, Lang. Modeling, etc. (more work to be done here)
  - Faster
    - Java Strings are slow; convert to use byte arrays
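BM25 is named above as one of the alternate scoring approaches that the new index statistics make possible. As a rough illustration of which statistics it needs (document frequency, document length, average document length), here is a sketch of one common textbook variant of the formula. It is not Lucene code; the `bm25_scores` name, the toy documents, and the default values `k1=1.2` and `b=0.75` are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in docs ({doc_id: list of tokens}) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalisation.
            length_factor = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_factor)
        scores[doc_id] = score
    return scores


# Made-up documents, already tokenized.
docs = {
    "d1": "lucene lucene search".split(),
    "d2": "search server".split(),
    "d3": "machine learning with mahout".split(),
}
print(bm25_scores(["search"], docs))
```

Unlike plain TF-IDF, repeated occurrences of a term yield diminishing returns, and for the same term count shorter documents score higher than longer ones.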
Other New Items
  - Many new Analyzers (tokenizers, etc.)
    - Richer Language support (Hindi, Indonesian, Arabic, …)
  - Richer Geospatial (Local) Search capabilities
    - Score, filter, sort by distance
    - http://wiki.apache.org/solr/SpatialSearch
  - Results Grouping
    - Group Related Results
    - http://wiki.apache.org/solr/FieldCollapsing
  - More Faceting Capabilities
    - Pivot
    - New underlying algorithms
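"Filter and sort by distance" comes down to computing a distance between the query point and each document's location. A minimal sketch using the haversine great-circle formula follows. It says nothing about how Solr's spatial support is implemented; the `nearby` helper is hypothetical and the place coordinates are approximate values included only as sample data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby(places, lat, lon, max_km):
    """Keep places within max_km of the point and sort them nearest first."""
    hits = [(haversine_km(lat, lon, p_lat, p_lon), name)
            for name, (p_lat, p_lon) in places.items()]
    return sorted(hit for hit in hits if hit[0] <= max_km)


# Approximate coordinates, for illustration only.
PLACES = {
    "Syracuse": (43.0481, -76.1474),
    "New York": (40.7128, -74.0060),
    "San Francisco": (37.7749, -122.4194),
}

for dist, name in nearby(PLACES, 43.0481, -76.1474, max_km=500):
    print(f"{name}: {dist:.0f} km")
```

The same distance value could also be folded into the relevance score, for example by boosting closer results, which corresponds to the "score by distance" case.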
How can Lucene/Solr help me?
Job Trends
  - http://www.indeed.com
Other Things that Can Help
  - Nutch
    - Crawling
    - http://nutch.apache.org
  - Mahout
    - Machine learning (clustering, classification, others)
    - http://mahout.apache.org
  - OpenNLP
    - Part of Speech, Parsers, Named Entity Recognition
    - http://incubator.apache.org/opennlp
  - Open Relevance Project
    - Relevance Judgments
    - http://lucene.apache.org/openrelevance
Resources
  - http://lucene.apache.org
  - http://www.lucidimagination.com
  - {java-user|solr-user}@lucene.apache.org
  - @gsingers
  - http://www.slideshare.net/gsingers
  - grant@lucidimagination.com
Toradex
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 

Intro to Apache Lucene and Solr

  • 1. Introduction to Open Source Search with Apache Lucene and Solr
      • Grant Ingersoll
  • 2. The How Many Game. How many of you:
      • Have taken a class in Information Retrieval (IR)?
      • Are doing work/research in IR?
      • Have heard of or are using Lucene?
      • Have heard of or are using Solr?
      • Are doing work on core IR algorithms such as compression techniques or scoring?
      • Are doing UI/Application work/research as they relate to search?
  • 3. Topics
      • Brief Bio
      • Search 101 (skip?)
      • What is: Apache Lucene, Apache Solr
      • What can they do? Features and functionality, Intangibles
      • What’s new in Lucene and Solr?
      • How can they help my research/work/____?
  • 4. Brief Bio
      • Apache Lucene/Solr Committer
      • Apache Mahout co-founder: Scalable Machine Learning
      • Co-founder of Lucid Imagination: http://www.lucidimagination.com
      • Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy
      • Co-Author of upcoming “Taming Text” (Manning Publications): http://www.manning.com/ingersoll
  • 5. Search 101
      • Search tools are designed for dealing with fuzzy data/questions
      • Works well with structured and unstructured data
      • Performs well when dealing with large volumes of data
      • Many apps don’t need the limits that databases place on content
      • Search fits well alongside a DB too
      • Given a user’s information need (query), find and, optionally, score content relevant to that need
      • Many different ways to solve this problem, each with tradeoffs
      • What’s “relevant” mean?
  • 6. Search 101
      • Relevance
          • Vector Space Model (VSM) for relevance
          • Common across many search engines
          • Apache Lucene is a highly optimized implementation of the VSM
      • Indexing
          • Finds and maps terms and documents
          • Conceptually similar to a book index
          • At the heart of fast search/retrieve
  • 7. Apache Lucene in a Nutshell
      • http://lucene.apache.org/java
      • Java based Application Programming Interface (API) for adding search and indexing functionality to applications
      • Fast and efficient scoring and indexing algorithms
      • Lots of contributions to make common tasks easier: Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
      • Most widely deployed search library on the planet
  • 8. Lucene Basics
      • Content is modeled via Documents and Fields
      • Content can be text, integers, floats, dates, custom
      • Analysis can be employed to alter content before indexing
      • Searches are supported through a wide range of Query options: Keyword, Terms, Phrases, Wildcards, and many, many more
  • 9. Apache Solr in a Nutshell
      • http://lucene.apache.org/solr
      • Lucene-based Search Server + other features and functionality
      • Access Lucene over HTTP: Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
      • Most programming tasks in Lucene are configuration tasks in Solr
      • Faceting (guided navigation, filters, etc.)
      • Replication and distributed search support
      • Lucene Best Practices
  • 10. A small sampling of Lucene/Solr-Powered Sites
      • Buy.com
  • 12. Quick Solr/Lucene Demo
      • Pre-reqs: Apache Ant 1.7.x, Subversion (SVN)
      • Command Line 1:
          • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
          • cd solr-trunk/solr/
          • ant example
          • cd example
          • java -Dsolr.clustering.enabled=true -jar start.jar
      • Command Line 2:
          • cd exampledocs; java -jar post.jar *.xml
      • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
  • 13. Other Features
      • Data Import Handler: Database, Mail, RSS, etc.
      • Rich document support via Apache Tika: PDF, MS Office, Images, etc.
      • Replication for high query volume
      • Distributed search for large indexes: production systems with 1B+ documents
      • Configurable Analysis chain and other extension points: total control over tokenization, stemming, etc.
  • 14. Intangibles
      • Open Source
      • Flexible, non-restrictive license: Apache License v2 – non-viral
      • “Do what you want with the software, just don’t claim you wrote it”
      • Large community willing to help
      • Great place to learn about real world IR systems
      • Many books and other documentation: Lucene in Action by Hatcher, McCandless and Gospodnetic
  • 15. What’s New?
      • https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt
      • https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt
      • Codecs
          • Pluggable Index Formats
          • Provide different index compression techniques
          • Stats to enable alternate scoring approaches: BM25, Lang. Modeling, etc. -- More work to be done here
      • Faster
          • Java Strings are slow; convert to use byte arrays
  • 16. Other New Items
      • Many new Analyzers (tokenizers, etc.)
      • Richer Language support (Hindi, Indonesian, Arabic, …)
      • Richer Geospatial (Local) Search capabilities
          • Score, filter, sort by distance
          • http://wiki.apache.org/solr/SpatialSearch
      • Results Grouping
          • Group Related Results
          • http://wiki.apache.org/solr/FieldCollapsing
      • More Faceting Capabilities
          • Pivot
          • New underlying algorithms
  • 19. Other Things that Can Help
      • Nutch: Crawling. http://nutch.apache.org
      • Mahout: Machine learning (clustering, classification, others). http://mahout.apache.org
      • OpenNLP: Part of Speech, Parsers, Named Entity Recognition. http://incubator.apache.org/opennlp
      • Open Relevance Project: Relevance Judgments. http://lucene.apache.org/openrelevance
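
Slide 6 describes indexing as mapping terms to documents, "conceptually similar to a book index", and relevance as a Vector Space Model. The sketch below is one way to picture both ideas in plain Python. It is not Lucene's implementation: the toy corpus, the whitespace tokenizer, and the particular TF-IDF weighting are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; ids and text are illustrative only.
DOCS = {
    1: "apache lucene is a search library",
    2: "apache solr is a search server built on lucene",
    3: "mahout is a machine learning library",
}

def tokenize(text):
    """Naive analysis step: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Rank documents by cosine similarity between TF-IDF vectors."""
    def idf(term):
        # One common smoothed IDF variant (an assumption, not Lucene's formula).
        return math.log(n_docs / len(index[term])) + 1.0

    # Squared length of every document vector over all indexed terms.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf(term)) ** 2

    # Query vector, restricted to terms the index knows about.
    q_weights = {t: tf * idf(t)
                 for t, tf in Counter(tokenize(query)).items() if t in index}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Only documents that share a term with the query get a score.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_w * tf * idf(term)

    ranked = [(doc_id, dot / (q_norm * math.sqrt(norms[doc_id])))
              for doc_id, dot in dots.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

INDEX = build_index(DOCS)
RESULTS = search(INDEX, len(DOCS), "lucene search")
```

The inverted index is what makes retrieval fast: the query only touches the postings for its own terms, so document 3 is never scored for "lucene search".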

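Slide 15 names BM25 as one of the alternate scoring approaches that index statistics can enable. As a rough illustration of which statistics such a scorer needs (document frequency, document count, average document length), here is a minimal sketch of the textbook BM25 formula. The function name, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions for the example and say nothing about how Lucene implements it.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Textbook BM25 score of one document for a bag of query terms."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in set(query_terms):
        tf = doc_terms.count(term)   # term frequency in this document
        df = doc_freqs.get(term, 0)  # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are damped.
        denom = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / denom
    return score

# Hypothetical, already-tokenized corpus.
CORPUS = [
    ["lucene", "search", "library"],
    ["solr", "search", "server", "built", "on", "lucene"],
    ["machine", "learning", "library"],
]

# The collection-level statistics the scorer depends on.
DOC_FREQS = {}
for doc in CORPUS:
    for term in set(doc):
        DOC_FREQS[term] = DOC_FREQS.get(term, 0) + 1
AVG_LEN = sum(len(doc) for doc in CORPUS) / len(CORPUS)

SCORES = [bm25_score(["lucene"], doc, DOC_FREQS, len(CORPUS), AVG_LEN)
          for doc in CORPUS]
```

Both of the first two documents contain "lucene" once, but the shorter one scores higher because of the length normalization term.
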
Editor's Notes

  • #12: Rather than talk you through a lot of the features and functionality, let me show you
  • #13: Do this. Example Queries: ipod; 184-pin DDR. Cover: Querying, scoring, faceting, clustering, function queries, spatial, grouping, more like this, indexing
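
The example queries in the notes above can also be sent to Solr over plain HTTP, which is the access path slide 9 describes. The sketch below only builds the request URLs and does not contact a server. The base URL `http://localhost:8983/solr`, the `select` handler, and the facet field `cat` are assumptions based on the stock Solr example setup; `q`, `wt`, `facet`, `facet.field`, and `debugQuery` are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed location of a locally running example server.
SOLR_BASE = "http://localhost:8983/solr"

def solr_url(query, handler="select", facet_field=None, debug=False):
    """Build a Solr query URL; nothing is sent over the network."""
    params = [("q", query), ("wt", "json")]
    if facet_field:
        # Ask Solr to count matching documents per value of this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if debug:
        # Include scoring explanations in the response.
        params.append(("debugQuery", "true"))
    return f"{SOLR_BASE}/{handler}?{urlencode(params)}"

# "cat" is a hypothetical facet field for this illustration.
URLS = [solr_url(q, facet_field="cat", debug=True)
        for q in ("ipod", "184-pin DDR")]
```

Passing the parameters through `urlencode` takes care of escaping, so the space in "184-pin DDR" becomes `+` in the URL and the query survives the round trip intact.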