What’s new in Lucene and Solr?
Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer
Sink or Swim?
Search is good for…
• Traditional: Fast, fuzzy text matching across a large document collection
• De-normalized data
– “light” relational
• Top N problems
– Key-value (top 1)
– Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL
What’s New?
• Community
• Lucene
• Solr
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
– Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
– Meritocracy – Those who do, decide!
• Always Be Testing
– Randomized system tests are all the rage
– http://vimeo.com/32087114
• Patches Welcome!
Acceleration!
Coming Soon: Lucene and Solr 4.8
Java 1.7
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
– Per segment
– FieldCache can be controlled to only load new segments
– Soft commit -- faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
– Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
– Much improved data structure
– … means less memory and less garbage collection effort
Lucene: Flexibility
• Flexible Index Formats
– New posting list codecs: Block, Simple Text, HDFS, etc.
– Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
• Pluggable Scoring
– Decoupled from TF/IDF
– Built-in alternatives include BM25, DFR, and others
• http://en.wikipedia.org/wiki/Okapi_BM25
• http://terrier.org/docs/v3.5/dfr_description.html
– Add your own
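To make the BM25 alternative concrete, here is a minimal sketch of the Okapi BM25 formula in plain Python. It is not Lucene's implementation: the function name, the whitespace tokenization, and the toy documents are assumptions for illustration, and the defaults k1=1.2 and b=0.75 are the commonly used values.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates as it grows, controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, tokenized on whitespace.
docs = [
    "solr is built on lucene".split(),
    "lucene is a search library".split(),
    "cats sleep all day".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

The second document matches both query terms and so outranks the first, while the third scores zero.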
FS(A|T)
• Keys:
– byte[] – write-once
– Linear-time build of minimal automata
– Compression, Reverse lookups
– Weights (used for auto-suggest)
– Pluggable Algebra
• Uses:
– Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
– FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
– http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
– “Smaller Representation of Finite State Automata”
• Proc. of the 16th Int. Conf. on Implementation and Application of Automata, CIAA 2011, vol. 6807, 2011, pp. 118-192.
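The idea behind automaton-driven fuzzy matching can be sketched in a simplified form: walk the term dictionary as a trie while carrying one row of the edit-distance table, and abandon any branch that can no longer match. This is a stand-in technique for illustration, not the Levenshtein automaton Lucene actually builds, and the function names and sample terms are assumptions.

```python
def build_trie(terms):
    """Build a nested-dict trie; the empty-string key marks the end of a term."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = term
    return root


def fuzzy_search(trie, query, max_edits):
    """Return sorted (term, distance) pairs within `max_edits` edits of `query`."""
    matches = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for character `ch`.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if "" in node and row[-1] <= max_edits:
            matches.append((node[""], row[-1]))
        # Prune: if every cell exceeds the budget, nothing below can match.
        if min(row) <= max_edits:
            for next_ch, child in node.items():
                if next_ch:
                    walk(child, next_ch, row)

    for ch, child in trie.items():
        if ch:
            walk(child, ch, first_row)
    return sorted(matches)


# Hypothetical term dictionary.
trie = build_trie(["lucene", "lucent", "lucid", "solr", "solar"])
print(fuzzy_search(trie, "lucene", 1))  # [('lucene', 0), ('lucent', 1)]
```

Because shared prefixes are visited once and hopeless branches are cut early, most of the dictionary is never compared against the query, which is the intuition behind the speed-up reported above.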
Grab Bag
• Lots of new suggesters
– Available in Solr
• Doc Values
– Column oriented store
– Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
– Now look a lot like Terms
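As a rough picture of what "column oriented" means for doc values, the sketch below contrasts a per-document record with a per-field array indexed by document ID. The field names and helper functions are hypothetical, and the in-place overwrite only models the idea of an updatable value; Lucene's on-disk handling is different.

```python
from array import array

# Row-oriented view: one record per document (similar in spirit to stored fields).
rows = [
    {"id": "a", "price": 30},
    {"id": "b", "price": 10},
    {"id": "c", "price": 20},
]

# Column-oriented view: one dense array per field, indexed by internal doc ID.
price_column = array("q", (row["price"] for row in rows))


def sort_doc_ids_by(column, doc_ids):
    """Sort matching doc IDs by reading only the one column that is needed."""
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])


def update_value(column, doc_id, value):
    """Change one document's value without touching its other fields."""
    column[doc_id] = value


print(sort_doc_ids_by(price_column, [0, 1, 2]))  # [1, 2, 0]
update_value(price_column, 0, 5)
print(sort_doc_ids_by(price_column, [0, 1, 2]))  # [0, 1, 2]
```

Sorting and faceting touch only the columns they need, which is why this layout suits those operations better than loading whole documents.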
Solr 4: New Features
• Search/Faceting/Relevance
– New Relevance Function Queries (tf, df, others)
– Pivot Faceting
– Pseudo-join
– Improved Spatial (more later)
– Full support for Lucene Codecs, pluggable scoring
• Indexing
– New Update Processors, including scripting option
– Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
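Cursor-based deep paging can be sketched as a loop that starts with cursorMark=* and keeps sending back the nextCursorMark from each response until the mark stops changing. In the sketch below, `fetch_page` is a hypothetical callable standing in for an HTTP request to Solr, and `make_fake_fetch` is an in-memory stand-in that uses plain offsets as marks purely for demonstration (real marks are opaque strings). The sort is assumed to include the unique key field.

```python
def iterate_with_cursor(fetch_page, query, sort, rows=100):
    """Yield every matching document by following Solr-style cursor marks.

    `fetch_page` is a hypothetical callable: it takes a dict of request
    parameters and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # "*" asks for the first page
    while True:
        response = fetch_page(
            {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        )
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        # The end is signalled by getting back the same mark that was sent.
        if next_cursor == cursor:
            break
        cursor = next_cursor


def make_fake_fetch(all_docs):
    """In-memory stand-in for a Solr endpoint, for demonstration only."""
    ordered = sorted(all_docs, key=lambda d: d["id"])

    def fetch(params):
        mark = params["cursorMark"]
        start = 0 if mark == "*" else int(mark)
        page = ordered[start:start + params["rows"]]
        return {
            "response": {"docs": page},
            "nextCursorMark": str(start + len(page)),
        }

    return fetch


docs = [{"id": str(i)} for i in range(5)]
ids = [d["id"] for d in iterate_with_cursor(make_fake_fetch(docs), "*:*", "id asc", rows=2)]
print(ids)  # ['0', '1', '2', '3', '4']
```

Unlike start/rows paging, the server does not need to collect and discard all earlier results for each deep page, so the cost per page stays roughly constant.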
Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
– "geo":"43.17614,-90.57341"
– "geo":"Circle(4.56,1.23 d=0.0710)"
– "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
– fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
– fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
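A filter like the bounding-box search above has to be URL-encoded before it goes into a request. The sketch below builds the same fq value and encodes it with the standard library. The helper name is hypothetical, and the coordinate order (min X, min Y, max X, max Y) is an assumption read off the example on the slide.

```python
from urllib.parse import urlencode


def intersects_box_filter(field, min_x, min_y, max_x, max_y):
    """Build an fq value matching shapes that intersect a bounding box."""
    # Assumed order: minX minY maxX maxY, as in the slide's example.
    return f'{field}:"Intersects({min_x} {min_y} {max_x} {max_y})"'


fq = intersects_box_filter("geo", -74.093, 41.042, -69.347, 44.558)
print(fq)  # geo:"Intersects(-74.093 41.042 -69.347 44.558)"

# Encode the filter together with a match-all query for use in a request URL.
params = urlencode({"q": "*:*", "fq": fq})
```

Letting urlencode handle the quotes, parentheses, and spaces avoids hand-escaping mistakes when the shape becomes a longer polygon.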
Scaling Solr
• Distributed/sharded indexing & search
– Auto distributes updates and queries to appropriate shards
– Near Real Time (NRT) indexing capable
– Document routing extensions
• Dynamically scalable
– New SolrCloud instances add indexing and query capacity
– Supports re-balancing (shard-splitting)
• Reliable
– No single point of failure
– Transactions logged
– Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
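Automatic distribution of updates rests on a simple idea: hash the document ID and give each shard a slice of the hash space, so any node can work out where a document belongs. The sketch below illustrates that idea only. CRC32 from the standard library is a stand-in for the hash function SolrCloud really uses, and the function name and even split are assumptions.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document ID to a shard by splitting a 32-bit hash space evenly."""
    # Stand-in hash with values in 0 .. 2**32 - 1; not SolrCloud's hash.
    hash_value = zlib.crc32(doc_id.encode("utf-8"))
    range_size = 2**32 // num_shards
    # Clamp so the few leftover hash values at the top land in the last shard.
    return min(hash_value // range_size, num_shards - 1)


# The same ID always routes to the same shard.
assignments = {doc_id: shard_for(doc_id, 4) for doc_id in ("doc-1", "doc-2", "doc-3")}
```

Because each shard owns a contiguous hash range, splitting a shard means dividing its range in two rather than rehashing every document in the collection.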
Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)
Go Deep!
APIs
• New APIs for Schema and Solr Config
– XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers
Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
Summary
• Lucene/Solr 4.x:
– Faster
– More Flexible
– Easier scaling than ever
– More reliable than ever
• Go forth and rank!
Resources
• Me
– grant@lucidworks.com
– @gsingers on Twitter
• LucidWorks
– http://www.lucidworks.com
– http://www.lucidworks.com/support-services/ask-the-experts/
