Solr 3.1 and Beyond

The Solr/Lucene community is hard at work designing and developing a range of new features and fixes for Apache Solr, advancing the frontiers of search. Solr creator Yonik Seeley will provide a preview survey of these developments, and talk about how one can leverage new functionality. Topics will include new faceting functionality, new function queries, increased scalability, field collapsing, and spatial search. The talk will span features already included in trunk, features slated for the next release, as well as incomplete features under consideration for future releases.


Solr 3.1 and Beyond
Yonik Seeley
Lucid Imagination
[email protected]
October 8, 2010

Agenda
Goal: Introduce new features you can try & use now in Solr development versions 3.1 or 4.0

  Relevancy (Extended Dismax Parser)


  Spatial/Geo Search

  Search Result Grouping / Field Collapsing

  Faceting (Pivot, Range, Per-segment)

  Scalability (Solr Cloud)

  Odds & Ends

  Q&A

10/12/10 3
Solr 3.1? What happened to 1.5?

  Lucene/Solr merged (March 2010)


  Single set of committers
  Single dev mailing list ([email protected])
  Single shared subversion trunk
  Keep separate downloads, user mailing lists
  Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
  Development
  trunk is now always next major release (currently 4.0)
  branch_3x will be base for all 3.x releases
  Branch together, Release together, Share version numbers
RELEVANCE
Extended Dismax Parser
  Superset of dismax
&defType=edismax&q=foo&qf=body  
  Fixes edge cases where dismax could still throw exceptions
OR  AND  NOT  -  "
  Full lucene syntax support
  Tries lucene syntax first
  Smart escaping is done if syntax errors
  Optionally supports treating “and”/“or” as AND/OR in lucene syntax
  Fielded queries (e.g. myfield:foo) even in degraded mode
  uf parameter controls what field names may be directly specified in “q”
Extended Dismax Parser (continued)
  boost parameter for multiplicative boost-by-function
  Pure negative query clauses
Example: solr OR (-solr)
  Enhanced term proximity boosting
  pf2=myfield – results in term bigrams in sloppy phrase queries
 myfield:"aa bb cc"  ->  myfield:"aa bb" myfield:"bb cc"
  Enhanced stopword handling
  stopwords omitted in main query, but added in optional proximity boosting part
Example: q=solr is awesome & qf=myfield & pf2=myfield  ->
 +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
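
The pf2 bigram and stopword rewrites above can be sketched in a few lines of Python. This is only an illustration of the transformation shown on the slide, not the parser's actual code; the function names and the stopword set are assumptions made for the example.

```python
def pf2_bigrams(field, terms):
    """One phrase clause per adjacent pair of terms (the pf2-style boost part)."""
    return [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]


def rewrite(field, query, stopwords):
    """Required clause without stopwords, optional bigram clauses with them."""
    terms = query.split()
    main = [t for t in terms if t not in stopwords]
    required = f"+{field}:({' '.join(main)})"
    optional = " ".join(pf2_bigrams(field, terms))
    return f"{required} ({optional})"


# Hypothetical stopword set; a real one comes from the analyzer configuration.
print(" ".join(pf2_bigrams("myfield", ["aa", "bb", "cc"])))
print(rewrite("myfield", "solr is awesome", {"is"}))
```

The stopword stays out of the required clause, so it cannot restrict matching, but it still contributes to the proximity boost.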
SPATIAL SEARCH

Spatial Search
Step1: Index some locations!
<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>

Step2: Decide where you are


&pt=44.0153371,-73.16734
&d=1
&sfield=store

Step3: Profit!

Spatial Filter: &fq={!geofilt}

Bounding Box: &fq={!bbox}

Distance Function: &sort=geodist() asc
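
A minimal Python sketch of the three steps, for readers who want to try them from a script. It builds the request parameters from the slide and, separately, estimates the great-circle distance between the indexed store and the query point. The `q=*:*` clause, the Earth radius constant and the assumption that `d` is in kilometers are not from the slide.

```python
import math
from urllib.parse import urlencode, parse_qs

EARTH_RADIUS_KM = 6371.0  # mean radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def spatial_params(pt, d, sfield):
    """Query string combining the filter and the distance sort from the slide."""
    return urlencode([
        ("q", "*:*"),              # assumed match-all query
        ("fq", "{!geofilt}"),
        ("pt", f"{pt[0]},{pt[1]}"),
        ("d", d),
        ("sfield", sfield),
        ("sort", "geodist() asc"),
    ])


store = (44.013617, -73.168264)    # indexed location
here = (44.0153371, -73.16734)     # the pt parameter
qs = spatial_params(here, 1, "store")
print(qs)
print(round(haversine_km(*here, *store), 3), "km from the store")
```

The computed distance is well under 1, so under the kilometer assumption this store would pass the `d=1` filter.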

RESULT GROUPING /
FIELD COLLAPSING
Field Collapsing Definition

  Field collapsing
  Limit the number of results per category
  “category” normally defined by unique values in a field

  Uses
  Web Search – collapse by web site
  Email threads – collapse by thread id

  Ecommerce/retail

  Show the top 5 items for each store category (music, movies, etc)
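
The definition above can be illustrated with a small in-memory sketch in Python. It assumes the documents are already sorted by relevance and keeps at most `limit` of them per distinct field value; the sample documents and function name are made up for the example and say nothing about how Solr implements collapsing.

```python
def collapse(ranked_docs, field, limit=1):
    """Keep at most `limit` docs per distinct value of `field`, preserving rank order."""
    groups = {}
    for doc in ranked_docs:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups


# Hypothetical, already-ranked results.
results = [
    {"id": 1, "category": "music"},
    {"id": 2, "category": "music"},
    {"id": 3, "category": "movies"},
    {"id": 4, "category": "music"},
]
for category, docs in collapse(results, "category", limit=2).items():
    print(category, [d["id"] for d in docs])
```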
Field Collapsing by Site
Result Grouping by Category
Field Collapse on Product Type
Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
"grouped":{
"manu_exact":{
"matches":3,
"groups":[{
"groupValue":"Belkin",
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"}]
}},
{
"groupValue":"Apple Computer Inc.",
"doclist":{"numFound":1,"start":0,"docs":[
{
"id":"MA147LL/A",
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]
&group.query=price:[100 TO *]&group.limit=5
"grouped":{
"price:[0 TO 99.99]":{
"matches":3,
"doclist":{"numFound":2,"start":0,"docs":[
{
"id":"IW-02",
"name":"iPod & iPod Mini USB 2.0 Cable"},
{
"id":"F8V7067-APL-KIT",
"name":"Belkin Mobile Power Cord for iPod"}]
}},
"price:[100 TO *]":{
"matches":3,
"doclist":{"numFound":1,"start":0,"docs":[
{
Grouping Params
parameter                        meaning                                                           default
group.field=<field>              Like facet.field – group by unique field values
group.query=<query>              Like facet.query – top docs that also match
group.function=<function query>  Group by unique values produced by the function query
group.limit=<n>                  How many docs per group                                           1
group.sort=<sort spec>           How to sort documents within a group                              Same as “sort” param
rows=<n>                         How many groups to return                                         10
sort=<sort spec>                 How to sort the groups relative to each other (based on top doc)
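
As a rough illustration of how these parameters combine, the Python sketch below assembles a grouping query string like the ones in the two preceding examples. The helper name and its keyword arguments are hypothetical; only the parameter names come from the table.

```python
from urllib.parse import urlencode, parse_qs


def grouping_query(q, field=None, queries=(), limit=1, rows=10):
    """Build a query string using the grouping parameters from the table."""
    params = [("q", q), ("group", "true"), ("group.limit", limit), ("rows", rows)]
    if field:
        params.append(("group.field", field))
    # group.query may be repeated, one entry per group
    params.extend(("group.query", gq) for gq in queries)
    return urlencode(params)


by_field = grouping_query("ipod", field="manu_exact")
by_query = grouping_query("ipod", queries=["price:[0 TO 99.99]", "price:[100 TO *]"], limit=5)
print(by_field)
print(parse_qs(by_query)["group.query"])
```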
FACETING
Pivot Faceting
  Other names that could have made sense:
  Grid Faceting, Cross-Product Faceting, Matrix Faceting
  Syntax: facet.pivot=field1,field2,field3,…
facet.pivot=cat,inStock

                    #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics     14      10                      4
cat:memory          3       3                       0
cat:connector       2       0                       2
cat:graphics card   2       0                       2
cat:hard drive      2       2                       0
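
The grid above is a cross-tabulation of two fields. A minimal Python sketch of that counting, assuming single-valued fields and a handful of made-up documents (neither is taken from the slide's index):

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Count docs per (outer value, inner value) pair, nested by outer value."""
    table = defaultdict(Counter)
    for doc in docs:
        table[doc[outer]][doc[inner]] += 1
    return table


# Hypothetical documents.
docs = [
    {"cat": "memory", "inStock": True},
    {"cat": "memory", "inStock": True},
    {"cat": "connector", "inStock": False},
    {"cat": "memory", "inStock": True},
    {"cat": "connector", "inStock": False},
]
for cat, by_stock in pivot_counts(docs, "cat", "inStock").items():
    print(cat, sum(by_stock.values()), by_stock[True], by_stock[False])
```

Each printed row has the same shape as a row of the table: total, in stock, not in stock.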

Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity
"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
(continued)
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

Annotations: 14 docs w/ cat==electronics; 5 docs w/ cat==electronics && popularity==6
Range Faceting
•  Like Date faceting, but more generic

http://...&facet=true
  &facet.range=price
  &facet.range.start=0
  &facet.range.end=500
  &facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
Existing single-valued faceting algorithm
[Diagram: documents matching the base query “Juggernaut” (q=Juggernaut&facet=true&facet.field=hero) are looked up in the Lucene FieldCache Entry (StringIndex) for the “hero” field. order: for each doc, an index into the lookup array. lookup: the string values ((null), batman, flash, spiderman, superman, wolverine). Each matching doc increments a slot in the accumulator, and a priority queue keeps the top counts (flash, 5; Batman, 3).]
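
The diagram's flow can be sketched in Python as follows. The `lookup` values come from the slide, but the `order` array and the base document set are invented so that the counts come out as on the slide; this is an illustration of the idea, not Lucene's FieldCache code.

```python
import heapq

# lookup: sorted string values, slot 0 meaning "no value"
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
# order: for each doc id, an index into lookup (hypothetical data)
order = [2, 1, 2, 0, 2, 1, 5, 2, 1, 2]


def facet_counts(base_docs, order, lookup, limit):
    """Count values over the base doc set, then keep the top `limit` terms."""
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1          # the "increment" step
    candidates = ((count, lookup[i]) for i, count in enumerate(accumulator)
                  if i > 0 and count > 0)     # skip the missing-value slot
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]


base_docs = [0, 1, 2, 4, 5, 7, 8, 9]          # docs matching the base query
print(facet_counts(base_docs, order, lookup, limit=2))
```

Counting works on small integers only; the strings are touched just once, when the top entries are reported.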
Per-segment single-valued algorithm
[Diagram: the Base DocSet is split across Segment1 to Segment4. Each segment has its own FieldCache Entry and its own accumulator (accumulator1 to accumulator4), filled by its own thread (thread1 to thread4) with the same lookup/inc steps. A FieldCache + accumulator merger (Priority queue) combines the per-segment results into the final priority queue (flash, 5; Batman, 3).]
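
A rough Python sketch of the per-segment variant: each segment is counted independently on a thread pool, and the partial counts are merged by term, since ordinals differ from segment to segment. The segment data is hypothetical and the merge here uses a `Counter` rather than the priority-queue merger in the diagram.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment):
    """Count one segment; segment = (order, lookup, base_docs) local to that segment."""
    order, lookup, base_docs = segment
    accumulator = [0] * len(lookup)
    for doc in base_docs:
        accumulator[order[doc]] += 1
    return Counter({lookup[i]: c for i, c in enumerate(accumulator) if i > 0 and c > 0})


def facet_per_segment(segments, limit, threads=4):
    """Count segments in parallel, then merge the partial counts by term."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(count_segment, segments))
    merged = Counter()
    for partial in partials:
        merged.update(partial)                # the extra merge step
    return merged.most_common(limit)


# Hypothetical segments, each with its own ordinals.
segments = [
    ([2, 1, 2], [None, "batman", "flash"], [0, 1, 2]),
    ([1, 1, 2, 1], [None, "flash", "superman"], [0, 1, 3]),
    ([1, 1], [None, "batman"], [0, 1]),
]
print(facet_per_segment(segments, limit=2))
```

Only a new segment needs a new `order`/`lookup` pair, which is why this layout suits a frequently changing index, at the cost of the merge.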
Per-segment faceting
  Enable with facet.method=fcs
  Controllable multi-threading
facet.field={!threads=4}myfield  
  Disadvantages
  Larger memory use (FieldCaches + accumulators)
  Slower (extra FieldCache merge step needed)
  Advantages
  Rebuilds FieldCache entries only for new segments (NRT friendly)
  Multi-threaded
Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

Time for request*        facet.method=fc   facet.method=fcs
static index             3 ms              244 ms
quickly changing index   1388 ms           267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*        facet.method=fc   facet.method=fcs
static index             26 ms             34 ms
quickly changing index   741 ms            94 ms

*complete request time, measured externally


Faceting Performance Improvements

  For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
  Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
  Optimized deep facet paging – up to 10x faster with really large facet.offsets
  Less memory consumed by field cache entries

SCALABILITY
SolrCloud
  First steps toward simplifying cluster management
  Integrates Zookeeper
  Central configuration (schema.xml, solrconfig.xml, etc)
  Tracks live nodes + shards of collections
  Removes need for external load balancers
shards=localhost:8983/solr|localhost:8900/solr,  
             localhost:7574/solr|localhost:7500/solr  
  Can specify logical shard ids
shards=NY_shard,NJ_shard  
  Clients don’t need to know shards at all:
http://localhost:8983/solr/collection1/select?distrib=true
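
For the explicit form of the `shards` parameter shown above, a short Python sketch of how the value is assembled: `|` separates equivalent replicas of one shard, `,` separates shards. The helper name is made up for the example.

```python
def shards_param(shards):
    """Join replicas of each shard with '|' and the shards themselves with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


cluster = [
    ["localhost:8983/solr", "localhost:8900/solr"],   # shard 1 and its replica
    ["localhost:7574/solr", "localhost:7500/solr"],   # shard 2 and its replica
]
print("shards=" + shards_param(cluster))
print("shards=" + shards_param([["NY_shard"], ["NJ_shard"]]))
```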
SolrCloud : The Future

  Eliminate all single points of failure
  Remove Master/Searcher distinction
  Enables near real-time search in a highly scalable environment
  High Availability for Writes
  Eventual consistency model (like Amazon Dynamo, Cassandra)
  Elastic
  Simply add/subtract servers, cluster will rebalance automatically
  By default, Solr will handle document partitioning
ODDS & ENDS
Auto-Suggest
  Many people currently use terms component
  Can be slow for a large corpus
  New auto-suggest builds off SpellCheck component
  Compact memory based trie for really fast completions
  Based on a field in the main index, or on a dictionary file
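
To make the trie idea concrete, here is a deliberately simple Python sketch of prefix completion. It is not the compact structure Solr uses and the word list is invented; it only shows why lookups depend on the prefix length rather than on the size of the corpus.

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}
        self.terminal = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, in sorted order."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack and len(found) < limit:
            node, word = stack.pop()
            if node.terminal:
                found.append(word)
            # push in reverse so the smallest character is visited first
            for ch in sorted(node.children, reverse=True):
                stack.append((node.children[ch], word + ch))
        return found


trie = Trie(["ultrasharp", "ultra", "usb", "ipod"])   # hypothetical dictionary
print(trie.complete("ult"))
```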

http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
Index with JSON
$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{
"add": {
  "doc": {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
}
}'
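
The same request can be prepared from Python with only the standard library. The sketch below builds the request object but does not send it, so it runs without a Solr server; the helper name is hypothetical and the URL is the one from the curl example.

```python
import json
import urllib.request


def build_update_request(doc, url="http://localhost:8983/solr/update/json"):
    """Wrap one document in an "add" command and prepare a JSON POST."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


req = build_update_request({
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "inStock": True,
    "price": 12.50,
})
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a running server.
```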
Query Results in CSV
http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

  Can handle multi-valued fields (see “cat” field in example)


  Completely compatible with the CSV update handler (can round-trip)

  Results are streamed – good for dumping entire parts of the index
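
A short Python sketch of reading such a response, using the three rows from the example above. It assumes, as the example suggests, that a multi-valued field arrives as one quoted cell with comma-separated values; the function name and the `multi_valued` argument are made up for the illustration.

```python
import csv
import io

SAMPLE = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''


def parse_csv_results(text, multi_valued=("cat",)):
    """Parse a CSV response, splitting the listed multi-valued fields into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


for row in parse_csv_results(SAMPLE):
    print(row["name"], row["cat"], float(row["price"]))
```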

http://localhost:8983/solr/browse

Q&A
For more information about Solr visit
www.lucidimagination.com
