cominvent as
                         Enterprise Search Specialists




               Migrating FAST to Solr
                            By Jan Høydahl
               Oslo Enterprise Search MeetUp May 2010

Jan Høydahl

                  ●   IT architect - search,
                      telecom, mobile
                  ●   Helped build FAST's Global
                      Services as first engineer
                  ●   Founder of Cominvent AS
                  ●   Search consultant 10 years




Consulting

    – Cominvent delivers independent search consulting
    – Focus on Apache Lucene/Solr & Microsoft FAST ESP

               Idea –> architecture –> implementation




Commercial Support (Solr/Lucene)

    – When community & mailing list support is not enough..
    – Paid support agreement for Apache Solr/Lucene
    – In cooperation with Lucid Imagination

    – Read more: http://www.cominvent.com/support/




Training

    – Cominvent AS delivers public and on-site training
    – Certified Solr Training Partner for Lucid Imagination
    – Certified FAST ESP Training Partner

    – Read more: http://www.cominvent.com/training/




                                                       Photo: fluidpowerzone.com
Solr course




FAST & Solr are very similar...




Areas of usage




Common features




Common features




Introduction to...


                      ...for FAST people




Apache Solr - characteristics




                                    Search server


(Commercially friendly)




Apache Solr - characteristics




           Modular                  Community




  Contributions & patches
                                    Light weight
Solr-user community growth

    [Chart: Solr-user growth; messages per month, Jan 2006 to Feb 2010, y-axis 0 to 1600]

Lucene/Solr deployments




    – More: http://wiki.apache.org/solr/PublicServers
                                              Thanks to Lucid Imagination for logo collection
XML/HTTP




Solr Architecture




The Apache Software Foundation




Other ASF Lucene sub-projects

                        – Lucene Java library



                        – Rich document extraction


                        – Crawling web pages



                        – Machine learning
                           • Classification/clustering
                           • Collaborative filtering...
Introduction to...



                      ...for Solr people




FAST ESP – characteristics & key strengths




                                      Security




                       Connectors
FAST ESP – characteristics & key strengths




FAST ESP – characteristics & key strengths

    – Very strong document processing framework
    – Diagram labels: Format Conversion, Language Detection,
      Linguistic Normalization, Entities, Taxonomy, Sentiment,
      Custom Plug-in, Ontology, Search, Alert
    – Sample document shown on the slide:
        PARIS (Reuters) - Venus Williams raced into the second round of
        the $11.25 million French Open Monday, brushing aside Bianka
        Lamade, 6-3, 6-3, in 65 minutes.

        The Wimbledon and U.S. Open champion, seeded second, breezed past
        the German on a blustery center court to become the first seed to
        advance at Roland Garros. "I love being here, I love the French
        Open and more than anything I'd love to do well here," the
        American said.

        A first round loser last year, Williams is hoping to progress
        beyond the quarter-finals for the first time in her career.

FAST ESP architecture




The migration...




Migration objectives

    – Possible objectives include:
        •   Lower maintenance cost
        •   Deeper in-house competency
        •   Less dependent on external consultants
        •   Ownership and visibility of source code
        •   Shorter time to market for new features
        •   Bugs fixed faster – or even fix them ourselves
        •   Larger community, mailing lists that work!
        •   More choice in external consultants
        •   Contribute back to Open Source
        •   Lower HW footprint



Migration steps

    – Knowledge gathering & Training
    – Review current features & arch
        • Want to keep all features? Add new?
    – Migration areas:
        •   Index profile
        •   Content
        •   Feeding
        •   Document Processing
        •   Querying
        •   Search middleware?
        •   Admin & Operational
    – What to do in Application space vs Search space?

Feature comparison ESP – Solr (similarities)

 Feature                            ESP                         Solr
 Full-text, boolean, range search,  Yes                         Yes
   sorting, sub-second, facets,
   did-you-mean, synonyms
 Scaling for QPS                    Add rows                    Add rows
 Scaling for document volume        Add columns                 Add shards
 Synonyms                           Index/query side            Index/query side
 GEO search                         Yes                         Yes (1.5)
 Boolean query language             Yes (FQL)                   Yes (Lucene or (e)DisMax)
 APIs                               HTTP, Java, .NET, C++, PHP  HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS

Feature comparison ESP – Solr (differences)

 Feature                   ESP                        Solr
 Admin server              Yes                        No (coming 1.5)
 Processes                 Many (C++, Java, Python)   One WAR in Java app-server, 100% Java
 Navigators / Facets       Index-time                 Query-time
 Did-you-mean              Dictionary based           Dictionary or index based
 Feeding                   API only                   HTTP POST or API
 Document processing       Pipeline (py)              Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
 Multi field querying      Composite fields           DisMax handler


Feature comparison ESP – Solr (differences)

 Feature                            ESP                                   Solr
 Relevancy tuning                   Rank profiles, term boosting          Dynamic function queries and boost functions
 XRANK                              XRANK operator                        Function Queries
 Freshness boost                    Freshness in rank profile             Function Queries
 Boost GEO distance                 Rank profile and special              Function Queries
 Major schema or software updates   Cold update, use stage environment    Stage new content into new Solr core
 Pluggability                       Docprocs, QT/RP (limited), clients    Everything :) Request Handlers, Query Parsers,
                                                                          Docprocs, Rank, Spell, tokenizer++
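
To make the freshness boost row above concrete, here is a minimal Ruby sketch of how a function query such as recip(rord(creationDate),1,1000,1000) decays with document age. Solr's recip(x,m,a,b) evaluates a/(m*x+b), and rord gives the newest value the lowest ordinal. The ordinals below are made-up sample values, and the script only reproduces the arithmetic; it does not talk to Solr.

```ruby
# recip(x, m, a, b) in Solr evaluates to a / (m * x + b).
def recip(x, m, a, b)
  a.to_f / (m * x + b)
end

# rord(field) is the reverse ordinal of the field value: the newest
# document gets the lowest number, older documents progressively higher ones.
# The ordinals below are hypothetical sample values.
[1, 1000, 10_000, 100_000].each do |rord|
  boost = recip(rord, 1, 1000, 1000)
  puts format('rord=%-7d boost=%.4f', rord, boost)
end
```

The boost is close to 1 for the newest documents and halves once the reverse ordinal reaches 1000, so the last two arguments control how quickly freshness stops mattering.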
Feature comparison ESP – Solr (differences)

 Feature                  ESP                                                 Solr
 Lemmatization            Can be licensed for many languages                  Can be licensed from 3rd party
 Query syntax             and(a:foo, b:bar)                                   a:foo AND b:bar
                          i:range(0, 100)                                     i:[0 TO 100]
                          d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00)   d:[2000-01-01T00:00:00Z TO NOW]
 Query params             query=                                              q=
                          offset=                                             start=
                          hits=                                               rows=
                          spell=1                                             spellcheck=true
 What fields to return    view=viewname                                       fl=title,price,body...
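
The query parameter rows above map one-to-one, so a front end can translate them mechanically. The following Ruby sketch shows one way to do that using only the standard library. The helper name esp_to_solr_params is hypothetical, parameters not listed in the table are simply dropped, and the query string itself is passed through untranslated (FQL still has to be rewritten into Lucene or DisMax syntax separately).

```ruby
require 'uri'

# Parameter names taken from the comparison table (ESP name => Solr name).
PARAM_MAP = {
  'query'  => 'q',
  'offset' => 'start',
  'hits'   => 'rows'
}.freeze

# Translate a hash of ESP-style parameters into Solr-style ones.
# Parameters that are not in the table are dropped in this sketch; a real
# migration would have to decide what to do with each of them.
def esp_to_solr_params(esp_params)
  solr = {}
  esp_params.each do |name, value|
    if name == 'spell'
      # ESP uses spell=1, Solr uses spellcheck=true
      solr['spellcheck'] = (value.to_s == '1').to_s
    elsif PARAM_MAP.key?(name)
      solr[PARAM_MAP[name]] = value.to_s
    end
  end
  solr
end

esp = { 'query' => 'oslo', 'offset' => 10, 'hits' => 5, 'spell' => 1 }
puts URI.encode_www_form(esp_to_solr_params(esp))
```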

Feature comparison ESP – Solr (differences)

 Feature                  ESP                   Solr
 Search XML hierarchy     Yes, scope search     No
 Reports                  Built in analytics    Use 3rd party log analysis such as Splunk.com




Your existing FAST system - overview

                       Your web-app


                                      Search middleware?




                                              Graphics diagram: www.microsoft.com
Migrating index profile

    – ESP index profile -> Solr schema.xml
    – Set up field types, use defaults or create your own
    – Set up the static fields. ESP:



    – Solr equivalent:



    – No need for generic*, use dynamic fields:
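
As a rough sketch of the mapping described above, the Ruby script below turns a list of field definitions into Solr schema.xml <field> elements and adds one <dynamicField> rule. The field list, the ESP type names and the type mapping are all hypothetical; the Solr fieldType names (text, float, date, string) are assumed to be declared in your schema.xml.

```ruby
# Hypothetical field list (name, ESP type); not taken from a real index profile.
ESP_FIELDS = [
  { name: 'title',        type: 'string' },
  { name: 'price',        type: 'float' },
  { name: 'creationDate', type: 'datetime' }
].freeze

# Assumed mapping from ESP types to fieldType names declared in schema.xml.
TYPE_MAP = { 'string' => 'text', 'float' => 'float', 'datetime' => 'date' }.freeze

# Render one static Solr field definition.
def solr_field(field)
  type = TYPE_MAP.fetch(field[:type])
  %(<field name="#{field[:name]}" type="#{type}" indexed="true" stored="true"/>)
end

ESP_FIELDS.each { |f| puts solr_field(f) }

# A dynamic field rule: any field whose name ends in _s is accepted as a
# string without further schema edits (e.g. color_s).
puts %(<dynamicField name="*_s" type="string" indexed="true" stored="true"/>)
```

The generated lines belong inside the <fields> section of schema.xml; review them by hand, since analysis and ranking settings rarely map one-to-one.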



Migrating index profile

    – Composite fields?
        • Solr can use <copyField> to copy multiple fields into
          one, e.g. as we did when mapping many attributes into
          one field
        • However, Solr does not need a composite field to rank
          with a different boost per field. Use the DisMax query
          handler instead. Very powerful!
    – No need to edit the schema to add new fields. Using
      dynamic fields, it is easy to e.g. introduce a color facet
      for cars or an Mpixels facet for digital cameras




DisMax query example

    – This Solr query can replace the use of a composite field
        • qt=dismax
        • q=oslo
        • qf=title^0.7 highpriorityfields^1.5
          mediumpriorityfields^0.6 lowpriorityfields^0.2
          recallfields^0.0 body^0.0
        • bf=recip(rord(creationDate),1,1000,1000)
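
The parameters above can be assembled into a request URL with Ruby's standard library, which also takes care of URL-encoding the ^ boosts and the spaces in qf. This is only a sketch: the helper name dismax_url is hypothetical, and the base URL is an assumption about the local installation.

```ruby
require 'uri'

# Build a DisMax request URL. The field names and boosts are the ones from
# the slide; the base URL is an assumption about the local installation.
def dismax_url(user_query, base = 'http://localhost:8080/solr/select')
  params = {
    'qt' => 'dismax',
    'q'  => user_query,
    'qf' => 'title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 ' \
            'lowpriorityfields^0.2 recallfields^0.0 body^0.0',
    'bf' => 'recip(rord(creationDate),1,1000,1000)'
  }
  uri = URI(base)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts dismax_url('oslo')
```

Keeping the boosts in one place like this makes it easy to tune relevancy without reindexing, which is the main advantage over a composite field.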




Migrating content

    – If using FAST ContentAPI to push programmatically
        • Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
    – If feeding FastXML using FileTraverser
        • Feed as Solr XML using HTTP POST or a POST client




    – If you feed custom XML with XMLMapper
        • Have a look at DIH's import and mapping features


Push Feeding example

    – Feed XML using HTTP POST:
        • curl 'http://localhost:8080/solr/update?commit=true' \
            -H "Content-Type: text/xml" \
            --data-binary @mydoc.xml
    – Ruby example (install the gem from the shell, then run the script):
        • $ gem sources -a http://gemcutter.org
          $ sudo gem install rsolr
        • require 'rsolr'
          # Point the client at the Solr webapp, same base URL as the curl call
          solr = RSolr.connect :url=>'http://localhost:8080/solr'
          documents = [{:id=>1, :price=>1.00},
                       {:id=>2, :price=>10.50}]
          solr.add documents
          solr.commit
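
If you would rather not depend on a client gem, the same push can be sketched with Ruby's standard library alone: build the Solr <add> XML message yourself and POST it to the update handler. The helper names solr_add_xml and post_to_solr are hypothetical, and the URL assumes the same local Solr instance as the curl example.

```ruby
require 'net/http'
require 'uri'

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape characters that are not allowed verbatim in XML text or attributes.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build a Solr <add> message from an array of hashes (field name => value).
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.map do |name, value|
      %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{docs.join}</add>"
end

# POST the XML to the update handler and commit in the same request.
# The default URL is an assumption about the local installation.
def post_to_solr(xml, url = 'http://localhost:8080/solr/update?commit=true')
  uri = URI(url)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'text/xml'
  request.body = xml
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

xml = solr_add_xml([{ id: 1, price: 1.00 }, { id: 2, price: 10.50 }])
puts xml
# post_to_solr(xml)  # uncomment when a Solr instance is running
```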


Pull: DataImportHandler (DIH)




Querying examples

    – http://localhost:8080/solr/select?q=car&fl=id,title




    – Ruby
        • res=solr.select :q=>'roses', :fq=>['red','white']
          res['response']['docs'].each do |doc|
            puts doc['title']
          end

Migrating document processing

    – Solr lacks a sophisticated pipeline with entity
      extraction etc. Alternatives:
        • Do extraction in Application space (Ruby)
        • Write your own stage in the Solr pipeline for simple cases
        • Integrate                 to do more advanced stuff
    – Matchers/extractors
        • LingPipe NamedEntityExtractor inside of OpenPipeline
    – Synonyms:
        • Use Solr's synonym handling index/query side
    – Custom stages:
        • Write a Solr UpdateProcessor (in Java, Jython etc)
    – Got a LOT of custom FAST docproc stages?
        • Have a look at SESAT's PY ProcServer for Solr (GPL)
Migrating linguistics (lemmatization)

    – Solr ships with stemming instead of lemmatization
    – Stemming has limitations (Norwegian examples):
        • biler, bilen, bilene -> bil ("car")
          BUT
        • bøker, bøkene -> bøk; boka, bok -> bok
          (forms of the same word, "book", end up as two different stems)
    – KStem is better. Free with LucidWorks for Solr
    – If you need singular/plural handling only
        • Free dictionaries? Check lucene-hunspell
    – Lemmatization can be licensed from a 3rd party
      such as Basistech, which also offers language
      identification & entity extraction
    – Language identification is also available from Sematext

Basistech Rosette for Lucene

    – High-end linguistics capabilities for
      19 languages
    – Language Identification
    – Segmentation and tokenization
    – Lemmatization
    – Noun decompounding
    – Part-of-speech tagging
    – Entity extraction

    – Easily integrated with Lucene/Solr

    – More: http://www.basistech.com/lucene/

Migrating search middleware

    – Using FAST Unity?
        • Consider migrating middleware logic such as external
          source querying and federation to SESAT (AGPL)
    – Using Comperio Front?
        • Ask Comperio for Solr engine support
        • Or migrate custom Q&R formats
    – Or is plain Solr enough?
        • Solr has built-in support for shards
        • A shard query will query multiple shards
          and merge the results into one
        • Add custom processing as Query
          Components in Solr
        • Check contrib & patches!
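
A shard query as described above is an ordinary select request with an extra shards parameter listing the cores to search. A minimal Ruby sketch of building such a URL follows; the helper name sharded_query_url and the shard host names are hypothetical, and the base URL is an assumption about which node receives the request and merges the results.

```ruby
require 'uri'

# Build a distributed query URL. The node at `base` receives the request,
# queries every shard listed and merges the results into one response.
# Shards are given as host:port/path, without the http:// prefix.
def sharded_query_url(query, shards, base = 'http://localhost:8080/solr/select')
  uri = URI(base)
  uri.query = URI.encode_www_form('q' => query, 'shards' => shards.join(','))
  uri.to_s
end

puts sharded_query_url('car', ['solr1:8080/solr', 'solr2:8080/solr'])
```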

Migrating Front ends

    – Using a middleware with Solr support? Lucky you!
    – If not, consider introducing one now. Look at (Java):




    – If you decide to migrate from FAST Java/.NET APIs
        • Choose SolrJ or SolrNET
        • Query language differences, e.g. &fq= instead of filter()
        • Solr facets do not require sessions/state, unlike FAST's
    – Migrate FAST's "views" into named request handler configs
    – Multi lingual: Need to handle title_no, title_en etc... :(

Migrating Web Crawler

    – Solr has no built-in web crawler
        • Instead you can choose from several integrations
    – The Apache Nutch crawler
        • Proven with hundreds of millions of pages
        • http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
    – Apache Droids
        • Still in the Apache Incubator, but aims at becoming a full crawler
        • http://incubator.apache.org/droids/
    – Heritrix + Solr (example in Solr 1.4 book)
    – OpenPipeline has a (very) simple crawler
    – Lucene Connectors Framework
        • Preparing crawler support

Migrating Connectors

    – Solr handles these sources internally through DIH:
        • Database, RSS, Web-services, Local filesystem
    – Additionally through Lucene Connectors Framework:
        • EMC Documentum, FileNet, JDBC, LiveLink, Patriarch
          (Memex), Meridio, SharePoint, RSS
        • New connectors should be written for LCF
    – Another option:
        • SharePoint, IMAP, Documentum, Vignette, Filesystem



Operations

    – Solr has no admin-server (coming in 1.5)
    – Possible to run multiple Tomcat instances on the same server
    – Multiple cores in the same Tomcat – easier migration
    – No built-in query reports, use 3rd party tools
    – No built-in monitoring, have a look at



    – Log analysis? Check out




More info




Thank You


               www.cominvent.com



               jh@cominvent.com


               www.twitter.com/cominvent


               linkedin.com/in/janhoy

                  This presentation is licensed under a CC-by-sa license.
                  You must attribute Cominvent with name and link.

More Related Content

PDF
Oslo Solr MeetUp March 2012 - Solr4 alpha
PDF
Key topics when migrating from FAST to Solr, EuroCon 2010
ODP
Apache SolrCloud
PPTX
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
PDF
Building Lanyrd
PDF
Oracle LOB Internals and Performance Tuning
PDF
SQL Performance Improvements At a Glance in Apache Spark 3.0
PPT
Cassandra NoSQL
Oslo Solr MeetUp March 2012 - Solr4 alpha
Key topics when migrating from FAST to Solr, EuroCon 2010
Apache SolrCloud
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
Building Lanyrd
Oracle LOB Internals and Performance Tuning
SQL Performance Improvements At a Glance in Apache Spark 3.0
Cassandra NoSQL

What's hot (13)

PDF
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
PDF
Scarab: SAT-based Constraint Programming System in Scala / Scala上で実現された制約プログラ...
PDF
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
PDF
Binary Obfuscation from the Top Down: Obfuscation Executables without Writing...
PDF
Installing & Configuring OpenLDAP (Hands On Lab)
PDF
OpenLDAP configuration brought to Apache Directory Studio
PPTX
An introduction to ROP
PPT
Scaling web applications with cassandra presentation
PPTX
NYC Lucene/Solr Meetup: Spark / Solr
PPTX
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
PDF
How to make a simple cheap high availability self-healing solr cluster
PDF
In-Memory Evolution in Apache Spark
PDF
Enabling Vectorized Engine in Apache Spark
Tuning Solr and its Pipeline for Logs: Presented by Rafał Kuć & Radu Gheorghe...
Scarab: SAT-based Constraint Programming System in Scala / Scala上で実現された制約プログラ...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Binary Obfuscation from the Top Down: Obfuscation Executables without Writing...
Installing & Configuring OpenLDAP (Hands On Lab)
OpenLDAP configuration brought to Apache Directory Studio
An introduction to ROP
Scaling web applications with cassandra presentation
NYC Lucene/Solr Meetup: Spark / Solr
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
How to make a simple cheap high availability self-healing solr cluster
In-Memory Evolution in Apache Spark
Enabling Vectorized Engine in Apache Spark
Ad

Viewers also liked (20)

PPT
Spirits Industry Tastings &amp; Special Events
PDF
Hcmf 2011 Rob Humphrey
PPSX
Is Your Income Protected?
PDF
Performance development Program - Inenrwealth Corporate Development
PPSX
English Home Work Oscar Tamara
PPT
Juarez Strategic Plan Association
PPT
Operation Al Fajr Iraq Nov 2004
PPTX
Angielskie metro
PPT
Cold Tundra Project Watts
PDF
Implementing ARIA for Real World Accessibility
PDF
Interim report Axfood Q3 2010
PPT
CRM AddOn Dial IT eCast
PDF
Using A Video Off The Internet
PDF
SocialMedia4SmallBiz_SpotOn
PPT
John Baird, General Manager, Freightwatch Mexico
PDF
Year End Report Axfood 2011
PPT
Il Modulo Nilde Utenti E L’Automazione Di Un Servizio
PDF
APSU Drupal Training Personal
PDF
Deloitte publicatie cloud diner
PPT
Group evaluation of Trapped
Spirits Industry Tastings &amp; Special Events
Hcmf 2011 Rob Humphrey
Is Your Income Protected?
Performance development Program - Inenrwealth Corporate Development
English Home Work Oscar Tamara
Juarez Strategic Plan Association
Operation Al Fajr Iraq Nov 2004
Angielskie metro
Cold Tundra Project Watts
Implementing ARIA for Real World Accessibility
Interim report Axfood Q3 2010
CRM AddOn Dial IT eCast
Using A Video Off The Internet
SocialMedia4SmallBiz_SpotOn
John Baird, General Manager, Freightwatch Mexico
Year End Report Axfood 2011
Il Modulo Nilde Utenti E L’Automazione Di Un Servizio
APSU Drupal Training Personal
Deloitte publicatie cloud diner
Group evaluation of Trapped
Ad

Similar to Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl (20)

PDF
Migrating Fast to Solr
PDF
Rapid prototyping with solr - By Erik Hatcher
PDF
Rapid Prototyping with Solr
PDF
NoSQL, Apache SOLR and Apache Hadoop
PDF
Find it, possibly also near you!
PDF
Introduction to Solr
PPTX
Building a real time, solr-powered recommendation engine
PDF
Rapid Prototyping with Solr
PDF
Introduction to Solr
PDF
Rapid Prototyping with Solr
KEY
ApacheCon Europe 2012 -Big Search 4 Big Data
PPTX
Introduction to Apache Lucene/Solr
PPTX
PDF
Krellenstein lucene revolution_2011_keynote_once_future_history_enterprise se...
PDF
Solr Powered Lucene
PDF
Solr Recipes
PDF
Get the most out of Solr search with PHP
PPTX
Solr site search makes shopping simple
PDF
Apache Solr crash course
PPTX
The Intent Algorithms of Search & Recommendation Engines
Migrating Fast to Solr
Rapid prototyping with solr - By Erik Hatcher
Rapid Prototyping with Solr
NoSQL, Apache SOLR and Apache Hadoop
Find it, possibly also near you!
Introduction to Solr
Building a real time, solr-powered recommendation engine
Rapid Prototyping with Solr
Introduction to Solr
Rapid Prototyping with Solr
ApacheCon Europe 2012 -Big Search 4 Big Data
Introduction to Apache Lucene/Solr
Krellenstein lucene revolution_2011_keynote_once_future_history_enterprise se...
Solr Powered Lucene
Solr Recipes
Get the most out of Solr search with PHP
Solr site search makes shopping simple
Apache Solr crash course
The Intent Algorithms of Search & Recommendation Engines

More from Cominvent AS (8)

PDF
Solr's missing plugin ecosystem
PDF
Scaling search with Solr Cloud
PDF
Improving the Solr Update Chain
PDF
First oslo solr community meetup lightning talk janhoy
PDF
Dagens Næringslivs overgang til Lucene/Solr søk
PDF
Open source breakfast norge findwise
PDF
Frokostseminar mai 2010 solr open source cominvent as
ODP
Cominvent AS company Presentation
Solr's missing plugin ecosystem
Scaling search with Solr Cloud
Improving the Solr Update Chain
First oslo solr community meetup lightning talk janhoy
Dagens Næringslivs overgang til Lucene/Solr søk
Open source breakfast norge findwise
Frokostseminar mai 2010 solr open source cominvent as
Cominvent AS company Presentation

Recently uploaded (20)

PDF
Why Endpoint Security Is Critical in a Remote Work Era?
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
NewMind AI Weekly Chronicles - July'25 - Week IV
PDF
How-Cloud-Computing-Impacts-Businesses-in-2025-and-Beyond.pdf
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Revolutionize Operations with Intelligent IoT Monitoring and Control
PPTX
Comunidade Salesforce São Paulo - Desmistificando o Omnistudio (Vlocity)
PPTX
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
PDF
agentic-ai-and-the-future-of-autonomous-systems.pdf
PDF
A Day in the Life of Location Data - Turning Where into How.pdf
PDF
GamePlan Trading System Review: Professional Trader's Honest Take
PDF
REPORT: Heating appliances market in Poland 2024
PDF
Transforming Manufacturing operations through Intelligent Integrations
PDF
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
PDF
Google’s NotebookLM Unveils Video Overviews
PDF
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
PDF
SparkLabs Primer on Artificial Intelligence 2025
PDF
CIFDAQ's Market Wrap: Ethereum Leads, Bitcoin Lags, Institutions Shift
PDF
Event Presentation Google Cloud Next Extended 2025
PPTX
CroxyProxy Instagram Access id login.pptx
Why Endpoint Security Is Critical in a Remote Work Era?
Understanding_Digital_Forensics_Presentation.pptx
NewMind AI Weekly Chronicles - July'25 - Week IV
How-Cloud-Computing-Impacts-Businesses-in-2025-and-Beyond.pdf
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Revolutionize Operations with Intelligent IoT Monitoring and Control
Comunidade Salesforce São Paulo - Desmistificando o Omnistudio (Vlocity)
The-Ethical-Hackers-Imperative-Safeguarding-the-Digital-Frontier.pptx
agentic-ai-and-the-future-of-autonomous-systems.pdf
A Day in the Life of Location Data - Turning Where into How.pdf
GamePlan Trading System Review: Professional Trader's Honest Take
REPORT: Heating appliances market in Poland 2024
Transforming Manufacturing operations through Intelligent Integrations
Security features in Dell, HP, and Lenovo PC systems: A research-based compar...
Google’s NotebookLM Unveils Video Overviews
Using Anchore and DefectDojo to Stand Up Your DevSecOps Function
SparkLabs Primer on Artificial Intelligence 2025
CIFDAQ's Market Wrap: Ethereum Leads, Bitcoin Lags, Institutions Shift
Event Presentation Google Cloud Next Extended 2025
CroxyProxy Instagram Access id login.pptx

Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

  • 1. cominvent as Enterprise Search Specialists Migrating FAST to Solr By Jan Høydahl Oslo Enterprise Search MeetUp May 2010 cominvent as
  • 2. Jan Høydahl ● IT architect - search, telecom, mobile ● Helped build FAST's Global Services as first engineer ● Founder of Cominvent AS ● Search consultant 10 years cominvent as
  • 4. Consulting – Cominvent delivers independent search consulting – Focus on Apache Lucene/Solr & Microsoft FAST ESP Idea –> architecture –> implementation cominvent as
  • 5. Commercial Support (Solr/Lucene) – When community & mailing list support is not enough.. – Paid support agreement for Apache Solr/Lucene – In cooperation with Lucid Imagination – Read more: http://www.cominvent.com/support/ cominvent as
  • 6. Training – Cominvent AS delivers training public and on-site – Certified Solr Training Partner for Lucid Imagination – Certified FAST ESP Training Partner – Read more: http://www.cominvent.com/training/ cominvent as Photo: fluidpowerzone.com
  • 9. FAST & Solr are very similar... cominvent as
  • 13. Introduction to... ...for FAST people cominvent as
  • 14. Apache Solr - characteristics Search server (Commercially friendly) cominvent as
  • 15. Apache Solr - characteristics Modular Community Contributions & patches Light weight cominvent as
  • 16. Solr-user community growth Solr-user growth 1600 1400 1200 1000 Messages 800 Column B 600 400 200 0 2006 Mar 2006 Jul 2006 Nov 2007 Mar 2007 Jul 2007 Nov 2008 Mar 2008 Jul 2008 Nov 2009 Apr 2009 Aug 2009 Dec 2006 Jan 2006 May 2006 Sep 2007 Jan 2007 May 2007 Sep 2008 Jan 2008 May 2008 Sep 2009 Feb 2009 Jun 2009 Oct 2010 Feb cominvent as Month
  • 17. Lucene/Solr deployments – More: http://wiki.apache.org/solr/PublicServers cominvent as Thanks to Lucid Imagination for logo collection
  • 20. The Apache Software Foundation cominvent as
  • 21. Other ASF Lucene sub-projects – Lucene Java library – Rich document extraction – Crawling web pages – Machine learning • Classification/clustering • Collaborative filtering... cominvent as
  • 22. Introduction to... ...for Solr people cominvent as
  • 23. FAST ESP – characteristics & key strengths Security Connectors cominvent as
  • 24. FAST ESP – characteristics & key strengths cominvent as
  • 25. FAST ESP – characteristics & key strengths – Very strong document processing framework Format Language Linguistic Conversion Detection Normalization Entities Custom Taxonomy Sentiment Ontology Plug-in PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, Search Alert brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said. A first round loser last year, Williams is hoping to progress cominvent as beyond the quarter-finals for the first time in her career.
  • 28. Migration objectives – Possible objectives include: • Lower maintenance cost • Deeper in-house competency • Less dependent on external consultants • Ownership and visibility of source code • Shorter time to market for new features • Bugs fixed faster – or even fix ourselves • Larger community, mailing lists that work! • More choice in external consultants • Contribute back to Open Source • Lower HW footprint cominvent as
  • 29. Migration steps – Knowledge gathering & Training – Review current features & arch • Want to keep all features? Add new? – Migration areas: • Index profile • Content • Feeding • Document Processing • Querying • Search middleware? • Admin & Operational – What to do in Application space vs Search space? cominvent as
  • 30. Feature comparison ESP – Solr (similarities) Feature ESP Solr Full-text, boolean, range search, Yes Yes sorting, sub-second, facets, did-you- mean, synonyms, faceting Scaling for QPS Add rows Add rows Scaling for document volume Add columns Add shards Synonyms Index/query side Index/query side GEO search Yes Yes (1.5) Boolean query language Yes (FQL) Yes (Lucene or (e)DisMax) APIs HTTP, Java, .NET, HTTP, Java, .NET, C++, PHP Ruby, Python, PHP, Perl, JS cominvent as
  • 31. Feature comparison ESP – Solr (differences) Feature ESP Solr Admin server Yes No (coming 1.5) Processes Many (C++, Java, One WAR in Java Python) app-server, 100% Java Navigators / Facets Index-time Query-time Did-you-mean Dictionary based Dictionary or index based Feeding API only HTTP POST or API Document processing Pipeline (py) Simple pipeline (Java, JS, Groovy, Jython, JRuby..) Multi field querying Composite fields DisMax handler cominvent as
  • 32. Feature comparison ESP – Solr (differences) Feature ESP Solr Relevancy tuning Rank profiles, term Dynamic function boosting queries and boost functions XRANK XRANK operator Function Queries Freshness boost Freshness in rank Function Queries profile Boost GEO distance Rank profile and Function Queries special Major schema or software updates Cold update, use Stage new content stage environment into new Solr core Pluggability Docprocs, QT/RP Everything :) (limited), clients Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++ cominvent as
  • 33. Feature comparison ESP – Solr (differences) Feature ESP Solr Lemmatization Can be licensed Can be licensed for many from 3rd party languages Query syntax and(a:foo, b:bar) a:foo OR b:bar i:range(0, 100) I:[0 TO 100] d:range(2000-01- d:[2000-01- 01T00:00:00, 01T00:00:00Z TO 2010-03- NOW] 03T12:00:00) Query params query= q= offset= start= hits= rows= spell=1 spellcheck=true What fields to return view=viewname fl=title,price,body... cominvent as
  • 34. Feature comparison ESP – Solr (differences) Feature ESP Solr Search XML hierarchy Yes, scope search No Reports Built in analytics Use 3rd party log analysis such as Splunk.com cominvent as
  • 35. Your existing FAST system - overview
    [Diagram labels: Your web-app; Search middleware? Graphics diagram: www.microsoft.com]
  • 36. Migrating index profile
    – ESP index profile -> Solr schema.xml
    – Set up field types; use the defaults or create your own
    – Set up the static fields. ESP:
    – Solr equivalent:
    – No need for generic*, use dynamic fields:
  • 37. Migrating index profile
    – Composite fields?
      • Solr can use <copyField> to copy multiple fields into one, e.g. as we did to map many attributes into one field
      • However, to rank with a different boost for each field, Solr does not need a composite field. Use the DisMax query handler instead. Very powerful!
    – No need to edit the schema to add new fields. Using dynamic fields, it is easy to e.g. introduce a color facet for cars or an Mpixels facet for digital cameras
  • 38. DisMax query example
    – This Solr query can replace use of a composite field
      • qt=dismax
      • q=oslo
      • qf=title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 lowpriorityfields^0.2 recallfields^0.0 body^0.0
      • bf=recip(rord(creationDate),1,1000,1000)
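
The DisMax parameters above can be assembled programmatically, which keeps the per-field boosts in one place. A minimal sketch using only the Ruby standard library: the helper name `dismax_params` is hypothetical, and the host, port and `/solr/select` path are assumptions borrowed from the other examples in this deck.

```ruby
require 'uri'

# Field weights copied from the slide.
FIELD_BOOSTS = {
  'title'                => 0.7,
  'highpriorityfields'   => 1.5,
  'mediumpriorityfields' => 0.6,
  'lowpriorityfields'    => 0.2,
  'recallfields'         => 0.0,
  'body'                 => 0.0
}.freeze

# Build the DisMax request parameters for a user query.
# boost_function is optional and is sent as bf= when given.
def dismax_params(query, field_boosts, boost_function = nil)
  params = {
    'qt' => 'dismax',
    'q'  => query,
    # qf is a space-separated list of field^boost pairs
    'qf' => field_boosts.map { |field, boost| "#{field}^#{boost}" }.join(' ')
  }
  params['bf'] = boost_function if boost_function
  params
end

params = dismax_params('oslo', FIELD_BOOSTS, 'recip(rord(creationDate),1,1000,1000)')
# encode_www_form takes care of escaping spaces, ^ and parentheses
puts 'http://localhost:8080/solr/select?' + URI.encode_www_form(params)
```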
  • 39. Migrating content
    – If using FAST ContentAPI to push programmatically
      • Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
    – If feeding FastXML using FileTraverser
      • Feed as Solr XML using HTTP POST or a POST client
    – If you feed custom XML with XMLMapper
      • Have a look at the import and mapping features of DIH (DataImportHandler)
  • 40. Push feeding example
    – Feed XML using HTTP POST:
      • curl 'http://localhost:8080/solr/update?commit=true' -H "Content-Type: text/xml" --data-binary @mydoc.xml
    – Ruby example:
      • > gem sources -a http://gemcutter.org
        > sudo gem install rsolr

        require 'rsolr'
        # Connect to the Solr webapp (same host and port as the curl example)
        solr = RSolr.connect :url => 'http://localhost:8080/solr'
        # Each document is a hash of field name => value
        documents = [{:id => 1, :price => 1.00}, {:id => 2, :price => 10.50}]
        solr.add documents
        # Commit to make the new documents searchable
        solr.commit
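
If you would rather not depend on a client gem, the XML that the curl example posts (the `mydoc.xml` file) is simple to generate yourself. A minimal sketch, assuming the standard Solr `<add><doc><field name="...">` update format; the helper names `xml_escape` and `solr_add_xml` are hypothetical, and the sample documents are the ones from the Ruby example above.

```ruby
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that are not allowed verbatim in XML text or attributes.
def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Turn an array of hashes (field name => value) into a Solr <add> message.
# Array values are written as repeated fields (multi-valued fields).
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.flat_map do |name, value|
      Array(value).map do |single|
        %(<field name="#{xml_escape(name)}">#{xml_escape(single)}</field>)
      end
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{docs.join}</add>"
end

documents = [{ id: 1, price: 1.00 }, { id: 2, price: 10.50 }]
puts solr_add_xml(documents)
```

The resulting string can be saved to a file and posted with the curl command shown on the slide.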
  • 42. Querying examples
    – http://localhost:8080/solr/select?q=car&fl=id,title
    – Ruby
      • # solr is the RSolr connection from the push feeding example
        res = solr.select :q => 'roses', :fq => ['red', 'white']
        # Print the title of every returned document
        res['response']['docs'].each do |doc|
          puts doc['title']
        end
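
The `res['response']['docs']` structure in the Ruby example is also what Solr returns over plain HTTP when you add `wt=json` to the select URL. The sketch below parses such a response with the standard library only. The sample JSON is hand-written for illustration (the two documents are made up), and `titles_from` is a hypothetical helper name.

```ruby
require 'json'

# Hand-written sample shaped like a Solr wt=json response; the documents are invented.
SAMPLE_RESPONSE = <<~JSON
  {"responseHeader": {"status": 0, "QTime": 3},
   "response": {"numFound": 2, "start": 0,
                "docs": [{"id": "1", "title": "Red roses"},
                         {"id": "2", "title": "White roses"}]}}
JSON

# Extract the total hit count and the title of each returned document.
def titles_from(json_text)
  response = JSON.parse(json_text).fetch('response')
  {
    total:  response.fetch('numFound'),
    titles: response.fetch('docs').map { |doc| doc['title'] }
  }
end

result = titles_from(SAMPLE_RESPONSE)
puts "#{result[:total]} hits"
result[:titles].each { |title| puts title }
```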
  • 43. Migrating document processing
    – Solr lacks a sophisticated pipeline with entity extraction etc. Alternatives:
      • Do extraction in application space (Ruby)
      • Write your own stage in the Solr pipeline for simple cases
      • Integrate to do more advanced stuff
    – Matchers/extractors
      • LingPipe NamedEntityExtractor inside of OpenPipeline
    – Synonyms:
      • Use Solr's synonym handling index/query side
    – Custom stages:
      • Write a Solr UpdateProcessor (in Java, Jython etc)
    – Got a LOT of custom FAST docproc stages?
      • Have a look at SESAT's PY ProcServer for Solr (GPL)
  • 44. Migrating linguistics (lemmatization)
    – Solr ships with stemming instead of lemmatization
    – Stemming has limitations
      • Biler, bilen, bilene -> bil BUT
      • Bøker, bøkene -> bøk; boka, bok -> bok
    – KStem is better. Free with LucidWorks for Solr
    – If you need singular/plural handling only
      • Free dictionaries? Check lucene-hunspell
    – Lemmatization can be licensed from a 3rd party such as Basistech, which also offers language identification & entity extraction
    – Language identification is also available from Sematext
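
The limitation is easy to reproduce with a toy suffix stripper. To be clear, this is NOT the stemmer Solr ships with. It is a deliberately naive illustration with a hypothetical suffix list, chosen only so that it reproduces the Norwegian examples on the slide.

```ruby
# Hypothetical suffix list, longest first so that "ene" wins over "en".
SUFFIXES = %w[ene er en a].freeze

# Strip the first matching suffix, but never leave a stem shorter than 3 characters.
def toy_stem(word)
  suffix = SUFFIXES.find { |s| word.end_with?(s) && word.length - s.length >= 3 }
  suffix ? word[0...-suffix.length] : word
end

%w[biler bilen bilene bøker bøkene boka bok].each do |word|
  puts "#{word} -> #{toy_stem(word)}"
end
```

The forms of "bil" all collapse to one stem, but "bøker" and "bok" end up with different stems because the vowel changes in the plural. A search for "bok" will therefore miss documents that only contain "bøker", which is the case a dictionary-based lemmatizer handles.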
  • 45. Basistech Rosette for Lucene
    – High-end linguistics capabilities for 19 languages
    – Language identification
    – Segmentation and tokenization
    – Lemmatization
    – Noun decompounding
    – Part-of-speech tagging
    – Entity extraction
    – Easily integrated with Lucene/Solr
    – More: http://www.basistech.com/lucene/
  • 46. Migrating search middleware
    – Using FAST Unity?
      • Consider migrating middleware logic such as external source querying and federation to SESAT (AGPL)
    – Using Comperio Front?
      • Ask Comperio for Solr engine support
      • Or migrate custom Q&R formats
    – Or is plain Solr enough?
      • Solr has built-in support for shards
      • A shard query will query multiple shards and merge the results into one
      • Add custom processing as Query Components in Solr
      • Check contrib & patches!
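
A shard query is an ordinary select request with an extra `shards` parameter listing every shard as `host:port/path`, without the `http://` prefix. The node that receives the request fans it out and merges the results. A minimal sketch: the host names `solr1` and `solr2` and the helper name `sharded_select_url` are assumptions for illustration.

```ruby
require 'uri'

# Build a distributed select URL.
# coordinator: the "host:port/path" of the node that receives the request
# shards:      array of "host:port/path" strings, one per shard
def sharded_select_url(coordinator, shards, query)
  params = { 'q' => query, 'shards' => shards.join(',') }
  "http://#{coordinator}/select?#{URI.encode_www_form(params)}"
end

# Hypothetical two-shard setup; any shard can act as coordinator.
shards = ['solr1:8080/solr', 'solr2:8080/solr']
puts sharded_select_url(shards.first, shards, 'oslo')
```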
  • 47. Migrating Front ends
    – Using a middleware with Solr support? Lucky you!
    – If not, consider introducing one now. Look at (Java):
    – If you decide to migrate from FAST Java/.NET APIs
      • Choose SolrJ or SolrNET
      • Query language differences. &fq= instead of filter()
      • Unlike FAST's, Solr facets do not require sessions/state
    – Migrate FAST's «views» into named request handler configs
    – Multilingual: need to handle title_no, title_en etc... :(
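
Where FAST front ends wrap restrictions in `filter()`, a Solr front end sends each restriction as its own `fq` parameter, and because the filters travel with every request no session state is needed. A minimal sketch with the Ruby standard library: the field names `color` and `inStock` and the helper name `filtered_query` are hypothetical.

```ruby
require 'uri'

# Build a query string with one fq= parameter per filter.
# encode_www_form repeats the key for every element of an array value.
def filtered_query(query, filters)
  URI.encode_www_form({ 'q' => query, 'fq' => filters })
end

# Hypothetical facet selections made by the user
query_string = filtered_query('roses', ['color:red', 'inStock:true'])
puts query_string
```

Each click on a facet value simply appends another filter to the list before the next request is built.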
  • 48. Migrating Web Crawler
    – Solr has no built-in web crawler
      • Instead you can choose from several integrations
    – The Apache Nutch crawler
      • Proven with hundreds of millions of pages
      • http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
    – Apache Droids
      • Still in the incubator, but aims at becoming a full crawler
      • http://incubator.apache.org/droids/
    – Heritrix + Solr (example in Solr 1.4 book)
    – OpenPipeline has a (very) simple crawler
    – Lucene Connectors Framework
      • Preparing crawler support
  • 49. Migrating Connectors
    – Solr handles these sources internally through DIH:
      • Database, RSS, Web-services, Local filesystem
    – Additionally through Lucene Connectors Framework:
      • EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
      • New connectors should be written for LCF
    – Another option:
      • SharePoint, IMAP, Documentum, Vignette, Filesystem
  • 50. Operations
    – Solr has no admin-server (coming in 1.5)
    – Possible to run multiple Tomcat instances on the same server
    – Multiple cores in the same Tomcat – easier migration
    – No built-in query reports, use 3rd party tools
    – No built-in monitoring, have a look at
    – Log analysis? Check out
  • 52. Thank You
    – www.cominvent.com
    – [email protected]
    – www.twitter.com/cominvent
    – linkedin.com/in/janhoy
    – This presentation is licensed under a CC-by-sa license. You must attribute Cominvent with name and link