SlideShare a Scribd company logo
Improving the Solr Update Chain
           Jan Høydahl
What will I cover?
Who is Jan Høydahl?
Intro to Solr’s (hidden) UpdateChain
How to write your own UpdateProcessors
Example: Web crawl @ Oslo University
A vision for future improvements
Conclusion




                      2
Improving the Solr Update Chain
Improving the Solr Update Chain
Jan Høydahl
          1995: Developer telecom
          1998: Java developer
          2000: Search - FAST
          2006: Lucene
          2007: Cominvent
          2011: Lucene committer


          > 100 projects



      5
Cominvent AS




Consulting & support       www.solrtraining.com
   Lucene/Solr
       FAST


                       6
Why document processing?
Analysis is Field oriented
Filters only see the “local” field




                      7
Why document processing?
But what if you want to:
  Add or remove fields?
  Make decisions based on other fields?
We need a way to modify the Document




                    8
Why document processing?

Doc1
name
postcode
                         programmer
cv_pdf_url
                        near Barcelona




               9
Why document processing?

Doc1
name
postcode
                          programmer
latlong
                         near Barcelona
cv_pdf_url
cv_text




                10
Why document processing?


   Client
    Doc1
     name
     postcode
     latlong
     cv_pdf_url
     cv_text




                  11
Why document processing?


Client
 Doc1
  name
  postcode     3rd party
  latlong      pipeline
  cv_pdf_url
  cv_text




                     12
Solr’s Update Chain




        13
The Update Chain




        14
The Update Chain




Doc
name
postcode
cv_pdf_url




                     15
The Update Chain
              Postcode
              ToLatLong
              Processor


Doc            Doc
name           name
postcode       postcode
cv_pdf_url     latlong
               cv_pdf_url




                            15
The Update Chain
              Postcode
                             UrlFetcher
              ToLatLong
                             Processor
              Processor


Doc            Doc           Doc
name           name              name
postcode       postcode          postcode
cv_pdf_url     latlong           latlong
               cv_pdf_url        cv_pdf_url
                                 cv_pdf_bin




                            15
The Update Chain
              Postcode                        Tika
                             UrlFetcher
              ToLatLong                       Extracting
                             Processor
              Processor                       Processor


Doc            Doc           Doc              Doc
name           name              name         name
postcode       postcode          postcode     postcode
cv_pdf_url     latlong           latlong      latlong
               cv_pdf_url        cv_pdf_url   cv_pdf_url
                                 cv_pdf_bin   cv_pdf_bin
                                              cv_text



                            15
Improving the Solr Update Chain
Improving the Solr Update Chain
How it’s wired
Chain definition in solrconfig.xml:




Choose chain in your update request:
.../solr/update/xml?..&update.chain=cv-chain



                       17
Other examples




Language Identification
          18
Other examples
Company

       The Apache Software Foundation
       (ASF) is a non-profit corporation to
       support Apache software projects.
       The ASF was formed from the
       Apache Group and incorporated in
       Delaware, U.S., in June 1999.

  Location                                    Date


                  Entity extraction
                          19
Improving the Solr Update Chain
Writing your own processor




            21
Writing your own processor




            21
Writing your own processor




             22
Writing your own processor




             23
Writing your own processor




•Make generic processors - parameterized
•Use SchemaAware, SolrCoreAware and
  ResourceLoaderAware interfaces
•Prefix param names to avoid name clash
•Testing and testable methods
•Donate back to Apache & document on Wiki
                      24
Web crawl with
Language Detection
 @ Oslo University

        25
Solr @ Oslo University




           26
Solr @ Oslo University
     <?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[su00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
                                           27
    <str name="content.fl">content_no,content_en</str>
Solr @ Oslo University
     <?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpd
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProces
                             27
Solr @ Oslo University
     <?xml version="1.0"?>
<updateRequestProcessorChain name="web">
  <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">title,content</str>
      <str name="langid.idField">url</str>
      <str name="langid.fallbackFields"></str>
      <str name="langid.fallback">no</str>
      <str name="langid.whitelist">no,en</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      <bool name="langid.overwrite">true</bool>
      <bool name="langid.map">true</bool>
      <str name="langid.map.fl">title,content</str>
      <double name="langid.threshold">0.5</double>
    </lst>
  </processor>
  <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="fl">content_no content_en</str>
    <str name="pattern">[su00A0]+</str>
    <str name="replacement"> </str>
  </processor>
  <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="title.fl">title_no,title_en</str>
                                           27
    <str name="content.fl">content_no,content_en</str>
</processor>
<processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
  <bool name="enabled">true</bool>

             Solr @ Oslo University
  <str name="fl">content_no content_en</str>
  <str name="pattern">[su00A0]+</str>
  <str name="replacement"> </str>
</processor>
<processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="title.fl">title_no,title_en</str>
  <str name="content.fl">content_no,content_en</str>
  <str name="contentType.field">tika_Content-Type</str>
  <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
</processor>
<processor class="no.uio.update.processor.URLClassifyProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="domainOutputField">host</str>
  <str name="canonicalUrlOutputField">canonicalurl</str>
</processor>
<processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">url</str>
  <str name="boostField">urlboost</str>
  <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
</processor>
<processor class="no.uio.update.processor.StaticRankProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">canonicalurl</str>
  <str name="anchorField">anchortext</str>
  <str name="rankField">docrank</str>
  <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
                                         28
</processor>
<processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
  <bool name="enabled">true</bool>

             Solr @ Oslo University
  <str name="fl">content_no content_en</str>
  <str name="pattern">[su00A0]+</str>
  <str name="replacement"> </str>
</processor>
<processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="title.fl">title_no,title_en</str>
  <str name="content.fl">content_no,content_en</str>
  <str name="contentType.field">tika_Content-Type</str>
  <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
</processor>
<processor class="no.uio.update.processor.URLClassifyProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="domainOutputField">host</str>
  <str name="canonicalUrlOutputField">canonicalurl</str>
</processor>
<processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">url</str>
  <str name="boostField">urlboost</str>
  <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
</processor>
<processor class="no.uio.update.processor.StaticRankProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">canonicalurl</str>
  <str name="anchorField">anchortext</str>
  <str name="rankField">docrank</str>
  <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
                                         28
</processor>
<processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
  <bool name="enabled">true</bool>

             Solr @ Oslo University
  <str name="fl">content_no content_en</str>
  <str name="pattern">[su00A0]+</str>
  <str name="replacement"> </str>
</processor>
<processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="title.fl">title_no,title_en</str>
  <str name="content.fl">content_no,content_en</str>
  <str name="contentType.field">tika_Content-Type</str>
  <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
</processor>
<processor class="no.uio.update.processor.URLClassifyProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="domainOutputField">host</str>
  <str name="canonicalUrlOutputField">canonicalurl</str>
</processor>
<processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">url</str>
  <str name="boostField">urlboost</str>
  <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
</processor>
<processor class="no.uio.update.processor.StaticRankProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">canonicalurl</str>
  <str name="anchorField">anchortext</str>
  <str name="rankField">docrank</str>
  <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
                                         28
</processor>
<processor class="no.uio.update.processor.RegexpReplaceProcessorFactory">
  <bool name="enabled">true</bool>

             Solr @ Oslo University
  <str name="fl">content_no content_en</str>
  <str name="pattern">[su00A0]+</str>
  <str name="replacement"> </str>
</processor>
<processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="title.fl">title_no,title_en</str>
  <str name="content.fl">content_no,content_en</str>
  <str name="contentType.field">tika_Content-Type</str>
  <str name="contentType.regexp">^(text/html|application/xhtml).*</str>
</processor>
<processor class="no.uio.update.processor.URLClassifyProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="domainOutputField">host</str>
  <str name="canonicalUrlOutputField">canonicalurl</str>
</processor>
<processor class="no.uio.update.processor.RegexpBoostProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">url</str>
  <str name="boostField">urlboost</str>
  <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str>
</processor>
<processor class="no.uio.update.processor.StaticRankProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="inputField">canonicalurl</str>
  <str name="anchorField">anchortext</str>
  <str name="rankField">docrank</str>
  <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
                                         28
Donations back to Apache

SOLR-2599: FieldCopyProcessor
SOLR-2825: RegexReplaceProcessor
SOLR-2826: URLClassifyProcessor
SOLR-2827: RegexpBoostProcessor
SOLR-2828: StaticRankProcessor
Binary Document Dumper (?)

Many thanks for the donations!


                       29
Improving the Solr Update Chain
Improving the Solr Update Chain
Room for
improvement?



  32
Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear or sub chains
Does not scale very well
Lack native scripting language support




                   34
Improvements
Pain:
  Potentially expensive initialization
  StaticRankProcessor: read&parse 50.000 lines

Proposed cure:
  Keep persistent state object in factory:
  private final Map<Object,Object> sharedObjCache
  new StaticRankProcessor(params, request,
  response, nextProcessor, sharedObjCache);
  Processor uses sharedObjCache for state



                       35
Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear or sub chains
Does not scale very well
Lack native scripting language support




                    36
Improvements
Pain:
  Multi chains often need identical Processors
  UiO’s two chains share 80% -> copy/paste

Proposed cure:
  Allow sharing of named instances
  Define:
  <processor name="langid" class="..">
  Refer:
  <processor ref="langid" />
  See SOLR-2823

                     37
Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear or sub chains
Does not scale very well
Lack native scripting language support




                   38
Improvements
Pain:
  Chains are linear only
  Hard to do branching, sub chains, conditional...

Proposed cure (SOLR-2841):
  New scriptable Update Chain - alternative to XML
  Script chain logic in solr/conf/updateproc.groovy
  Full flexibility:
  chain myChain {
     if(doc.getFieldValue("type").equals("pdf"))
       process(tikaproc)
   }


                      39
Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub chains
Does not scale very well
Lack native scripting language support




                     40
Improvements
Pain:
  Single threaded
  Heavy processing not efficient

Proposed cure:
  Local: Use multi threaded update requests
  SolrCloud: Dedicated nodes, role=“processor” ?
  Wrap an external pipeline in UpdateProcessor
    Example: OpenPipelineUpdateProcessor ?




                        41
Improvements


Processors re-created for every request
Duplication of config between chains
No support for non-linear chains or sub chains
Does not scale very well
Lack native scripting language support




                     42
Improvements
Pain:
  Not really a “problem” :-)
  Nice to write processors in Python, Groovy, JS...

Proposed cure:
  Now: Finish SOLR-1725: Script based Processor
  Later: Make scripts first-class processors
    <processor script="myScript.py" />
    or
    <processor ref="myScript" />




                      43
One last thing...


       44
New standalone framework?
•The UpdateChain is Solr specific
•Interest for a pure pipeline framework
  •Search engine independent
  •Scalable
  •Rich pool of processors
  •Several existing candidates
•Some initial thoughts:
  http://wiki.apache.org/solr/DocumentProcessing




                          45
Summary


   46
Summary
•Document centric vs field centric processing
•UpdateChain is there - use it!
•Works well for most “light” cases
•Scaling issues, but caching config may help
•More processors welcome!




                       47
Questions?
Jan Høydahl, Cominvent AS
@cominvent
www.cominvent.com
Extra


 49
Alternative pipelines
   OpenPipeline (Dieselpoint)
•OpenPipe (T-Rank, now on GitHub)
•Pypes (ESR)
•UIMA (Apache)
•Eclipse SMILA
•Apache commons pipeline
•Piped (FoundIT, Norway)
•Behemoth (DigitaPebble)
•FindWise and TwigKit also has some technology



                      50
Calling out from UpdateChain
This is one way an
external pipeline
system can be
integrated with Solr.

The main benefit of
such a method is you
can continue to feed
content with SolrJ, DIH
or other Update
Request Handlers.


                          51
Scaling with external pipeline
Here is a more
advanced,
distributed
case, where a
Solr node is
dedicated for
processing, and
the entry point
Solr only
dispatches the
requests.


                  52
Ad

More Related Content

What's hot (20)

Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
gutierrezga00
 
Automating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell ScriptingAutomating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell Scripting
Roy Zimmer
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
CloudxLab
 
Introduction to column oriented databases in PHP
Introduction to column oriented databases in PHPIntroduction to column oriented databases in PHP
Introduction to column oriented databases in PHP
Zend by Rogue Wave Software
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
Sören Auer
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
Farzad Nozarian
 
Scylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the West
ScyllaDB
 
Presto overview
Presto overviewPresto overview
Presto overview
Shixiong Zhu
 
Owl2 rl
Owl2 rlOwl2 rl
Owl2 rl
STIinnsbruck
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
System Programming and Administration
System Programming and AdministrationSystem Programming and Administration
System Programming and Administration
Krasimir Berov (Красимир Беров)
 
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of ViewScylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
ScyllaDB
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex
Espen Brækken
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
andyseaborne
 
A Practical Introduction to Apache Solr
A Practical Introduction to Apache SolrA Practical Introduction to Apache Solr
A Practical Introduction to Apache Solr
Angel Borroy López
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
Craig Warman
 
Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.Solr 4: Run Solr in SolrCloud Mode on your local file system.
Solr 4: Run Solr in SolrCloud Mode on your local file system.
gutierrezga00
 
Automating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell ScriptingAutomating a Vendor File Load Process with Perl and Shell Scripting
Automating a Vendor File Load Process with Perl and Shell Scripting
Roy Zimmer
 
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
CloudxLab
 
Introduction to column oriented databases in PHP
Introduction to column oriented databases in PHPIntroduction to column oriented databases in PHP
Introduction to column oriented databases in PHP
Zend by Rogue Wave Software
 
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational DatabasesWWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
WWW09 - Triplify Light-Weight Linked Data Publication from Relational Databases
Sören Auer
 
Scylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the WestScylla Summit 2017: SMF: The Fastest RPC in the West
Scylla Summit 2017: SMF: The Fastest RPC in the West
ScyllaDB
 
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark Introduction | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of ViewScylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
ScyllaDB
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex Ruby on Rails (RoR) as a back-end processor for Apex
Ruby on Rails (RoR) as a back-end processor for Apex
Espen Brækken
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
andyseaborne
 
A Practical Introduction to Apache Solr
A Practical Introduction to Apache SolrA Practical Introduction to Apache Solr
A Practical Introduction to Apache Solr
Angel Borroy López
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
CloudxLab
 

Viewers also liked (20)

Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Anshum Gupta
 
The Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search: Presented by David SmileyThe Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search: Presented by David Smiley
Lucidworks
 
Intel
IntelIntel
Intel
GAUTAM KUMAR
 
Cpu spec
Cpu specCpu spec
Cpu spec
Sinu Jose
 
Introduction of cpu
Introduction of cpuIntroduction of cpu
Introduction of cpu
Tharindu Darshana
 
Romain Rogister DSP ppt V2003
Romain  Rogister  DSP  ppt V2003Romain  Rogister  DSP  ppt V2003
Romain Rogister DSP ppt V2003
Romain Rogister
 
Central Processing Unit CUP by madridista ujjwal
Central Processing Unit CUP by madridista ujjwalCentral Processing Unit CUP by madridista ujjwal
Central Processing Unit CUP by madridista ujjwal
Ujwal Limbu
 
Data transfer instruction set of 8085 micro processor
Data transfer instruction set of 8085 micro processorData transfer instruction set of 8085 micro processor
Data transfer instruction set of 8085 micro processor
vishalgohel12195
 
Arrandale presentation1
Arrandale presentation1Arrandale presentation1
Arrandale presentation1
Tata Consultancy Services
 
Intel i7
Intel i7Intel i7
Intel i7
Justin k.
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Nitin S
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Election
ravikgiitk
 
Perl Bag of Tricks - Baltimore Perl mongers
Perl Bag of Tricks  -  Baltimore Perl mongersPerl Bag of Tricks  -  Baltimore Perl mongers
Perl Bag of Tricks - Baltimore Perl mongers
brian d foy
 
Apache SolrCloud
Apache SolrCloudApache SolrCloud
Apache SolrCloud
Michał Warecki
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
Sematext Group, Inc.
 
Microprocessors and controllers
Microprocessors and controllersMicroprocessors and controllers
Microprocessors and controllers
Wendy Hemo
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
lucenerevolution
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
thelabdude
 
Cpu scheduling(suresh)
Cpu scheduling(suresh)Cpu scheduling(suresh)
Cpu scheduling(suresh)
Nagarajan
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
Lucidworks
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Anshum Gupta
 
The Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search: Presented by David SmileyThe Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search: Presented by David Smiley
Lucidworks
 
Romain Rogister DSP ppt V2003
Romain  Rogister  DSP  ppt V2003Romain  Rogister  DSP  ppt V2003
Romain Rogister DSP ppt V2003
Romain Rogister
 
Central Processing Unit CUP by madridista ujjwal
Central Processing Unit CUP by madridista ujjwalCentral Processing Unit CUP by madridista ujjwal
Central Processing Unit CUP by madridista ujjwal
Ujwal Limbu
 
Data transfer instruction set of 8085 micro processor
Data transfer instruction set of 8085 micro processorData transfer instruction set of 8085 micro processor
Data transfer instruction set of 8085 micro processor
vishalgohel12195
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Nitin S
 
Solrcloud Leader Election
Solrcloud Leader ElectionSolrcloud Leader Election
Solrcloud Leader Election
ravikgiitk
 
Perl Bag of Tricks - Baltimore Perl mongers
Perl Bag of Tricks  -  Baltimore Perl mongersPerl Bag of Tricks  -  Baltimore Perl mongers
Perl Bag of Tricks - Baltimore Perl mongers
brian d foy
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
Sematext Group, Inc.
 
Microprocessors and controllers
Microprocessors and controllersMicroprocessors and controllers
Microprocessors and controllers
Wendy Hemo
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
lucenerevolution
 
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale ToolkitDeploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
Deploying and managing SolrCloud in the cloud using the Solr Scale Toolkit
thelabdude
 
Cpu scheduling(suresh)
Cpu scheduling(suresh)Cpu scheduling(suresh)
Cpu scheduling(suresh)
Nagarajan
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
Lucidworks
 
Ad

Similar to Improving the Solr Update Chain (20)

Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
Erik Hatcher
 
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
Crossref
 
Organizing the Data Chaos of Scientists
Organizing the Data Chaos of ScientistsOrganizing the Data Chaos of Scientists
Organizing the Data Chaos of Scientists
Andreas Schreiber
 
Python tools for testing web services over HTTP
Python tools for testing web services over HTTPPython tools for testing web services over HTTP
Python tools for testing web services over HTTP
Mykhailo Kolesnyk
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Sourcesense
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
Andreas Schreiber
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015
Chef
 
Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)
ThirdWaveInsights
 
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Lviv Startup Club
 
Innovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesInnovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open Interfaces
Steve Speicher
 
CrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef Workshops
Crossref
 
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Lviv Startup Club
 
Soap Toolkit Dcphp
Soap Toolkit DcphpSoap Toolkit Dcphp
Soap Toolkit Dcphp
Michael Tutty
 
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays
 
Software Engineering - RS4
Software Engineering - RS4Software Engineering - RS4
Software Engineering - RS4
AtakanAral
 
Mufix Network Programming Lecture
Mufix Network Programming LectureMufix Network Programming Lecture
Mufix Network Programming Lecture
SiliconExpert Technologies
 
Hadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptxHadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptx
ShashankSahu34
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
Apex behind the scenes
Apex behind the scenesApex behind the scenes
Apex behind the scenes
Enkitec
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
BG Java EE Course
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
Erik Hatcher
 
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
CrossRef How-to: A Technical Introduction to the Basics of CrossRef, Chuck Ko...
Crossref
 
Organizing the Data Chaos of Scientists
Organizing the Data Chaos of ScientistsOrganizing the Data Chaos of Scientists
Organizing the Data Chaos of Scientists
Andreas Schreiber
 
Python tools for testing web services over HTTP
Python tools for testing web services over HTTPPython tools for testing web services over HTTP
Python tools for testing web services over HTTP
Mykhailo Kolesnyk
 
Dev8d Apache Solr Tutorial
Dev8d Apache Solr TutorialDev8d Apache Solr Tutorial
Dev8d Apache Solr Tutorial
Sourcesense
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
Andreas Schreiber
 
Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015Serve Meals, Not Ingredients - ChefConf 2015
Serve Meals, Not Ingredients - ChefConf 2015
Chef
 
Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)Serve Meals, Not Ingredients (ChefConf 2015)
Serve Meals, Not Ingredients (ChefConf 2015)
ThirdWaveInsights
 
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Lviv Startup Club
 
Innovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open InterfacesInnovate2014 Better Integrations Through Open Interfaces
Innovate2014 Better Integrations Through Open Interfaces
Steve Speicher
 
CrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef WorkshopsCrossRef Technical Basics 2010 CrossRef Workshops
CrossRef Technical Basics 2010 CrossRef Workshops
Crossref
 
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Oleksandr Shcherbak: DLT tool as ingest part of your ETL process (UA)
Lviv Startup Club
 
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays LIVE Helsinki - Implementing OpenAPI and GraphQL Services with gRPC b...
apidays
 
Software Engineering - RS4
Software Engineering - RS4Software Engineering - RS4
Software Engineering - RS4
AtakanAral
 
Hadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptxHadoop_File_Formats_and_Data_Ingestion.pptx
Hadoop_File_Formats_and_Data_Ingestion.pptx
ShashankSahu34
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
Apex behind the scenes
Apex behind the scenesApex behind the scenes
Apex behind the scenes
Enkitec
 
Ad

More from Cominvent AS (9)

Solr's missing plugin ecosystem
Solr's missing plugin ecosystemSolr's missing plugin ecosystem
Solr's missing plugin ecosystem
Cominvent AS
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr Cloud
Cominvent AS
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoy
Cominvent AS
 
Dagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søkDagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søk
Cominvent AS
 
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan HøydahlOslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Cominvent AS
 
Open source breakfast norge findwise
Open source breakfast norge findwiseOpen source breakfast norge findwise
Open source breakfast norge findwise
Cominvent AS
 
Frokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent asFrokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent as
Cominvent AS
 
Migrating Fast to Solr
Migrating Fast to SolrMigrating Fast to Solr
Migrating Fast to Solr
Cominvent AS
 
Cominvent AS company Presentation
Cominvent AS company PresentationCominvent AS company Presentation
Cominvent AS company Presentation
Cominvent AS
 
Solr's missing plugin ecosystem
Solr's missing plugin ecosystemSolr's missing plugin ecosystem
Solr's missing plugin ecosystem
Cominvent AS
 
Scaling search with Solr Cloud
Scaling search with Solr CloudScaling search with Solr Cloud
Scaling search with Solr Cloud
Cominvent AS
 
First oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoyFirst oslo solr community meetup lightning talk janhoy
First oslo solr community meetup lightning talk janhoy
Cominvent AS
 
Dagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søkDagens Næringslivs overgang til Lucene/Solr søk
Dagens Næringslivs overgang til Lucene/Solr søk
Cominvent AS
 
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan HøydahlOslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl
Cominvent AS
 
Open source breakfast norge findwise
Open source breakfast norge findwiseOpen source breakfast norge findwise
Open source breakfast norge findwise
Cominvent AS
 
Frokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent asFrokostseminar mai 2010 solr open source cominvent as
Frokostseminar mai 2010 solr open source cominvent as
Cominvent AS
 
Migrating Fast to Solr
Migrating Fast to SolrMigrating Fast to Solr
Migrating Fast to Solr
Cominvent AS
 
Cominvent AS company Presentation
Cominvent AS company PresentationCominvent AS company Presentation
Cominvent AS company Presentation
Cominvent AS
 

Recently uploaded (20)

Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 

Improving the Solr Update Chain

  • 1. Improving the Solr Update Chain Jan Høydahl
  • 2. What will I cover? Who is Jan Høydahl? Intro to Solr’s (hidden) UpdateChain How to write your own UpdateProcessors Example: Web crawl @ Oslo University A vision for future improvements Conclusion 2
  • 5. Jan Høydahl 1995: Developer telecom 1998: Java developer 2000: Search - FAST 2006: Lucene 2007: Cominvent 2011: Lucene committer > 100 projects 5
  • 6. Cominvent AS Consulting & support www.solrtraining.com Lucene/Solr FAST 6
  • 7. Why document processing? Analysis is Field oriented Filters only see the “local” field 7
  • 8. Why document processing? But what if you want to: Add or remove fields? Make decisions based on other fields? We need a way to modify the Document 8
  • 9. Why document processing? Doc1 name postcode programmer cv_pdf_url near Barcelona 9
  • 10. Why document processing? Doc1 name postcode programmer latlong near Barcelona cv_pdf_url cv_text 10
  • 11. Why document processing? Client Doc1 name postcode latlong cv_pdf_url cv_text 11
  • 12. Why document processing? Client Doc1 name postcode 3rd party latlong pipeline cv_pdf_url cv_text 12
  • 16. The Update Chain Postcode ToLatLong Processor Doc Doc name name postcode postcode cv_pdf_url latlong cv_pdf_url 15
  • 17. The Update Chain Postcode UrlFetcher ToLatLong Processor Processor Doc Doc Doc name name name postcode postcode postcode cv_pdf_url latlong latlong cv_pdf_url cv_pdf_url cv_pdf_bin 15
  • 18. The Update Chain Postcode Tika UrlFetcher ToLatLong Extracting Processor Processor Processor Doc Doc Doc Doc name name name name postcode postcode postcode postcode cv_pdf_url latlong latlong latlong cv_pdf_url cv_pdf_url cv_pdf_url cv_pdf_bin cv_pdf_bin cv_text 15
  • 21. How it’s wired Chain definition in solrconfig.xml: Choose chain in your update request: .../solr/update/xml?..&update.chain=cv-chain 17
  • 23. Other examples Company The Apache Software Foundation (ASF) is a non-profit corporation to support Apache software projects. The ASF was formed from the Apache Group and incorporated in Delaware, U.S., in June 1999. Location Date Entity extraction 19
  • 25. Writing your own processor 21
  • 26. Writing your own processor 21
  • 27. Writing your own processor 22
  • 28. Writing your own processor 23
  • 29. Writing your own processor •Make generic processors - parameterized •Use SchemaAware, SolrCoreAware and ResourceLoaderAware interfaces •Prefix param names to avoid name clash •Testing and testable methods •Donate back to Apache & document on Wiki 24
  • 30. Web crawl with Language Detection @ Oslo University 25
  • 31. Solr @ Oslo University 26
  • 32. Solr @ Oslo University <?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[su00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> 27 <str name="content.fl">content_no,content_en</str>
  • 33. Solr @ Oslo University <?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpd <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProces 27
  • 34. Solr @ Oslo University <?xml version="1.0"?> <updateRequestProcessorChain name="web"> <processor class="solr.update.processor.LanguageIdentifierUpdateProcessorFactory"> <lst name="defaults"> <str name="langid.fl">title,content</str> <str name="langid.idField">url</str> <str name="langid.fallbackFields"></str> <str name="langid.fallback">no</str> <str name="langid.whitelist">no,en</str> <str name="langid.langField">language</str> <str name="langid.langsField">languages</str> <bool name="langid.overwrite">true</bool> <bool name="langid.map">true</bool> <str name="langid.map.fl">title,content</str> <double name="langid.threshold">0.5</double> </lst> </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> <str name="fl">content_no content_en</str> <str name="pattern">[su00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> 27 <str name="content.fl">content_no,content_en</str>
  • 35. </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> Solr @ Oslo University <str name="fl">content_no content_en</str> <str name="pattern">[su00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/> 28
  • 36. </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> Solr @ Oslo University <str name="fl">content_no content_en</str> <str name="pattern">[su00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/> 28
  • 37. </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> Solr @ Oslo University <str name="fl">content_no content_en</str> <str name="pattern">[su00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/> 28
  • 38. </processor> <processor class="no.uio.update.processor.RegexpReplaceProcessorFactory"> <bool name="enabled">true</bool> Solr @ Oslo University <str name="fl">content_no content_en</str> <str name="pattern">[su00A0]+</str> <str name="replacement"> </str> </processor> <processor class="no.uio.update.processor.FixDuplicateContentTitleProcessorFactory"> <bool name="enabled">true</bool> <str name="title.fl">title_no,title_en</str> <str name="content.fl">content_no,content_en</str> <str name="contentType.field">tika_Content-Type</str> <str name="contentType.regexp">^(text/html|application/xhtml).*</str> </processor> <processor class="no.uio.update.processor.URLClassifyProcessorFactory"> <bool name="enabled">true</bool> <str name="domainOutputField">host</str> <str name="canonicalUrlOutputField">canonicalurl</str> </processor> <processor class="no.uio.update.processor.RegexpBoostProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">url</str> <str name="boostField">urlboost</str> <str name="boostFilename">${solr.solr.home}/conf/rank/urlboosts.txt</str> </processor> <processor class="no.uio.update.processor.StaticRankProcessorFactory"> <bool name="enabled">true</bool> <str name="inputField">canonicalurl</str> <str name="anchorField">anchortext</str> <str name="rankField">docrank</str> <str name="rankFilename">${solr.solr.home}/conf/rank/rankinfo.txt</str> </processor> <processor class="solr.LogUpdateProcessorFactory"/> 28
  • 39. Donations back to Apache SOLR-2599: FieldCopyProcessor SOLR-2825: RegexReplaceProcessor SOLR-2826: URLClassifyProcessor SOLR-2827: RegexpBoostProcessor SOLR-2828: StaticRankProcessor Binary Document Dumper (?) Many thanks for the donations! 29
  • 43. Improvements Processors re-created for every request Duplication of config between chains No support for non-linear or sub chains Does not scale very well Lack native scripting language support 34
  • 44. Improvements Pain: Potentially expensive initialization StaticRankProcessor: read&parse 50.000 lines Proposed cure: Keep persistent state object in factory: private final Map<Object,Object> sharedObjCache new StaticRankProcessor(params, request, response, nextProcessor, sharedObjCache); Processor uses sharedObjCache for state 35
  • 45. Improvements Processors re-created for every request Duplication of config between chains No support for non-linear or sub chains Does not scale very well Lack native scripting language support 36
  • 46. Improvements Pain: Multi chains often need identical Processors UiO’s two chains share 80% -> copy/paste Proposed cure: Allow sharing of named instances Define: <processor name="langid" class=".."> Refer: <processor ref="langid" /> See SOLR-2823 37
  • 47. Improvements Processors re-created for every request Duplication of config between chains No support for non-linear or sub chains Does not scale very well Lack native scripting language support 38
  • 48. Improvements Pain: Chains are linear only Hard to do branching, sub chains, conditional... Proposed cure (SOLR-2841): New scriptable Update Chain - alternative to XML Script chain logic in solr/conf/updateproc.groovy Full flexibility: chain myChain { if(doc.getFieldValue("type").equals("pdf")) process(tikaproc) } 39
  • 49. Improvements Processors re-created for every request Duplication of config between chains No support for non-linear chains or sub chains Does not scale very well Lack native scripting language support 40
  • 50. Improvements Pain: Single threaded Heavy processing not efficient Proposed cure: Local: Use multi threaded update requests SolrCloud: Dedicated nodes, role=“processor” ? Wrap an external pipeline in UpdateProcessor Example: OpenPipelineUpdateProcessor ? 41
  • 51. Improvements Processors re-created for every request Duplication of config between chains No support for non-linear chains or sub chains Does not scale very well Lack native scripting language support 42
  • 52. Improvements Pain: Not really a “problem” :-) Nice to write processors in Python, Groovy, JS... Proposed cure: Now: Finish SOLR-1725: Script based Processor Later: Make scripts first-class processors <processor script="myScript.py" /> or <processor ref="myScript" /> 43
  • 54. New standalone framework? •The UpdateChain is Solr specific •Interest for a pure pipeline framework •Search engine independent •Scalable •Rich pool of processors •Several existing candidates •Some initial thoughts: http://wiki.apache.org/solr/DocumentProcessing 45
  • 55. Summary 46
  • 56. Summary •Document centric vs field centric processing •UpdateChain is there - use it! •Works well for most “light” cases •Scaling issues, but caching config may help •More processors welcome! 47
  • 57. Questions? Jan Høydahl, Cominvent AS @cominvent www.cominvent.com
  • 59. Alternative pipelines OpenPipeline (Dieselpoint) •OpenPipe (T-Rank, now on GitHub) •Pypes (ESR) •UIMA (Apache) •Eclipse SMILA •Apache commons pipeline •Piped (FoundIT, Norway) •Behemoth (DigitaPebble) •FindWise and TwigKit also has some technology 50
  • 60. Calling out from UpdateChain This is one way an external pipeline system can be integrated with Solr. The main benefit of such a method is you can continue to feed content with SolrJ, DIH or other Update Request Handlers. 51
  • 61. Scaling with external pipeline Here is a more advanced, distributed case, where a Solr node is dedicated for processing, and the entry point Solr only dispatches the requests. 52

Editor's Notes