Solr for Data Science

Scalable search and analytics in one
Grant Ingersoll, CTO: @gsingers

http://github.com/lucidworks/solr-for-datascience

Solr in a nutshell
Solr is both established & growing
• 8M+ total downloads
• 250,000+ monthly downloads
• Largest community of developers.
• 2500+ open Solr jobs.
• Solr is the most widely used search solution on the planet.
• Solr has tens of thousands of applications in production.
• You use Solr every day.
Lucidworks: unmatched Solr expertise.
• 1/3 of the active committers
• 70% of the open source code is committed
• Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr.

Solr’s Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
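
As a rough illustration of combining full text search with facets, the sketch below only builds a request URL for Solr's standard select handler. The host, the collection name "products" and the field name "category" are assumptions made for this example, not details from the slides.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr with a collection named "products"
# that has a "category" field. Neither name comes from the slides.
SOLR_SELECT = "http://localhost:8983/solr/products/select"

def facet_query_url(text, facet_field, rows=10):
    """Build a /select URL that runs a full-text query and facets on one field."""
    params = [
        ("q", text),                   # full-text query
        ("rows", rows),                # number of documents to return
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # field whose values are counted
        ("wt", "json"),                # ask for JSON output
    ]
    return SOLR_SELECT + "?" + urlencode(params)

url = facet_query_url("wireless headphones", "category")
print(url)
```

Fetching this URL against a running Solr would return the matching documents plus per-category counts, which is what drives guided navigation.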

It is increasingly important to know what is important!
Corollary: The faster you know what is important, the better

SiLK
• Solr - Logstash - Kibana
• http://lucidworks.com/product/integrations/silk/
• Open source at:
  • https://github.com/LucidWorks/banana
  • https://github.com/LucidWorks/solrlogmanager

Feature Selection and Data Reduction
• Feature Selection
  • Analyzers for all types
  • Easily get weights for terms
  • Term Vectors
• Data Reduction
  • Filters
  • Analyzers
  • Data quality tools
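
To make the idea of analyzers and filters as feature selection concrete, here is a minimal pure-Python sketch that imitates an analysis chain (tokenize, lowercase, stop word filter) and then counts terms, similar in spirit to a term vector. It does not call Lucene or Solr; the stop word list and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in"}

def analyze(text):
    """Imitate a simple analysis chain: tokenize, lowercase, drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def term_vector(text):
    """Return a term -> frequency mapping for one document."""
    return Counter(analyze(text))

doc = "The quick brown fox and the lazy dog: the fox is quick."
vector = term_vector(doc)
print(vector.most_common(3))
```

The stop word filter is the data reduction step: it shrinks the feature space before any weights are computed.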

Classification and Clustering
• Quick and dirty:
  • kNN, others
• Carrot^2 integration for search result clustering
• Integration with Mahout
• Lucene provides Bayesian classifiers built on index
• Easily build training and test sets via filter queries
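
One possible way to build training and test sets via filter queries is sketched below. It assumes each document was indexed with a label field and an integer bucket from 0 to 9; the field names "label", "split_bucket" and "text", the collection name "docs" and the host are all hypothetical. The code only builds URLs and does not contact Solr.

```python
from urllib.parse import urlencode

# Assumed endpoint and field names; none of them come from the slides.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"

def split_query_url(label, bucket_range, rows=100):
    """Build a /select URL restricted to one label and one range of buckets."""
    low, high = bucket_range
    params = [
        ("q", "*:*"),                               # match everything
        ("fq", f"label:{label}"),                   # filter query: class label
        ("fq", f"split_bucket:[{low} TO {high}]"),  # filter query: split range
        ("fl", "id,label,text"),                    # fields to return
        ("rows", rows),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)

train_url = split_query_url("spam", (0, 7))  # roughly 80% of documents
test_url = split_query_url("spam", (8, 9))   # roughly 20% of documents
print(train_url)
print(test_url)
```

Because the two bucket ranges do not overlap, the training and test sets stay disjoint as long as the buckets were assigned once at index time.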

Math
• Built in expressions, stats, function queries make custom ranking a snap!
• Search is essentially vector * matrix
• Lucene index is a ranking optimized matrix
• More coming!
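
The "vector * matrix" view of search can be shown with a toy example: a query vector over terms multiplied by a term-document weight matrix gives one score per document. The terms, documents and weights below are made up, and a real Lucene index stores this matrix in inverted, sparse form rather than as nested lists.

```python
# Toy term-document matrix: rows are terms, columns are documents.
# All values are invented for illustration.
terms = ["solr", "search", "spark"]
doc_ids = ["doc1", "doc2", "doc3"]
weights = [
    [2.0, 0.0, 1.0],  # weights of "solr" in doc1..doc3
    [1.0, 1.0, 0.0],  # weights of "search"
    [0.0, 3.0, 1.0],  # weights of "spark"
]

def score(query_terms):
    """Multiply a binary query vector by the matrix and rank the documents."""
    q = [1.0 if t in query_terms else 0.0 for t in terms]
    scores = [
        sum(q[i] * weights[i][j] for i in range(len(terms)))
        for j in range(len(doc_ids))
    ]
    return sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)

ranking = score({"solr", "search"})
print(ranking)
```

Only the rows for the query terms contribute to the product, which is why an inverted index (one posting list per term) is an efficient layout for this computation.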

Signals power relevance
Clicks, tweets, ratings, locations and much more can all be leveraged to provide high quality recommendations to users and deeper insight for data scientists
• Query Modification: Perform real time decision making and routing in order to map a user's intention or enterprise policy
• Result Manipulation: Curate the user experience in your application with artificial result ranking, document injections and obfuscation
• Index Time Enrichment: Increase the findability of documents and records with automatic creation of tags, fields and meta-data
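
As one hypothetical way to turn click signals into a ranking hint, the sketch below aggregates (query, document) click events into per-query click shares that could later be used as boosts. The event data and the function name are invented, and this is not a description of how any Lucidworks product computes its signals.

```python
from collections import defaultdict

# Invented click log: (query text, clicked document id).
clicks = [
    ("ipad", "prod-1"),
    ("ipad", "prod-1"),
    ("ipad", "prod-2"),
    ("case", "prod-3"),
]

def click_boosts(events):
    """Return, per query, each document's share of that query's clicks."""
    counts = defaultdict(lambda: defaultdict(int))
    for query, doc_id in events:
        counts[query][doc_id] += 1
    boosts = {}
    for query, per_doc in counts.items():
        total = sum(per_doc.values())
        boosts[query] = {doc_id: n / total for doc_id, n in per_doc.items()}
    return boosts

boosts = click_boosts(clicks)
print(boosts["ipad"])
```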

Lucidworks Fusion
• http://www.lucidworks.com/products/fusion
• Ships w/ built-in Solr-based Recommender OOTB, but easy to extend
• Demo: eCommerce data set
  • ~1.2M products
  • ~4M clicks

Solr and Your Tools
• Data ingest:
  • JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
  • http://cran.r-project.org/web/packages/solr/index.html, amongst others
• Output formats: JSON, CSV, XML, custom
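
A minimal sketch of JSON ingest and CSV output from Python, using only the standard library. It prepares a POST to Solr's update handler and a select URL with wt=csv, but does not send anything. The host, the collection name "products" and the document fields are assumptions for this example.

```python
import json
from urllib.request import Request

# Assumed base URL; the collection name "products" is hypothetical.
SOLR_BASE = "http://localhost:8983/solr/products"

def build_update_request(docs):
    """Prepare (but do not send) a JSON POST to the update handler."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        SOLR_BASE + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

docs = [{"id": "1", "name": "widget", "price": 9.99}]
request = build_update_request(docs)

# Same data back out as CSV: only the wt parameter changes.
csv_url = SOLR_BASE + "/select?q=*:*&wt=csv"

print(request.get_method(), request.full_url)
print(csv_url)
```

With a running Solr, passing `request` to `urllib.request.urlopen` would index the documents, and the CSV URL could be read directly by tools such as pandas or R.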

for Data Science
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and more
• Advanced Similarity Models
  • Lang. Modeling, Divergence from Random, more
• Easy to plug-in ranking
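
To illustrate the "Vector Space or Probabilistic" choice, the sketch below compares a simplified per-term weight in the style of classic TF-IDF (vector space) with the standard BM25 formula (probabilistic). These are textbook-style simplifications written for this example, not Lucene's similarity classes: the TF-IDF version omits length norms and query normalization, and the default k1 and b values are common choices, not values taken from the slides.

```python
import math

def classic_tfidf(freq, doc_freq, num_docs):
    """Simplified TF-IDF term weight: sqrt(tf) times a smoothed log idf."""
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))
    return tf * idf

def bm25(freq, doc_freq, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term weight with term-frequency saturation and length normalization."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = freq + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1.0) / norm

# Compare how the two weights grow as a term repeats within one document.
for freq in (1, 4, 16, 64):
    print(freq,
          round(classic_tfidf(freq, 10, 1000), 3),
          round(bm25(freq, 10, 1000, 100, 100), 3))
```

The TF-IDF weight keeps growing with the square root of the term frequency, while the BM25 weight levels off toward idf * (k1 + 1), which is one practical difference between the two families of models.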

What’s coming?
• More Facets/Stats
  • Combine pivots, ranges and stats
  • Percentiles via t-digest
  • HyperLogLog
• Deeper Spark integration for Solr
• Custom distributed computation and aggregations/maths
• Advanced schema on read options
• Time series? Trends? Anomaly Detection?
• Learn to rank?

Lucidworks Open Source
• Logstash for Solr:
  • https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
  • https://github.com/LucidWorks/banana
• Effortless AWS deployment and monitoring:
  • http://www.github.com/lucidworks/solr-scale-tk
• Data Quality Toolkit:
  • https://github.com/LucidWorks/data-quality
• Spark Integration
  • https://github.com/LucidWorks/spark-solr

Resources
• This code: http://github.com/lucidworks/solr-for-datascience
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
