Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16

A Practical Data Science Workbench: spark-solr
Jake Mannix
@pbrane
Lead Data Engineer, Lucidworks

$ whoami
Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
Previously:
• Allen Institute for AI: Semantic Search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems
Prehistory:
• other software companies, algebraic topology, particle cosmology

Cold Start
Imagine you jumped into a new Data Lake…

• What is the “Minimum Viable Big Data Science Toolkit”?
• DB? Distributed FS? NoSQL store?
• ML libraries / frameworks (scripting? notebook? REPL?)
• text analysis or graph libraries?
• dataviz package?
• hosting layer (for models and/or POC apps)?

Overview
• Spark and Solr for Data Engineering
• Why Solr?
• Why Spark?
• Example rapid turnaround workflow: Searchhub
• data exploration
• clustering: unsupervised ML
• classification: supervised ML
• recommenders: collaborative filtering + content-based + “mixed-mode”

Practical Data Science with Spark and Solr
Why does Solr need Spark?
Why does Spark need Solr?

Why does Spark need Solr?
A typical Hadoop / Spark data-engineering task starts with some data on HDFS:
$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 63043884 Feb 4 18:22 part-00001.lzo
Now what? What’s in these files?

Solr gives you:
• random access data store
• full-text search
• fast aggregate statistics
• just starting out: no HDFS / S3 necessary!
• world-class multilingual text analytics:
• no more: tokens = str.toLowerCase().split("\\s+")
• relevancy / ranking
• realtime REST service layer / web console

Solr Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Streaming parallel SQL
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance

Why Spark for Solr?
• spark-shell: a Big Data REPL with all your fave JVM libs!
• Build the index in parallel very, very quickly
• Aggregations
• Boosts, stats, iterative global computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, rankers
• General-purpose distributed computation
• Joins with other storage (Cassandra, HDFS, DB, HBase)

Why do data engineering with Solr and Spark?
Solr
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in
Spark
• General purpose batch/streaming compute engine
• Whole collection analysis!
• Fast, large scale iterative algorithms
• Advanced machine learning: MLLib, Mahout, Deep Learning4j
• Lots of integrations with other big data systems
and together: http://github.com/lucidworks/spark-solr

Example workflow: Searchhub
• Free data! ASF mailing-list archives + github + JIRA
• https://github.com/lucidworks/searchhub
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics
• Build a recommender, mix & match with search

Searchhub: Initial Exploration
• Initial exploration of ASF mailing-list archives
• index into Solr: just need to turn your records into JSON
• facet:
• fields with low cardinality or with sensible ranges
• document size histogram
• projects, authors, dates
• find: broken fields, automated content, expected data missing, errors
• now: load into a Spark RDD via SolrRDD
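
The "turn your records into JSON" and "facet" steps above can be sketched roughly as follows. This is a minimal illustration, not the Searchhub code: the record fields (`project`, `author`, `body`) are hypothetical, and nothing is sent to a server. It only builds a JSON array of documents of the kind Solr's update handler accepts, plus the standard query parameters for a field facet over the whole collection.

```python
import json
from urllib.parse import urlencode, parse_qs

# Hypothetical mailing-list records; the field names are illustrative only.
records = [
    {"id": "msg-1", "project": "lucene-solr", "author": "alice",
     "body": "How do I facet on a date field?"},
    {"id": "msg-2", "project": "lucene-solr", "author": "bob",
     "body": "Re: How do I facet on a date field?"},
]

def to_solr_update(docs):
    """Serialize records as a JSON array of documents for indexing."""
    return json.dumps(docs)

def facet_params(field, limit=10):
    """Query string asking for facet counts on one field and no result rows."""
    return urlencode({
        "q": "*:*",            # match every document
        "rows": 0,             # only the facet counts are of interest
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
    })

payload = to_solr_update(records)
query = facet_params("project")
print(query)
```

Faceting on low-cardinality fields such as project or author with `rows=0` is a cheap way to spot the broken fields and missing data mentioned on the slide before loading anything into Spark.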

Smarter Text Analysis in Spark
• try other text analyzers: (no more str.split("\\w+")! )
ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
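
To see why a bare split is a weak tokenizer, here is a small stand-in comparison. It does not use Lucene's analyzers or the LuceneTextAnalyzer mentioned above; the regex tokenizer is only an assumed approximation of what a real analysis chain does (Unicode normalization, lowercasing, punctuation handling), and the sample subject line is made up.

```python
import re
import unicodedata

def naive_tokens(text):
    """Lowercase and split on whitespace, like the slide's 'no more' example."""
    return text.lower().split()

# Runs of word characters, allowing an inner apostrophe (e.g. "solrcloud's").
TOKEN = re.compile(r"\w+(?:['’]\w+)*")

def better_tokens(text):
    """Normalize Unicode, lowercase, and drop surrounding punctuation."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    return TOKEN.findall(normalized)

sample = "Re: SolrCloud's shard-splitting fails (SOLR-1234)!"
print(naive_tokens(sample))
print(better_tokens(sample))
```

The naive version keeps punctuation glued to tokens (`"re:"`, `"(solr-1234)!"`), which fragments term counts; a proper analyzer also handles stemming, stop words, and non-English text, none of which this sketch attempts.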

Searchhub: Exploratory Data Science
• Unsupervised machine learning with MLLib or Mahout:
• clustering documents with KMeans
• extract topics with Latent Dirichlet Allocation
• learn word vectors with Word2Vec
• Write the results back to Solr
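
The cluster-then-write-back loop can be illustrated on a toy scale. The sketch below is plain Lloyd's k-means in pure Python, not MLLib's KMeans, and the two-dimensional vectors stand in for real document features. The `cluster_id` field name is hypothetical; the `{"set": ...}` form is Solr's atomic-update syntax for changing one field of an existing document.

```python
import json

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Hypothetical document ids and toy feature vectors.
doc_ids = ["msg-1", "msg-2", "msg-3", "msg-4"]
vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]

clusters = kmeans(vectors, k=2)

# One atomic update per document: set the (hypothetical) cluster_id field.
updates = [{"id": d, "cluster_id": {"set": c}} for d, c in zip(doc_ids, clusters)]
print(json.dumps(updates))
```

Once the cluster ids are in the index, they become one more low-cardinality field to facet on and filter by.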

Searchhub Classification: “Many Newsgroups”
• can also do something more like real Data Science
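
As a rough illustration of the supervised step, the sketch below trains a tiny multinomial Naive Bayes classifier in pure Python. Two things are assumptions here: that the label is the mailing list a message came from (the slide title suggests this but does not spell it out), and the choice of Naive Bayes itself, which the source does not name. The list names and messages are made up.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count documents and tokens per label for a multinomial Naive Bayes model."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, token_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (mailing list, message text).
training = [
    ("solr-user", "facet query returns wrong counts"),
    ("solr-user", "how to configure solrcloud shards"),
    ("spark-user", "rdd partition shuffle is slow"),
    ("spark-user", "dataframe join runs out of memory"),
]
model = train(training)
print(predict(model, "slow shuffle on rdd join"))
```

On a real corpus the same shape applies at scale: tokenize with a proper analyzer, featurize, train on a labeled split, and evaluate on held-out messages.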

Recommender Systems with Spark and Solr

Spark+Solr RecSys
• Recommender Systems
• content-based:
• mail-thread as “item”, head msgs grouped by replier as “user” profile
• search query of users against items to recommend
• collaborative-filtering:
• users replying to a head msg “rate” it positively
• train a Spark-ML ALS RecSys model
• both can generate item-item similarity models
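
The idea of an item-item similarity model built from reply behavior can be shown without ALS. The sketch below swaps in a much simpler technique, cosine similarity over binary user-reply vectors, purely to illustrate the data shape: replying to a thread counts as an implicit positive rating. The user names and thread ids are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical implicit feedback: user -> set of thread ids they replied to.
replies = {
    "alice": {"t1", "t2"},
    "bob": {"t1", "t2", "t3"},
    "carol": {"t3"},
}

def item_item_cosine(user_items):
    """Cosine similarity between items over binary user-reply vectors."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims = {}
    for a, b in combinations(sorted(item_users), 2):
        overlap = len(item_users[a] & item_users[b])
        if overlap:
            sim = overlap / math.sqrt(len(item_users[a]) * len(item_users[b]))
            sims[(a, b)] = sims[(b, a)] = sim
    return sims

def top_k(sims, item, k=2):
    """The k most similar items to `item`, best first, ties broken by id."""
    neighbors = [(other, s) for (i, other), s in sims.items() if i == item]
    return sorted(neighbors, key=lambda pair: (-pair[1], pair[0]))[:k]

sims = item_item_cosine(replies)
print(top_k(sims, "t1"))
```

A trained ALS model would produce item factors whose pairwise similarities play the same role as `sims` here; either way the output is a top-K neighbor list per item.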

Experimenting with mixed-mode Recommenders
• With top-K closest items by both CF and Content:
• store them back into a Solr collection!
• fetch your (or generic user’s) recent items
• query them:
• “q=(cf:123^1.1 cf:39^2.3 cf:93^0.7)^alpha
(ct:912^2.9 ct:123^1.8 ct:99^2.2)^(1-alpha)”
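
In the query above, `alpha` is a placeholder that has to be replaced by a number before the query is sent. A minimal sketch of a builder follows; the `cf` and `ct` field names and the item ids and boosts come from the slide, while the helper names and the two-decimal rounding are assumptions made for illustration.

```python
def boosted_clause(field, scored_ids):
    """Render (field:id^boost ...) for a list of (id, boost) pairs."""
    terms = " ".join(f"{field}:{item}^{boost}" for item, boost in scored_ids)
    return f"({terms})"

def mixed_query(cf_items, content_items, alpha):
    """Weight the CF clause by alpha and the content clause by 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    cf = boosted_clause("cf", cf_items)
    ct = boosted_clause("ct", content_items)
    return f"{cf}^{alpha:.2f} {ct}^{1 - alpha:.2f}"

q = mixed_query(
    [(123, 1.1), (39, 2.3), (93, 0.7)],    # collaborative-filtering neighbors
    [(912, 2.9), (123, 1.8), (99, 2.2)],   # content-based neighbors
    alpha=0.7,
)
print(q)
```

Sweeping `alpha` from 0 to 1 moves the ranking from purely content-based to purely collaborative, which makes the blend easy to tune per query at search time.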

Resources
• spark-solr: http://github.com/lucidworks/spark-solr
• searchhub: http://github.com/lucidworks/searchhub
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @pbrane
