Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to handle even the most demanding big data search workloads. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter walks through several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark Streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering the basic use cases, Tim digs a little deeper to show how to use MLlib to enrich documents before indexing them in Solr, with examples including sentiment analysis (logistic regression), language detection, topic modeling (LDA), and document classification.
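To make the indexing use case concrete, here is a minimal sketch of pushing a Spark Streaming DStream into SolrCloud using plain SolrJ from inside the job. The talk itself uses the Lucidworks spark-solr connector rather than hand-rolled SolrJ, and the ZooKeeper address (localhost:9983), socket source, collection name (my_collection), and field names below are all placeholder assumptions.

```scala
import java.util.Collections

import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamToSolr {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stream-to-solr").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source: lines of text arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Open one Solr client per partition so the (non-serializable) client never crosses the wire.
      rdd.foreachPartition { partition =>
        val solr = new CloudSolrClient.Builder(
          Collections.singletonList("localhost:9983"),   // placeholder ZooKeeper host
          java.util.Optional.empty[String]()
        ).build()
        try {
          partition.foreach { line =>
            val doc = new SolrInputDocument()
            doc.setField("id", java.util.UUID.randomUUID().toString)
            doc.setField("text_txt", line)
            solr.add("my_collection", doc)               // placeholder collection name
          }
          // Visibility of the new docs is left to Solr's autoCommit settings.
        } finally {
          solr.close()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

For the read direction, the Solr RDD's parallelism rests on Solr's cursorMark deep paging, issued per shard. A standalone SolrJ loop over a single cursor looks roughly like the sketch below (again with a placeholder collection and query); the Solr RDD essentially runs one such loop per SolrCloud shard in parallel Spark tasks.

```scala
import java.util.Collections

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.params.CursorMarkParams

object CursorRead {
  def main(args: Array[String]): Unit = {
    val solr = new CloudSolrClient.Builder(
      Collections.singletonList("localhost:9983"),       // placeholder ZooKeeper host
      java.util.Optional.empty[String]()
    ).build()

    val query = new SolrQuery("*:*")                     // placeholder query
      .setRows(500)
      .addSort("id", SolrQuery.ORDER.asc)                // cursor paging requires a sort on the uniqueKey

    var cursorMark = CursorMarkParams.CURSOR_MARK_START
    var done = false
    while (!done) {
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
      val rsp = solr.query("my_collection", query)       // placeholder collection name
      val it = rsp.getResults.iterator()
      while (it.hasNext) println(it.next().getFieldValue("id"))
      val next = rsp.getNextCursorMark
      done = cursorMark == next                          // the same cursor twice means no more results
      cursorMark = next
    }
    solr.close()
  }
}
```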