Introduction To SolrCloud 
Varun Thacker
Apache Solr has a huge install base and tremendous momentum 
Solr is both established & growing 
• 8M+ total downloads 
• 250,000+ monthly downloads 
• Most widely used search solution on the planet 
• You use Solr every day 
• Solr has tens of thousands of applications in production 
• 2500+ open Solr jobs
Activity Summary 
30 Day summary 
Aug 18 - Sep 17 2014 
• 128 Commits 
• 18 Contributors 
12 Month Summary 
Sep 17, 2013 - Sep 17, 2014 
• 1351 Commits 
• 29 Contributors 
via https://www.openhub.net/p/solr
Solr scalability is unmatched. 
• 10TB+ Index Size 
• 10 Billion+ Documents 
• 100 Million+ Daily Requests
What is Solr? 
• A system built to search text 
• A specialized type of database management 
system 
• A platform to build search applications on 
• Customizable, open source software
Where does Solr fit?
What is SolrCloud? 
A subset of optional features in Solr that enables and 
simplifies horizontally scaling a search index using 
sharding and replication. 
Goals 
scalability, performance, high-availability, 
simplicity, and elasticity
Terminology 
• ZooKeeper: Distributed coordination service that provides centralised 
configuration, cluster state management, and leader election 
• Node: JVM process bound to a specific port on a machine 
• Collection: Search index distributed across multiple nodes with same 
configuration 
• Shard: Logical slice of a collection; each shard has a name, hash range, leader 
and replication factor. Documents are assigned to one and only one shard 
per collection using a hash-based document routing strategy 
• Replica: A copy of a shard in a collection 
• Overseer: A special node that executes cluster administration commands and 
writes updated state to ZooKeeper. It is chosen by leader election and fails over automatically.
Collection == Distributed Index 
• A collection is a distributed index defined by: 
• named configuration stored in ZooKeeper 
• number of shards: documents are distributed across N partitions of the index 
• document routing strategy: how documents get assigned to shards 
• replication factor: how many copies of each document in the collection 
• Collections API: 
• curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=punemeetup&replicationFactor=2&numShards=2&collection.configName=myconf"
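The CREATE call above is just a URL with query parameters, so it can be assembled programmatically. The following is a minimal sketch that only builds the URL and does not contact a server; the helper name `create_collection_url` is hypothetical, and the config name `myconf` is a placeholder for whatever configuration is stored in ZooKeeper.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE URL (hypothetical helper; sends nothing)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,  # placeholder config name
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


url = create_collection_url("http://localhost:8983/solr", "punemeetup", 2, 2, "myconf")
print(url)
```

Building the query string with `urlencode` avoids the line-wrap and quoting mistakes that are easy to make when the command is typed by hand.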
DEMO
Document Routing 
• Each shard covers a hash-range 
• Default: Hash ID into 32-bit integer, map to range 
• leads to balanced (roughly) shards 
• Custom-hashing 
• Tri-level: app!user!doc 
• Implicit: no hash-range set for shards
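The default routing idea (hash the ID to a 32-bit integer, then find the shard whose range covers it) can be sketched as follows. This is an illustration under stated assumptions, not Solr's implementation: CRC32 is used only as a convenient stand-in hash, the ranges are unsigned, and the names `shard_ranges` and `route` are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # size of the 32-bit hash space


def shard_ranges(num_shards):
    """Split [0, 2**32) into num_shards contiguous, near-equal hash ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((f"shard{i + 1}", start, end))
    return ranges


def route(doc_id, ranges):
    """Return the one shard whose range covers the hash of doc_id."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, unsigned 32-bit
    for name, start, end in ranges:
        if start <= h <= end:
            return name
    raise ValueError("hash not covered by any shard range")


ranges = shard_ranges(2)
print(route("doc-42", ranges))
```

Because the ranges are contiguous and non-overlapping, every ID lands in exactly one shard, and a reasonably uniform hash keeps the shards roughly balanced.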
Replication 
• Why replicate? 
• High-availability 
• Load balancing 
• How does it work in SolrCloud? 
• Near-real-time, not master-slave 
• Leader forwards to replicas in parallel, waits for response 
• Error handling during indexing is tricky
Distributed Indexing 
• Get cluster state from ZK 
• Route document directly to leader (hash on doc ID) 
• Persist document on durable storage (tlog) 
• Forward to healthy replicas 
• Acknowledge write success to client
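The indexing steps above can be simulated in a few lines. This is a toy in-memory model, assuming a plain list stands in for the transaction log and a simple modulo on a CRC32 hash stands in for hash-range routing; the classes `Replica` and `Leader` and the function `index_document` are hypothetical names.

```python
import zlib


class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []  # stand-in for the durable transaction log

    def apply(self, doc):
        self.tlog.append(doc)
        return True


class Leader(Replica):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas

    def index(self, doc):
        self.apply(doc)  # persist locally first
        # Forward only to healthy replicas and collect their responses.
        acks = [r.apply(doc) for r in self.replicas if r.healthy]
        return all(acks)  # acknowledgement returned to the client


def index_document(cluster_state, doc):
    """cluster_state maps shard name -> Leader (a stand-in for state read from ZK)."""
    shards = sorted(cluster_state)
    h = zlib.crc32(doc["id"].encode("utf-8"))
    leader = cluster_state[shards[h % len(shards)]]  # simplified routing
    return leader.index(doc)


r1, r2 = Replica("replica1"), Replica("replica2", healthy=False)
leader = Leader("leader1", [r1, r2])
doc = {"id": "doc-1", "title": "hello"}
print(index_document({"shard1": leader}, doc))
```

In this model the unhealthy replica simply misses the update; a real cluster has to bring it back in sync later, which is part of why error handling during indexing is tricky.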
Shard Leader 
• Additional responsibilities during indexing only! 
Not a master node 
• Leader is a replica (handles queries) 
• Accepts update requests for the shard 
• Increments the _version_ on the new or updated 
doc 
• Sends updates (in parallel) to all replicas
Distributed Queries 
• Query client can be ZK aware or just query via a load 
balancer 
• Client can send query to any node in the cluster 
• Controller node distributes the query to a replica for 
each shard to identify documents matching query 
• Controller node merges and sorts the results from the 
previous step and issues a second query to fetch all 
fields for one page of results
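The two-phase query flow described above can be sketched as a scatter-gather over in-memory shards. This is a simplified model: each shard is a dict of ID to document, each document carries a precomputed `score` field (in practice scores depend on the query), and `distributed_query` is a hypothetical name.

```python
import heapq


def distributed_query(shards, matches, rows):
    """Two-phase scatter-gather over a list of shards (each a dict of id -> doc)."""
    # Phase 1: each shard returns only (score, id, shard index) for its top hits.
    candidates = []
    for idx, shard in enumerate(shards):
        hits = [(doc["score"], doc_id, idx)
                for doc_id, doc in shard.items() if matches(doc)]
        candidates.extend(heapq.nlargest(rows, hits))
    # The controller merges the per-shard lists and keeps one page.
    page = heapq.nlargest(rows, candidates)
    # Phase 2: fetch all fields, but only for the documents on that page.
    return [shards[idx][doc_id] for _, doc_id, idx in page]


shards = [
    {"a": {"id": "a", "score": 0.9, "title": "Solr"},
     "b": {"id": "b", "score": 0.2, "title": "Lucene"}},
    {"c": {"id": "c", "score": 0.5, "title": "ZooKeeper"}},
]
print(distributed_query(shards, lambda doc: True, rows=2))
```

Splitting the work this way keeps the first phase cheap, since only IDs and scores cross the network, and full documents are fetched for a single page only.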
Scalability / Stability Highlights 
• All nodes in cluster perform indexing and execute 
queries; no master node 
• Distributed indexing: No SPoF, high throughput via 
direct updates to leaders, automated failover to new 
leader 
• Distributed queries: Add replicas to scale-out qps; 
parallelize complex query computations; fault-tolerance 
• Indexing / queries continue so long as there is 1 healthy 
replica per shard
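The last point can be stated as a one-line check: the collection stays usable as long as every shard still has at least one healthy replica. A minimal sketch, assuming replica health is already known as booleans; the function name `collection_available` is hypothetical.

```python
def collection_available(shard_replicas):
    """shard_replicas maps shard name -> list of booleans (is each replica healthy?)."""
    return all(any(replicas) for replicas in shard_replicas.values())


print(collection_available({"shard1": [True, False], "shard2": [False, True]}))
```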
ZooKeeper 
• Is a very good thing ... clusters are a zoo! 
• Centralized configuration management 
• Cluster state management 
• Leader election (shard leader and overseer) 
• Overseer distributed work queue 
• Live Nodes 
• Ephemeral znodes used to signal a server is gone 
• Needs 3 nodes for quorum in production
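The reason three nodes is the usual minimum follows from majority arithmetic: an ensemble keeps working only while more than half of its members are reachable. A small sketch of that arithmetic, with hypothetical function names:

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many members can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

A 3-node ensemble tolerates one failure, while 1 or 2 nodes tolerate none. Four nodes tolerate no more failures than three, which is why odd ensemble sizes are preferred.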
ZooKeeper: State Management 
• Keep track of live nodes in the /live_nodes znode 
• ephemeral nodes 
• ZooKeeper client timeout 
• Collection metadata and replica state in /clusterstate.json 
• Every core has watchers for /live_nodes and /clusterstate.json 
• Leader election 
• ZooKeeper sequence number on ephemeral znodes
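Leader election with sequence numbers on ephemeral znodes can be illustrated with a small in-memory simulation. This sketch does not talk to ZooKeeper: it assumes that each candidate gets an increasing sequence number when it joins, that its entry disappears when its session expires, and that the lowest surviving sequence number is the leader. The `Election` class and the node names are hypothetical.

```python
class Election:
    def __init__(self):
        self.next_seq = 0
        self.znodes = {}  # sequence number -> node name (ephemeral entries)

    def join(self, node_name):
        """Register a candidate under the next sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.znodes[seq] = node_name
        return seq

    def session_expired(self, node_name):
        """Ephemeral entries vanish when the owning session ends."""
        self.znodes = {s: n for s, n in self.znodes.items() if n != node_name}

    def leader(self):
        """The candidate holding the lowest sequence number leads."""
        return self.znodes[min(self.znodes)] if self.znodes else None


election = Election()
for name in ("node1", "node2", "node3"):
    election.join(name)
print(election.leader())
election.session_expired("node1")
print(election.leader())
```

A node that rejoins after a failure receives a new, higher sequence number, so it queues up behind the current leader instead of displacing it.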
Other Features/Highlights 
• Near-Real-Time Search: Documents are visible within a second or so after 
being indexed 
• Partial Document Update: Just update the fields you need to change on 
existing documents 
• Optimistic Locking: Ensure updates are applied to the correct version of a 
document 
• Transaction log: Better recoverability; peer-sync between nodes after 
hiccups 
• HTTPS 
• Use HDFS for storing indexes 
• Use MapReduce for building index (SOLR-1301)
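Partial updates and optimistic locking work together: a client sends only the changed fields along with the version it last saw, and the update is rejected if the document has changed since. The following toy store sketches that behaviour. It is an assumption-laden model: versions here are a simple counter rather than Solr's actual version values, and `Store` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    pass


class Store:
    def __init__(self):
        self.docs = {}
        self.clock = 0  # simple counter standing in for real version values

    def _next_version(self):
        self.clock += 1
        return self.clock

    def add(self, doc):
        stored = dict(doc, _version_=self._next_version())
        self.docs[doc["id"]] = stored
        return stored["_version_"]

    def update(self, doc_id, fields, expected_version):
        current = self.docs[doc_id]
        if current["_version_"] != expected_version:
            raise VersionConflict(f"{doc_id}: expected {expected_version}, "
                                  f"found {current['_version_']}")
        current.update(fields)  # partial update: only the given fields change
        current["_version_"] = self._next_version()
        return current["_version_"]


store = Store()
v1 = store.add({"id": "1", "title": "Solr", "price": 10})
v2 = store.update("1", {"price": 12}, expected_version=v1)
print(store.docs["1"])
```

A second writer that still holds `v1` would now get a `VersionConflict` and has to re-read the document before retrying.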
Solr on YARN 
• Run the SolrClient application: 
• Allocate container to run SolrMaster 
• SolrMaster requests containers to run SolrCloud 
nodes 
• Solr containers allocated across cluster 
• SolrCloud node connects to ZooKeeper
More Information on Solr on YARN 
• https://lucidworks.com/blog/solr-yarn/ 
• https://github.com/LucidWorks/yarn-proto 
• https://issues.apache.org/jira/browse/SOLR-6743
Our users are also pushing the limits 
https://twitter.com/bretthoerner/status/476830302430437376
Up, up and away! 
https://twitter.com/bretthoerner/status/476838275106091008
Connect @ 
https://twitter.com/varunthacker 
http://in.linkedin.com/in/varunthacker 
varun.thacker@lucidworks.com