Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Shah, Bloomreach

O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A

Rebalance API for SolrCloud
Nitin Sharma - Senior Software Engineer, Netﬂix
Suruchi Shah - Software Engineer, Bloomreach

3
Agenda
➢ Motivation
➢ Introduction to Rebalance API
➢ Scaling Scenarios
○  Query Performance
○  Redistribution of Data
○  Removing/Replacing Nodes
○  Indexing Performance
➢ Allocation Strategies
➢ Open Source
➢ Summary

4
Motivation
●  Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter
search architecture based on solrcloud
●  Scaling issues related to Query Performance:
■  Inability to auto scale Solr serving with increasing index sizes
■  Dynamic Shard Setup based on index size
●  SLA issues due to Availability:
■  Nightmare to manipulate cluster/collection setup with expanding clusters and frequent node replacements (AWS
issues)
■  Flexible Replica Allocation based on custom strategies to guarantee 99.995% availability
●  Data Freshness (aka Indexing Performance) hiccups:
■  Break the tight coupling between Indexing and Serving latency (Tp 95) SLAs
●  Generic framework for Solr SLA management that can be open-sourced

5
Rebalance API
●  Fine-grained SLA management in Solr
●  Smarter Index, Cluster & Data Management for SolrCloud
●  Forms the basis for Solr Auto Scale
●  Admin handler in Solr
●  2 levels of abstraction
●  Scaling Strategies
○  Aid with shard, replica and cluster manipulation
techniques to guarantee SLA
○  Zero Downtime operations
○  Tunable for Availability, Performance or Cost
●  Allocation Strategies
○  Decides Core Placement
Rebalance API
Scaling Strategies Allocation Strategies
Auto Shard
Redistribute
Smart Merge
Replace
Scale Up
Scale Down
Least Used
Unused
AZ Aware

6
Agenda
➢ Motivation
➢ Open Source
➢ Summary

7
01
Query Performance Issues
●  Indexing doubles documents - Per
shard latency goes up
●  Tp 95 shoots up
●  No way to change shards
dynamically
●  Delete, Recreate, Re-index
●  Availability goes down
Node 1
Zk Ensemble
Node 2
Solr Collection A
Indexing 2x documents
Shard 1 Shard 2
Query Tp95
Shard 3 Shard 4

8
03
Rebalance Auto Shard
●  Re-sharding existing collection to any number of destination shards. (e.g can help with reducing latency)
●  Includes re-distributing the index and conﬁgs consistently.
●  Zero downtime - No query failures
●  Avoiding any heavy re-indexing processes.
●  Can be size based as well
●  Sample API Call:
/admin/collections?action=REBALANCE &scaling_strategy=AUTO_SHARD
&collection=collection_name&num_shards=number of shards &size_cap=1G
&allocation_strategy=least_used

9
Solr Collection A
Node 1
Zk Ensemble
Node 2
Shard 1 Shard 2
Rebalance Auto Shard - Overview
Shard 3 Shard 4

10
01
Auto Shard (1) - Internals - Simple Strategy
●  Merge documents from all shards -
Lucene library based
●  Split the merged shard into desired
number
●  Auto Zk update for shard ranges
●  Heavyweight but works
●  Based on size, could take in 20-30 of
minutes to complete
Solr Collection A
Shard 1 Shard 2
Merged Shard
(Temp)
Shard 1 Shard 2 Shard 3 Shard 4
Solr Collection A’
merge
Even Split

11
01
Auto Shard (2) - Internals - Smart Split Strategy
●  Identify minimum number of splits
●  Split shards in parallel to required desired
setup
●  Relatively high performance
○  2 Tb index from 2 to 4 shards - 2.5 mins
●  Auto Zk Update
Solr Collection A
Shard 1 Shard 2
Shard
1_1
Shard
1_2
Shard
2_1
Shard
2_2
Solr Collection A
Solr Collection A
Smart Split Renamed Shards

12
01
Solution : Auto Sharding
●  Dynamically Increase/Decrease shards
●  E.g. Increase shards from 2 to 4 to
reduce latency
●  E.g. Tp 95 reduced from > 1 sec to 250
ms.
Solr Collection A
Node 1
Zk Ensemble
Node 2
Shard 1 Shard 2
Shard 3 Shard 4

13
03
Agenda
➢ Motivation
○  Data Consistency
➢ Open Source
➢ Summary

14
01
Data distribution issues
●  Adding a new node - Does nothing
●  No Redistribution of solr cores
●  Machines running out of disk space -
heavier collections need to be moved out
●  Problem ampliﬁes at large scale - 100s of
nodes, 1000s of collections
●  Manual Management of core placement
becomes an issue
Default Solr Behavior
Node
1
Zk Ensemble
Node
2
Node
3
Core
A2
Core
B1
Core
D2
Core
D1
Core
A1
Core
B2
Core
C1
Core
C2
Node
4

15
01
Re-distribute Strategy - Internals
●  Internal topology construction from ZK
●  Desired Core placement computation - External
or Trigger based
●  Migration of cores within the cluster
●  Knobs to control min/max to reduce cluster
load
●  Zero downtime - No query failures and
resiliency to node failures
●  API Call: /admin/collections?
action=REBALANCE&scaling_strategy=REDIST
RIBUTE
Compute &
Redistribute
Node
1
Zk Ensemble
Node
2
Node
3
Core
A2
Core
B1
Core
D1
Core
A1
Core
B2
Core
C1
Node
1
Zk Ensemble
Node
2
Node
3
Core
A2
Core
B1
Core
D1
Core
A1
Core
B2
Core
C1

16
01
Solution: Auto Redistribution of Data
Redistribute
●  Adding new node -
triggers redistribution
●  Respects the core
placement allocation
strategy
●  Zero downtime
Node
1
Zk Ensemble
Node
2
Node
3
Core
A2
Core
B1
Core
D2
Core
D1
Core
A1
Core
B2
Core
C1
Core
C2
Node
4
Node
1
Zk Ensemble
Node
2
Node
3
Core
A2
Core
B1
Core
D2
Core
D1
Core
A1
Core
B2
Core
C1
Core
C2
Node
4

17
03
Agenda
➢ Motivation
➢ Open Source
➢ Summary

18
01
Replace Solr Nodes
Default Solr Behavior
● A node might die, need to be replaced,
decommissioned
● Default behavior - Do nothing
● Can cause downtime - Heavy cores on
the nodes
● Problem exacerbated with 1000s of
nodes/collections
Node
1
ZkEnsemble
Node
2
Node
3
Core
A2
Core
B1
Core
D1
Core
A1
Core
B2
Core
C1

19
01
Replace Nodes with Rebalance
●  Read the Topology of cluster
●  Migrate replicas from node about to die to
new node
●  API Call:
○  /admin/collections?
action=REBALANCE&scaling_strategy=REPLACE
&collection=collectionName&source_node=so
urce_host &dest_node=dest_host
Node
1
ZkEnsemble
Node
2
Node
3
Core
A2
Core
B1
Core
D1
Core
A1
Core
B2
Core
C1
Node
4
Core
B1
Core
A1
Replaced
Node

20
03
Agenda
➢ Motivation
➢ Open Source
➢ Summary

21
01
Indexing Performance
●  Higher the shards, faster the indexing
(parallelism)
●  Faster indexing - Data Freshness SLA
●  Solr - # shards is the same for indexing
vs serving
●  Shard setup - tweaked for serving
query performance
●  E.g.
○  Indexing 100M docs in 2 shards - 2
hours
○  Serving 100M docs in 2 shards - Tp
95 < 100 ms
Indexing
500M Documents
Performance Hit
Serving Queries

22
01
Indexing Performance - Smart Merge
●  Separate Indexing shard setup vs serving
●  More shards for indexing - Merged into lesser shards for serving
●  Post Indexing issue API to merge into serving collection
●  API Call:
○  /admin/collections?
action=Rebalance&scaling_strategy=SMART_MERGE_DISTRIBUTED&collection=collecti
onName&num_shards=numRequiredShards
●  Parallel Merge

23
01
Indexing Performance - Smart Merge
● Index vs Serving has
different collections
● Indexed Collection
merged into Serving -
Using smart merge call
● Indexing can be tuned
independently for
performance
● Serving SLA unaffected
Indexing
500M Documents
Serving Queries
Shard
1
Shard
4
Shard
5
Shard
8
Shard
9
Shard
12
Shard
13
Shard
16… … … …
Collection A_Indexing
Collection A
Parallel merge Parallel merge Parallel merge Parallel merge

24
Rebalance API
●  Fine-grained SLA management in Solr
●  Smarter Index, Cluster & Data Management for SolrCloud
●  Admin handler in Solr
●  2 levels of abstraction
●  Scaling Strategies
○  Aid with shard, replica and cluster manipulation
techniques to guarantee SLA
○  Zero Downtime operations
●  Allocation Strategies
○  Decides Core Placement
Rebalance API
Scaling Strategies Allocation Strategies
Auto Shard
Redistribute
Smart Merge
Replace
Scale Up
Scale Down
Least Used
Unused
AZ Aware

25
01
Allocation Strategies
●  Abstracts out the core placement methodology
●  Least Used Strategy - Pick the node that has the least amount of cores
●  AZ aware Strategy - Pick the node that is in a different availability zone than the other
cores for a given collection
●  Unused Strategy - Pick the node that does not have any cores for a given collection
●  All of them are compatible with all scaling strategies

26
01
Open Source
●  Fully open sourced - SOLR-9241
(4.6.1).
●  Contributed patch works on top of
4.6.1 and tested up to 4.10
●  SOLR-9241 (epic) - patches/features
on master. Has sub patches
○  SOLR-93{16-21}
○  SOLR-9407
●  Actively working with community to
get the rest of the API 6+ compatible.

27
01
Summary
●  Harder to scale & guarantee SLA (Availability, Query Performance, Data Freshness ) for a multi-tenant, cross datacenter search
architecture based on solrcloud
●  Rebalance API
○  Scaling Strategies - How to scale?
○  Allocation Strategies - Where to place cores?
●  Zero Downtime operations for
○  Dynamically changing shard setup
○  Decoupling indexing SLA from Serving
○  Replacing Nodes
○  Auto -Redistributing data with cluster expansion
●  Open Source

28
01
Speakers
Nitin Sharma
https://www.linkedin.com/in/knitinsharma
nsarma1985@gmail.com
Suruchi Shah
https://www.linkedin.com/in/suruchishah
suruchi.shah13@gmail.com

Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Shah, Bloomreach

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Shah, Bloomreach (20)

More from Lucidworks (20)

Recently uploaded (20)

Rebalance API for SolrCloud: Presented by Nitin Sharma, Netflix & Suruchi Shah, Bloomreach