OCTOBER 11-14, 2016 • BOSTON, MA
Rebalance API for SolrCloud
Nitin Sharma - Senior Software Engineer, Netflix
Suruchi Shah - Software Engineer, Bloomreach
Agenda
➢ Motivation
➢ Introduction to Rebalance API
➢ Scaling Scenarios
○  Query Performance
○  Redistribution of Data
○  Removing/Replacing Nodes
○  Indexing Performance
➢ Allocation Strategies
➢ Open Source
➢ Summary
Motivation
●  It is hard to scale and guarantee SLAs (availability, query performance, data freshness) for a multi-tenant, cross-datacenter search architecture based on SolrCloud
●  Scaling issues related to query performance:
■  Inability to auto-scale Solr serving as index sizes grow
■  Need for a dynamic shard setup based on index size
●  SLA issues related to availability:
■  Manipulating the cluster/collection setup is a nightmare with expanding clusters and frequent node replacements (AWS issues)
■  Need for flexible replica allocation based on custom strategies to guarantee 99.995% availability
●  Data freshness (aka indexing performance) hiccups:
■  Need to break the tight coupling between indexing SLAs and serving-latency (Tp95) SLAs
●  Generic framework for Solr SLA management that can be open-sourced
Rebalance API
●  Fine-grained SLA management in Solr
●  Smarter Index, Cluster & Data Management for SolrCloud
●  Forms the basis for Solr Auto Scale
●  Admin handler in Solr
●  2 levels of abstraction
●  Scaling Strategies
○  Aid with shard, replica and cluster manipulation techniques to guarantee SLAs
○  Zero Downtime operations
○  Tunable for Availability, Performance or Cost
●  Allocation Strategies
○  Decides Core Placement
○  Tunable for Availability, Performance or Cost
[Diagram: Rebalance API splits into Scaling Strategies (Auto Shard, Redistribute, Smart Merge, Replace, Scale Up, Scale Down) and Allocation Strategies (Least Used, Unused, AZ Aware)]
Query Performance Issues
●  Indexing doubles the number of documents, so per-shard latency goes up
●  Tp95 query latency shoots up
●  No way to change the number of shards dynamically
●  Only option: delete, recreate and re-index
●  Availability goes down
[Diagram: Solr Collection A's shards on Node 1 and Node 2 with a ZooKeeper ensemble; indexing 2x documents drives up query Tp95]
Rebalance Auto Shard
●  Re-shards an existing collection into any number of destination shards (e.g., to reduce latency)
●  Redistributes the index and configs consistently
●  Zero downtime: no query failures
●  Avoids heavy re-indexing processes
●  Can also be size-based (size_cap)
●  Sample API call:
/admin/collections?action=REBALANCE&scaling_strategy=AUTO_SHARD&collection=collection_name&num_shards=<number_of_shards>&size_cap=1G&allocation_strategy=least_used
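The sample call can also be assembled programmatically. Below is a minimal Python sketch. It assumes a Solr build with the Rebalance patch installed, because stock Solr has no `REBALANCE` action. The base URL `http://localhost:8983/solr` and the helper name `rebalance_url` are hypothetical. Only the parameter names come from the slide.

```python
from urllib.parse import urlencode


def rebalance_url(base_url, scaling_strategy, **params):
    """Build a REBALANCE request URL (hypothetical helper; parameter names from the slide)."""
    query = {"action": "REBALANCE", "scaling_strategy": scaling_strategy}
    query.update(params)  # e.g. collection, num_shards, size_cap, allocation_strategy
    # urlencode percent-encodes the values and keeps the dict's insertion order
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(query)}"


url = rebalance_url(
    "http://localhost:8983/solr",
    "AUTO_SHARD",
    collection="collection_name",
    num_shards=4,
    size_cap="1G",
    allocation_strategy="least_used",
)
print(url)
```

The request could then be sent with `urllib.request.urlopen(url)`. The sketch does not parse the response, because the slide does not describe its format.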
Rebalance Auto Shard - Overview
[Diagram: Solr Collection A on Node 1 and Node 2 with a ZooKeeper ensemble, resharded from Shards 1-2 to Shards 1-4]
Auto Shard (1) - Internals - Simple Strategy
●  Merge the documents from all shards into one temporary shard, using the Lucene library
●  Split the merged shard evenly into the desired number of shards
●  Shard ranges are updated in ZooKeeper automatically
●  Heavyweight, but works
●  Depending on index size, can take 20-30 minutes to complete
[Diagram: Solr Collection A (Shard 1, Shard 2) is merged into a temporary Merged Shard, which is evenly split into Solr Collection A’ (Shards 1-4)]
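The "even split" step can be illustrated with SolrCloud's hash-range routing. Each shard owns a contiguous slice of the 32-bit hash space, and ZooKeeper stores that slice as a hex range. The minimal sketch below assumes the split divides the full signed 32-bit range into equal slices. The helper names are hypothetical, and this is not the patch's code. For four shards it gives the same ranges SolrCloud assigns by default. Solr's own partitioning also rounds slice boundaries, so other shard counts can differ slightly.

```python
INT_MIN, INT_MAX = -(2**31), 2**31 - 1  # signed 32-bit hash space


def even_ranges(num_shards, lo=INT_MIN, hi=INT_MAX):
    """Split the inclusive hash range [lo, hi] into num_shards contiguous slices."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    step, extra = divmod(hi - lo + 1, num_shards)
    ranges, start = [], lo
    for i in range(num_shards):
        size = step + (1 if i < extra else 0)  # spread any remainder over the first slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def to_solr_hex(r):
    """Format a range as unsigned hex, e.g. "80000000-bfffffff"."""
    return "%s-%s" % (format(r[0] & 0xFFFFFFFF, "x"), format(r[1] & 0xFFFFFFFF, "x"))


print([to_solr_hex(r) for r in even_ranges(4)])
```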
Auto Shard (2) - Internals - Smart Split Strategy
●  Identify the minimum number of splits
●  Split shards in parallel to reach the desired setup
●  Relatively high performance
○  2 TB index from 2 to 4 shards: 2.5 minutes
●  Automatic ZooKeeper update
[Diagram: Solr Collection A (Shard 1, Shard 2) is smart-split into Shards 1_1, 1_2, 2_1 and 2_2, which are then renamed to Shards 1-4]
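The split-and-rename plan for the 2-to-4 case can be sketched as follows. The sketch makes three assumptions. The target shard count is a multiple of the current one. Shards are named `shardN`. Sub-shards follow the slide's `_1`/`_2` naming, whereas stock Solr `SPLITSHARD` uses `_0`/`_1` suffixes. `smart_split_plan` is a hypothetical helper, not the patch's code.

```python
def smart_split_plan(current_shards, target_shards):
    """Hypothetical planner: split every shard into the same number of parts, then rename.

    Returns (splits, renames): splits maps each parent shard to its sub-shards,
    renames maps each sub-shard to its final sequential shard name.
    """
    if target_shards < current_shards or target_shards % current_shards != 0:
        raise ValueError("this sketch only handles targets that are a multiple of the current count")
    factor = target_shards // current_shards
    if factor == 1:
        return {}, {}  # already at the target: nothing to split
    splits, renames = {}, {}
    next_id = 1
    for s in range(1, current_shards + 1):
        parent = f"shard{s}"
        # One split per existing shard; these splits are independent and can run in parallel
        children = [f"{parent}_{k}" for k in range(1, factor + 1)]
        splits[parent] = children
        for child in children:
            renames[child] = f"shard{next_id}"
            next_id += 1
    return splits, renames


splits, renames = smart_split_plan(2, 4)
print(splits)
print(renames)
```

Renaming sequentially keeps the final shards in hash-range order, because each parent's sub-shards cover consecutive slices of its range.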
Solution: Auto Sharding
●  Dynamically increase/decrease the number of shards
●  E.g., increase shards from 2 to 4 to reduce latency
●  E.g., Tp95 reduced from over 1 s to 250 ms
[Diagram: Solr Collection A with Shards 1-4 on Node 1 and Node 2, with a ZooKeeper ensemble]
Data distribution issues
●  Adding a new node does nothing
●  Solr cores are not redistributed
●  Machines run out of disk space, and heavier collections need to be moved out
●  The problem amplifies at large scale: 100s of nodes, 1000s of collections
●  Managing core placement manually becomes an issue
Default Solr Behavior
[Diagram: cores A1, A2, B1, B2, C1, C2, D1 and D2 placed on Nodes 1-3 with a ZooKeeper ensemble; the newly added Node 4 receives no cores]
Re-distribute Strategy - Internals
●  Builds an internal view of the cluster topology from ZooKeeper
●  Computes the desired core placement, either supplied externally or trigger-based
●  Migrates cores within the cluster
●  Min/max knobs to limit cluster load
●  Zero downtime: no query failures, and resilient to node failures
●  API call: /admin/collections?action=REBALANCE&scaling_strategy=REDISTRIBUTE
[Diagram: Compute & Redistribute: cores A1, A2, B1, B2, C1 and D1 on Nodes 1-3 (with a ZooKeeper ensemble), shown before and after redistribution]
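The placement computation could resemble the greedy balancer below. This is a swapped-in technique, not the patch's algorithm. It repeatedly moves one core from the most-loaded node to the least-loaded node until core counts differ by at most one. `max_moves` is a hypothetical stand-in for the min/max knobs. The sketch also ignores the allocation strategy and replica-placement constraints, which a real implementation would have to respect.

```python
def plan_moves(placement, max_moves=None):
    """Greedy balancing sketch: returns a list of (core, from_node, to_node) moves."""
    nodes = {node: sorted(cores) for node, cores in placement.items()}  # work on a copy
    moves = []
    while max_moves is None or len(moves) < max_moves:
        # Ties are broken by node name so the plan is deterministic
        busiest = max(nodes, key=lambda n: (len(nodes[n]), n))
        idlest = min(nodes, key=lambda n: (len(nodes[n]), n))
        if len(nodes[busiest]) - len(nodes[idlest]) <= 1:
            break  # balanced to within one core
        core = nodes[busiest].pop()  # move the last core in sorted order
        nodes[idlest].append(core)
        moves.append((core, busiest, idlest))
    return moves


# Illustrative cluster (core-to-node assignment is hypothetical) after adding an empty node4
placement = {
    "node1": ["A2", "B1", "D2"],
    "node2": ["D1", "A1", "B2"],
    "node3": ["C1", "C2"],
    "node4": [],
}
print(plan_moves(placement))
```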
Solution: Auto Redistribution of Data
●  Adding a new node triggers redistribution
●  Respects the core placement allocation strategy
●  Zero downtime
[Diagram: cores A1, A2, B1, B2, C1, C2, D1 and D2 on Nodes 1-3 plus the newly added Node 4, with a ZooKeeper ensemble, before and after redistribution]
Replace Solr Nodes
Default Solr Behavior
● A node might die, or need to be replaced or decommissioned
● Default behavior: do nothing
● Can cause downtime when the node hosts heavy cores
● Problem exacerbated with 1000s of nodes/collections
[Diagram: cores A1, A2, B1, B2, C1 and D1 on Nodes 1-3 with a ZooKeeper ensemble]
Replace Nodes with Rebalance
●  Read the topology of the cluster
●  Migrate replicas from the node about to die to a new node
●  Zero downtime
●  API call:
○  /admin/collections?action=REBALANCE&scaling_strategy=REPLACE&collection=collectionName&source_node=source_host&dest_node=dest_host
[Diagram: Nodes 1-3 with a ZooKeeper ensemble; cores B1 and A1 are migrated to the new Node 4, which becomes the replaced node]
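Stock SolrCloud can approximate the same migration with the Collections API's `ADDREPLICA` and `DELETEREPLICA` actions. Both are available from Solr 4.8 on, which is newer than the 4.6.1 base the patch targets. The sketch below uses that swapped-in approach, not the patch's internal mechanism. It assumes the replica list (shard, `core_node` name, node) has already been read from the cluster state. `replace_node_requests` is a hypothetical helper. It only builds request paths. A real tool must wait until each new replica is active before it issues the deletes.

```python
from urllib.parse import urlencode


def replace_node_requests(replicas, collection, source_node, dest_node):
    """Build ADDREPLICA/DELETEREPLICA request paths that move every replica
    of `collection` off `source_node` onto `dest_node`.

    `replicas` is a list of dicts with "shard", "replica" (core_node name)
    and "node" keys, e.g. as read from the cluster state.
    """
    adds, deletes = [], []
    for r in replicas:
        if r["node"] != source_node:
            continue
        # Create a copy of the shard on the new node ...
        adds.append("/admin/collections?" + urlencode({
            "action": "ADDREPLICA", "collection": collection,
            "shard": r["shard"], "node": dest_node,
        }))
        # ... and afterwards drop the replica on the outgoing node
        deletes.append("/admin/collections?" + urlencode({
            "action": "DELETEREPLICA", "collection": collection,
            "shard": r["shard"], "replica": r["replica"],
        }))
    # All adds come first, so the old replicas keep serving until the copies exist
    return adds + deletes


replicas = [
    {"shard": "shard1", "replica": "core_node1", "node": "node1"},
    {"shard": "shard2", "replica": "core_node2", "node": "node2"},
    {"shard": "shard2", "replica": "core_node3", "node": "node1"},
]
for path in replace_node_requests(replicas, "collectionA", "node1", "node4"):
    print(path)
```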
Indexing Performance
●  The more shards, the faster the indexing (parallelism)
●  Faster indexing helps meet the data freshness SLA
●  In Solr, the number of shards is the same for indexing and serving
●  The shard setup is tuned for serving query performance
●  E.g.
○  Indexing 100M docs in 2 shards: 2 hours
○  Serving 100M docs in 2 shards: Tp95 < 100 ms
[Diagram: Shards 1-4 indexing 500M documents while serving queries, causing a performance hit]
Indexing Performance - Smart Merge
●  Separate shard setups for indexing and serving
●  More shards for indexing, merged into fewer shards for serving
●  After indexing, issue an API call to merge into the serving collection
●  API call:
○  /admin/collections?action=REBALANCE&scaling_strategy=SMART_MERGE_DISTRIBUTED&collection=collectionName&num_shards=numRequiredShards
●  Parallel Merge
●  Zero downtime
Indexing Performance - Smart Merge
● Indexing and serving use different collections
● The indexing collection is merged into the serving collection with a Smart Merge call
● Indexing can be tuned independently for performance
● Serving SLA unaffected
[Diagram: Collection A_Indexing (Shards 1-16, 500M documents) is merged in parallel, four shards at a time (Shards 1-4, 5-8, 9-12, 13-16), into the serving Collection A (Shards 1-4)]
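The grouping in the diagram can be sketched as follows: each serving shard receives a contiguous block of indexing shards. The sketch makes two assumptions. The indexing shard count is a multiple of the serving count. Shards are numbered in hash-range order, so each block's ranges join into one contiguous range. `smart_merge_plan`, `run_merges` and the caller-supplied `merge_fn` are hypothetical. A thread pool stands in for whatever parallelism the patch actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def smart_merge_plan(indexing_shards, serving_shards):
    """Map each serving shard to the contiguous block of indexing shards merged into it."""
    if indexing_shards % serving_shards != 0:
        raise ValueError("indexing shard count must be a multiple of the serving count")
    block = indexing_shards // serving_shards
    return {
        f"shard{j}": [f"shard{i}" for i in range((j - 1) * block + 1, j * block + 1)]
        for j in range(1, serving_shards + 1)
    }


def run_merges(plan, merge_fn, max_workers=4):
    """Run one merge per serving shard concurrently; merge_fn(target, sources) does the work."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {target: pool.submit(merge_fn, target, sources)
                   for target, sources in plan.items()}
        # result() re-raises any exception raised by a failed merge
        return {target: f.result() for target, f in futures.items()}


plan = smart_merge_plan(16, 4)
print(plan["shard2"])  # indexing shards 5-8 feed serving shard 2
print(run_merges(plan, lambda target, sources: len(sources)))
```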
Allocation Strategies
●  Abstracts out the core placement methodology
●  Least Used Strategy: pick the node that has the fewest cores
●  AZ Aware Strategy: pick a node in a different availability zone from the other cores of a given collection
●  Unused Strategy: pick a node that does not have any cores of a given collection
●  All of them are compatible with all scaling strategies
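The three strategies can be sketched as interchangeable functions over a simple cluster model. Several details are illustrative assumptions, not the patch's interfaces: the model itself (each node maps to an availability zone and a list of hosted `(collection, core)` pairs), the function names, and the fewest-cores tie-break. Each function returns `None` when no node qualifies.

```python
def _fewest_cores(cluster, candidates):
    # Tie-break shared by all strategies: fewest cores, then node name
    return min(candidates, key=lambda n: (len(cluster[n]["cores"]), n), default=None)


def _hosts(cluster, node, collection):
    return any(coll == collection for coll, _ in cluster[node]["cores"])


def least_used(cluster, collection):
    """Pick the node with the fewest cores overall."""
    return _fewest_cores(cluster, cluster)


def unused(cluster, collection):
    """Pick a node that hosts no cores of `collection`."""
    return _fewest_cores(cluster, [n for n in cluster if not _hosts(cluster, n, collection)])


def az_aware(cluster, collection):
    """Pick a node in an availability zone that hosts none of `collection`'s cores."""
    used = {cluster[n]["az"] for n in cluster if _hosts(cluster, n, collection)}
    return _fewest_cores(cluster, [n for n in cluster if cluster[n]["az"] not in used])


cluster = {
    "node1": {"az": "us-east-1a", "cores": [("A", "A1"), ("B", "B1")]},
    "node2": {"az": "us-east-1b", "cores": [("A", "A2")]},
    "node3": {"az": "us-east-1a", "cores": [("B", "B2"), ("C", "C1"), ("C", "C2")]},
    "node4": {"az": "us-east-1c", "cores": [("C", "C3"), ("C", "C4")]},
}
for strategy in (least_used, unused, az_aware):
    print(strategy.__name__, strategy(cluster, "A"))
```

All three functions share one signature, so a scaling strategy can accept any of them as a parameter. That matches the slide's point that every allocation strategy works with every scaling strategy.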
Open Source
●  Fully open sourced: SOLR-9241 (4.6.1)
●  The contributed patch works on top of 4.6.1 and has been tested up to 4.10
●  SOLR-9241 (epic): patches/features on master, with sub-patches:
○  SOLR-9316 through SOLR-9321
○  SOLR-9407
●  Actively working with the community to make the rest of the API compatible with Solr 6+
Summary
●  It is hard to scale and guarantee SLAs (availability, query performance, data freshness) for a multi-tenant, cross-datacenter search architecture based on SolrCloud
●  Rebalance API
○  Scaling Strategies - How to scale?
○  Allocation Strategies - Where to place cores?
●  Forms the basis for Solr Auto Scale
●  Zero-downtime operations for:
○  Dynamically changing the shard setup
○  Decoupling the indexing SLA from serving
○  Replacing nodes
○  Automatically redistributing data as the cluster expands
●  Open Source
Speakers
Nitin Sharma
https://www.linkedin.com/in/knitinsharma
nsarma1985@gmail.com
Suruchi Shah
https://www.linkedin.com/in/suruchishah
suruchi.shah13@gmail.com
