October 13-16, 2016 • Austin, TX
Lessons from Sharding Solr at Etsy
Gregg Donovan
@greggdonovan
Senior Software Engineer, etsy.com
• 5.5 Years Solr & Lucene at Etsy.com
• 3 Years Solr & Lucene at TheLadders.com
• Speaker at LuceneRevolution 2011 & 2013
Jeff Dean, Challenges in Building Large-Scale Information Retrieval Systems
1.5 Million Active Shops
32 Million Items Listed
21.7 Million Active Buyers
Agenda
• Sharding Solr at Etsy V0 — No sharding
• Sharding Solr at Etsy V1 — Local sharding
• Sharding Solr at Etsy V2 (*) — Distributed sharding
• Questions
* — What we’re about to launch.
Sharding V0 — Not Sharding
• Why do we shard?
• Data size grows beyond RAM on a single box
• Lucene can handle this, but there’s a performance cost
• Data size grows beyond local disk
• Latency requirements
• Not sharding allowed us to avoid many problems we’ll discuss later.
Sharding V0 — Not Sharding
• How to keep data size small enough for one host?
• Don’t store anything other than IDs
• fl=pk_id,fk_id,score (sketch below)
• Keep materialized objects in memcached
• Only index fields needed
• Prune index after experiments add fields
• Get more RAM
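A minimal sketch of the ID-only pattern above, assuming SolrJ 5.x and spymemcached; the collection name, cache key scheme, and hosts are illustrative, not Etsy’s actual code:

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class IdOnlySearch {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/listings");
    MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));

    SolrQuery q = new SolrQuery("ceramic mug");
    q.setFields("pk_id", "fk_id", "score"); // IDs and score only; no stored fields
    for (SolrDocument doc : solr.query(q).getResults()) {
      String pkId = String.valueOf(doc.getFieldValue("pk_id"));
      Object listing = cache.get("listing:" + pkId); // materialized object lives in memcached
      System.out.println(pkId + " -> " + listing);
    }
    solr.close();
    cache.shutdown();
  }
}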
Sharding V0 — Not Sharding
• How does it fail?
• GC
• Solution
• “Banner” protocol
• Client-side load balancer
• Client connects, waits for 4 bytes — 0xC0DEA5CF — from the server within 1-10ms before sending the query; otherwise, try another server (sketch below).
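A minimal sketch of the banner check, assuming the 4-byte banner 0xC0DEA5CF and a ~10ms budget; the host list and retry policy are illustrative:

import java.io.DataInputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BannerBalancer {
  private static final int BANNER = 0xC0DEA5CF;

  static Socket pickHealthy(String[] hosts, int port) {
    for (String host : hosts) {
      try {
        Socket s = new Socket();
        s.connect(new InetSocketAddress(host, port), 10); // connect timeout in ms
        s.setSoTimeout(10); // banner must arrive within ~10ms
        if (new DataInputStream(s.getInputStream()).readInt() == BANNER) {
          return s; // healthy: safe to send the query on this connection
        }
        s.close();
      } catch (Exception e) {
        // timed out or refused; the server is likely in GC, so try the next one
      }
    }
    throw new RuntimeException("no healthy search server");
  }
}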
Sharding V1 — Local Sharding
• Motivations
• Better latency
• Smaller JVMs
• Tough to open a 31GB heap dump on your laptop
• Working set still fit in RAM on one box.
• What’s the simplest system we can build?
Sharding V1 — Local Sharding
• Lucene parallelism
• Shikhar Bhushan at Etsy experimented with segment-level parallelism
• See Search-time Parallelism at Lucene Revolution 2014
• Made its way into LUCENE-6294 (Generalize how IndexSearcher parallelizes collection execution), committed in Lucene 5.1
• Ended up with eight Solr shards per host, each in its own small JVM
• Moved query generation and re-ranking to a separate process: the “mixer”
Sharding V1 — Local Sharding
• Based on Solr distributed search
• By default, Solr does two-pass distributed search
• First pass gets top IDs
• Second pass fetches stored fields for each top document
• Implemented distrib.singlePass mode (SOLR-5768) (example below)
• Does not make sense if individual documents are expensive to fetch
• Basic request tracing via HTTP headers (SOLR-5969)
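A hedged SolrJ example of turning on single-pass mode; the shard list and collection are made up, and distrib.singlePass is the SOLR-5768 parameter:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SinglePassQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient("http://host1:8983/solr/listings");
    SolrQuery q = new SolrQuery("ceramic mug");
    q.set("shards", "host1:8983/solr/listings,host2:8983/solr/listings");
    q.set("distrib", "true");
    q.set("distrib.singlePass", "true"); // ids and field values in one round trip
    System.out.println(solr.query(q).getResults().getNumFound());
    solr.close();
  }
}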
Sharding V1 — Local Sharding
• Required us to fetch 1000+ results from each shard for the reranking layer
• How to efficiently fetch 1000 documents per shard?
• Use Solr’s function-field syntax to fetch data from the FieldCache
• e.g. fl=pk_id:field(pk_id),fk_id:field(fk_id),score
• When all fields are “pseudo” fields, there is no need to fetch stored fields per document (sketch below).
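Continuing the SolrJ sketch above, the deep per-shard fetch might look like this; the row count and field names are illustrative:

SolrQuery q = new SolrQuery("ceramic mug");
q.setRows(1000); // the reranking layer wants a deep candidate pool from each shard
q.setFields("pk_id:field(pk_id)", "fk_id:field(fk_id)", "score"); // all pseudo fields,
    // served from the FieldCache: no per-document stored-field reads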
Sharding V1 — Local Sharding
• Result
• Very large latency win
• Easy system to manage
• Well understood failure and recovery
• Avoided solving many distributed systems issues
Sharding V2 — Distributed Sharding
• Motivation
• Further latency improvements
• Prepare for data to exceed a single node’s capacity
• Significant latency improvements require finer sharding, more CPUs per request
• Requires a real distributed system and sophisticated RPC
• Before proceeding, stop what you’re doing and read everything by Google’s Jeff Dean and Twitter’s Marius Eriksen
Sharding V2 — Distributed Sharding
• New problems
• Partial failures
• Lagging shards
• Synchronizing cluster state and configuration
• Network partitions
• Jepsen
• Distributed IDF issues exacerbated
Solving Distributed IDF
• Inverse Document Frequency (IDF) now varies across shards, biasing ranking
• Calculate IDF offline in Hadoop
• IDFReplacedSimilarityFactory
• Offline data populates a cache of Map<BytesRef,Float> (term → score)
• Override SimilarityFactory#idfExplain
• Cache misses are given a rare-document constant (sketch below)
• Can be extended to solve i18n IDF issues
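A hedged sketch of the idea, assuming Lucene 5.x: a Similarity whose idfExplain returns a precomputed global value, falling back to a constant for unseen terms. The class name and constant are illustrative, not Etsy’s actual IDFReplacedSimilarityFactory; in Solr it would be wired in through a custom SimilarityFactory in the schema.

import java.util.Map;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class OfflineIdfSimilarity extends DefaultSimilarity {
  private static final float RARE_TERM_IDF = 12.0f; // constant for cache misses (illustrative)
  private final Map<BytesRef, Float> idfByTerm;     // populated from the offline Hadoop job

  public OfflineIdfSimilarity(Map<BytesRef, Float> idfByTerm) {
    this.idfByTerm = idfByTerm;
  }

  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
    Float idf = idfByTerm.get(termStats.term());
    float value = (idf != null) ? idf : RARE_TERM_IDF;
    return Explanation.match(value, "idf, precomputed offline (global, not per-shard)");
  }
}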
Sharding V2 — Distributed Sharding
• ShardHandler
• Solr’s abstraction for fanning out queries to shards
• Ships with default implementation (HttpShardHandler) based on HTTP 1.1
• Does fanout (distrib=true) and processes requests coming from other Solr nodes (distrib=false)
• Reads shards.rows and shards.start parameters
ShardHandler API
Solr’s SearchHandler calls submit for each shard and then either takeCompletedIncludingErrors or takeCompletedOrError, depending on partial-results tolerance. A simplified driver loop follows the class below.
public abstract class ShardHandler {

  public abstract void checkDistributed(ResponseBuilder rb);

  public abstract void submit(ShardRequest sreq, String shard, ModifiableSolrParams params);

  public abstract ShardResponse takeCompletedIncludingErrors();

  public abstract ShardResponse takeCompletedOrError();

  public abstract void cancelAll();

  public abstract ShardHandlerFactory getShardHandlerFactory();
}
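A simplified, illustrative driver loop showing how a SearchHandler-style caller uses this API; merging is elided, and this is not Solr’s actual SearchHandler code:

void fanout(ShardHandler handler, ShardRequest sreq, String[] shards,
            ModifiableSolrParams params, boolean tolerateFailures) {
  for (String shard : shards) {
    handler.submit(sreq, shard, params); // non-blocking fan-out to every shard
  }
  while (true) {
    ShardResponse rsp = tolerateFailures
        ? handler.takeCompletedIncludingErrors() // keep going past failed shards
        : handler.takeCompletedOrError();        // first error aborts the request
    if (rsp == null) break; // all submitted requests have been accounted for
    // merge rsp into the overall result here
  }
}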
Sharding V2 — Distributed Sharding
Distributed query requirements
• Distributed tracing
• E.g.: Google’s Dapper, Twitter’s Zipkin, Etsy’s CrossStitch
• Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
• Handle node failures, slowness
Better Know Your Switches
Have a clear understanding of your networking requirements and whether your hardware meets them.
• Prefer line-rate switches
• Prefer cut-through to store-and-forward
• No buffering: just read the packet header and move the packet to its destination
• Track and graph switch statistics in the same dashboard where you display your search latency stats
• errors, retransmits, etc.
Sharding V2 — Distributed Sharding
First experiment: Twitter’s Finagle
• Built on Netty
• Mux RPC multiplexing protocol
• See Your Server as a Function by Marius Eriksen
• Built-in support for Zipkin distributed tracing
• Served as inspiration for Wangle, Facebook’s futures-based RPC library
• Implemented a FinagleShardHandler
Sharding V2 — Distributed Sharding
Second experiment: a custom Thrift-based protocol
• Blocking I/O is easier to integrate with the SolrJ API
• Able to integrate our own distributed tracing
• LZ4 compression via a custom Thrift TTransport
Sharding V2 — Distributed Sharding
Future experiment: HTTP/2
• One TCP connection for all requests between two servers
• Libraries
• Square’s OkHttp
• Google’s gRPC
• Jetty client in 9.3+ — appears to be Solr’s choice
Sharding V2 — Distributed Sharding
Implementation note
• Separated fanout from individual request processing
• SolrJ client via an EmbeddedSolrServer containing an empty RAM directory
• Saves a network hop
• Makes shards easier to profile, tune
• Can return result to SolrJ without sending merged results over the network
Sharding V2 — Distributed Sharding
• Good
• Individual shard times demonstrate very low average latency
• Bad
• Overall p95, p99 nowhere near averages
• Why? Lagging shards due to GC, filterCache misses, etc.
• More shards means more chances to hit outliers
Sharding V2 — Distributed Sharding
• Solutions
• See The Tail at Scale by Jeff Dean, CACM 2013.
• Eliminate all sources of inter-host variability
• No filter or other cache misses
• No GC
• Eliminate OS pauses, networking hiccups, deploys, restarts, etc.
• Not realistic
Sharding V2 — Distributed Sharding
• Backup Requests
• Methods
• Brute force — send two copies of every request to different hosts, take the fastest response
• Less crude — wait X milliseconds for the first server to respond, then send a backup request (sketch below)
• Adaptive — choose X based on the first Y% of responses to return
• Cancellation — cancel the slow request to save CPU once you’re sure you don’t need it
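A hedged sketch of the “less crude” variant using plain java.util.concurrent; the waitMs budget and the Supplier stand-ins for a shard RPC are illustrative. Note that CompletableFuture.cancel does not interrupt a running task, so a real implementation would propagate cancellation into the RPC layer, as the cancellation bullet above suggests.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BackupRequests {
  @SuppressWarnings("unchecked")
  static <T> T hedged(Supplier<T> primary, Supplier<T> backup,
                      long waitMs, ExecutorService pool) throws Exception {
    CompletableFuture<T> first = CompletableFuture.supplyAsync(primary, pool);
    try {
      return first.get(waitMs, TimeUnit.MILLISECONDS); // fast path: primary answered within X ms
    } catch (TimeoutException slow) {
      CompletableFuture<T> second = CompletableFuture.supplyAsync(backup, pool);
      T result = (T) CompletableFuture.anyOf(first, second).get(); // take the fastest response
      first.cancel(true);  // best-effort; cancelling the winner is a no-op
      second.cancel(true); // saves CPU only if the RPC layer observes cancellation
      return result;
    }
  }
}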
Sharding V2 — Distributed Sharding
• “Good enough”
• Return results to the user after X% of results return, if there are enough results. Don’t issue backup requests; just cancel laggards.
• Only applicable in certain domains.
• Poses questions:
• Should you cache partial results?
• How is paging affected?
Resilience Testing
Now you own a distributed system. How do you know it works?
• “The Troublemaker”
• Inspired by Netflix’s Chaos Monkey
• Authored by Etsy’s Toria Gibbs
• Make sure humans can operate it
• Failure simulation — don’t wait until 3am
• Gameday exercises and Runbooks
Bonus material!
Better Know Your Kernel
A lesson not about sharding learned while sharding…
• Linux’s futex_wait() was broken in CentOS 6.6
• Backported patches needed from Linux 3.18
• Future direction: make kernel updates independent from distribution updates
• E.g., plenty of good stuff (networking improvements, kernel introspection [see @brendangregg]) landed between 3.10 and 4.2+, but it won’t come to CentOS for years
• Updating the kernel alone is easier to roll out
What else are we working on?
• Mesos for cluster orchestration
• GPUs for massive increases in per query computational capacity
Thanks for coming.
gregg@etsy.com
@greggdonovan
Questions?