OCTOBER 13-16, 2016 • AUSTIN, TX
Rackspace Email’s solution for indexing 50k documents per second
George Bailey – Software Developer, Rackspace
Cameron Baker – Linux Engineer, Rackspace
george.bailey@rackspace.com
cameron.baker@rackspace.com
Who we are…
•  “Rackers” dedicated to Fanatical Support!
•  Based out of San Antonio, TX
•  Part of Cloud Office
•  Email infrastructure development and engineering
2008: Original Problem
•  1 million mailboxes
•  Support needs to track message delivery
•  We need event aggregation + search
•  Needed to provide Fanatical Support!
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
Original System Design
•  Scribed: log aggregation, deposit into HDFS
•  Hadoop 0.20: index via mapreduce
•  solr 1.4: search
•  Custom tools: index loader, mapreduce, scheduler
Original System Architecture
[Figure: original system architecture diagram]
Past performance
Step                          Time
Transport                     < 1 minute
Index Generation (Mapreduce)  10 minutes (cron)
Index Merge                   10 minutes (cron)
Searchable Events             20+ minutes
7 years later…
•  4+ million mailboxes
•  Still running solr 1.4, hadoop 0.20, scribed
•  Scaling, maintenance issues
•  Grew to 100+ physical servers, 15 VMs
•  Events need to be used in other contexts
•  20+ minute time-to-search no longer acceptable
Time to modernize!
Goals
•  Improve customer experience – Fanatical Support!
•  Provide search results faster
•  Reduce technologies
•  Reduce the amount of custom code
•  Reduce the number of physical servers
New System - Components
•  Apache Flume: aggregation + processing
•  Solr, upgraded from 1.4 to 4.x/5.x: NRT indexing, distributed search
•  SolrCloud allowed us to reduce custom code by 75%
System architecture
[Figure: new system architecture diagram]
Performance Tuning
Flume: backpressure + hop availability
•  Sinks may be unreachable or slow
•  File Channel = durable buffering
•  capacity: disk / event size
•  transactionCapacity: match source / sink
•  minimumRequiredSpace
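The "capacity: disk / event size" rule can be worked through as a small calculation. This is only a rough sketch, not Flume code: the disk size, the average event size, and the headroom percentage are made-up inputs, and `file_channel_capacity` is a hypothetical helper name.

```python
def file_channel_capacity(disk_bytes: int, avg_event_bytes: int,
                          headroom_pct: int = 20) -> int:
    """Estimate how many events a File Channel could buffer on one disk.

    headroom_pct is the share of the disk deliberately left unused. It is an
    assumption of this sketch, loosely standing in for the free space that
    minimumRequiredSpace is meant to protect.
    """
    if avg_event_bytes <= 0:
        raise ValueError("avg_event_bytes must be positive")
    usable_bytes = disk_bytes * (100 - headroom_pct) // 100
    return usable_bytes // avg_event_bytes


# Hypothetical inputs: a 100 GiB data disk and 1 KiB events.
capacity = file_channel_capacity(100 * 1024**3, 1024)
print(capacity)
```

The result is an upper bound for the channel's `capacity` setting; real events vary in size, so a measured average is a better input than a guess.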
Flume: batching and throughput
•  Batch size is important
•  File channels = slow
•  Memory channels = fast
•  “Loopback” flows
Flume: controlling the flows
•  One event, multiple uses
•  Channel selectors
•  Optional channels
•  Interceptors
agent.sources.avroSource.selector.type = multiplexing
agent.sources.avroSource.selector.header = eventType
agent.sources.avroSource.selector.default = defaultChannel
agent.sources.avroSource.selector.mapping.authEvent = authEventChannel
agent.sources.avroSource.selector.mapping.mailEvent = mailEventChannel
agent.sources.avroSource.selector.optional.authEvent = optionalChannel
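To make the routing behaviour of the selector properties above concrete, here is a simplified Python model of it. It is not Flume code and does not cover every detail of Flume's multiplexing selector; the channel names come from the slide, while `route` and the three lookup tables are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple

# Routing tables mirroring the selector properties on the slide.
MAPPING: Dict[str, List[str]] = {
    "authEvent": ["authEventChannel"],
    "mailEvent": ["mailEventChannel"],
}
OPTIONAL: Dict[str, List[str]] = {"authEvent": ["optionalChannel"]}
DEFAULT: List[str] = ["defaultChannel"]


def route(headers: Dict[str, str],
          header: str = "eventType") -> Tuple[List[str], List[str]]:
    """Return (required, optional) channel names for one event.

    Required channels must accept the event for the write to succeed;
    optional channels are best effort in this model.
    """
    value = headers.get(header, "")
    required = MAPPING.get(value, DEFAULT)
    optional = [c for c in OPTIONAL.get(value, []) if c not in required]
    return list(required), optional


print(route({"eventType": "authEvent"}))
print(route({"eventType": "somethingElse"}))
```

An event whose header value has no entry falls through to the default channel, which is how one event stream can feed several uses without losing unclassified events.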
Flume: Morphlines + Solr
•  Works with SolrCloud
•  Many helpful built-in commands
•  Scripting support for Java
•  Route to multiple collections
•  Validate, modify events in-flight
http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html
Requirements for Solr
•  Near real time indexing of 30,000+ docs per sec
•  Few queries (< 10,000 per day)
•  Heavy distributed facet/group/sort queries
•  Support removing documents older than X days
•  Minimize JVM GC impact on indexing performance
Basic Solr install
[Figure: one collection with two shards (Shard 1, Shard 2) spread across four servers (A to D), each running one Solr process with one replica]
•  ~2,500 docs per second
•  Goal: 30,000 docs per second (30,000 / 2,500 = 12)
•  12 * # of servers = 48 total servers
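The sizing arithmetic for the basic install can be written out as a short calculation. This is a sketch of the slide's reasoning only; `servers_needed` is a hypothetical helper, and it assumes throughput scales linearly as more four-server groups are added.

```python
import math


def servers_needed(target_docs_per_sec: int,
                   docs_per_sec_per_group: int,
                   servers_per_group: int) -> int:
    """Servers required if each group of servers sustains a fixed rate."""
    groups = math.ceil(target_docs_per_sec / docs_per_sec_per_group)
    return groups * servers_per_group


# Figures from the slide: 4 servers sustain ~2,500 docs/sec, goal is 30,000.
print(servers_needed(30_000, 2_500, 4))
```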
Consult the experts…
•  Days of talking and hundreds of emails with Rishi Easwaran
•  Recommendations from Shalin Mangar
•  solr-user@lucene.apache.org
Result:
•  Fewer physical servers
•  Faster indexing
Collections – Optimized for additions/deletions
collection-2015-10-11
collection-2015-10-12
collection-2015-10-13
collection-2015-10-14
collection-2015-10-15
collection-2015-10-16
•  Rolling collections by date
•  ~1 billion documents removed
•  Aliases for updates/queries
•  25 shards - 2 replicas per shard
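A daily roll of date-named collections might look roughly like the sketch below, which only builds the Solr Collections API URLs (CREATE, CREATEALIAS, DELETE) and does not call them. The alias names `events-write` and `events-read`, the config set name `events_conf`, the base URL, and the six-day window are all assumptions for illustration; the slides do not state the actual alias names or retention period.

```python
from datetime import date, timedelta
from typing import Dict, List
from urllib.parse import urlencode

PREFIX = "collection-"  # naming pattern taken from the slide


def collection_name(day: date) -> str:
    return f"{PREFIX}{day.isoformat()}"


def retained(today: date, keep_days: int) -> List[str]:
    """Daily collections inside the retention window, oldest first."""
    return [collection_name(today - timedelta(days=n))
            for n in range(keep_days - 1, -1, -1)]


def admin_url(base: str, params: Dict[str, str]) -> str:
    return f"{base}/admin/collections?{urlencode(params)}"


def daily_roll(base: str, today: date, keep_days: int) -> List[str]:
    """URLs to call, in order, when rolling over to today's collection."""
    keep = retained(today, keep_days)
    expired = collection_name(today - timedelta(days=keep_days))
    return [
        admin_url(base, {"action": "CREATE", "name": keep[-1],
                         "numShards": "25", "replicationFactor": "2",
                         "collection.configName": "events_conf"}),
        # Point the write alias at the newest collection only.
        admin_url(base, {"action": "CREATEALIAS", "name": "events-write",
                         "collections": keep[-1]}),
        # Point the read alias at every retained collection.
        admin_url(base, {"action": "CREATEALIAS", "name": "events-read",
                         "collections": ",".join(keep)}),
        # Dropping a whole collection removes a day of documents at once.
        admin_url(base, {"action": "DELETE", "name": expired}),
    ]


for url in daily_roll("http://localhost:8983/solr", date(2015, 10, 16), 6):
    print(url)
```

Deleting an entire collection avoids the cost of delete-by-query on a live index, which is presumably why the rolling layout suits a "remove documents older than X days" requirement.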
JVM – Lean and mean
•  4GB max/min JVM heap size
•  5 Solr JVM processes per server
•  Using Concurrent Mark Sweep GC
•  GC only on very heavy queries
•  GC < 10ms; occurs < 10 times a day
•  No impact on index performance
•  Reads 28 indexes; writes 2 indexes
[Figure: Server A running a single Solr process, next to Server A running five Solr processes]
JVM Monitoring – before it’s too late
•  Proactive OOM monitoring
•  Memory not being released
•  Trapped in GC
•  Restart processes
•  Can impact entire cluster
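One simple way to spot a node that is "trapped in GC" is to watch how much heap is still in use right after each collection. The sketch below is a hypothetical check, not the monitoring the speakers ran: the 90% threshold, the three-sample window, and the idea of feeding it post-GC heap percentages from some external collector are all assumptions.

```python
from typing import Sequence


def trapped_in_gc(heap_after_gc_pct: Sequence[float],
                  threshold: float = 90.0,
                  consecutive: int = 3) -> bool:
    """True if the most recent post-GC heap readings all stay above threshold.

    A heap that remains nearly full immediately after collections suggests
    memory is not being released and the process is a restart candidate.
    """
    if len(heap_after_gc_pct) < consecutive:
        return False
    recent = heap_after_gc_pct[-consecutive:]
    return all(pct > threshold for pct in recent)


print(trapped_in_gc([40.0, 92.0, 95.0, 97.0]))  # heap stays full after GC
print(trapped_in_gc([92.0, 95.0, 60.0]))        # last GC freed memory
```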
autoCommit for near real time indexing
Tested autoCommit and autoSoftCommit settings of:
•  autoCommit 5 seconds to 5 minutes
•  autoSoftCommit 1 second to 1 minute
Result:
•  autoSoftCommit of 5 seconds and autoCommit of 1 minute balanced out memory usage and disk IO
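The chosen values can be expressed as a request body for Solr's Config API, which in Solr 5.x accepts a `set-property` command posted to a collection's `/config` endpoint. The sketch below only builds the JSON and does not send it; the property names are given as I understand that API and should be checked against the Solr version in use, and `commit_settings` is a hypothetical helper. The slides do not say whether the speakers set these values this way or directly in solrconfig.xml.

```python
import json


def commit_settings(soft_commit_ms: int, hard_commit_ms: int) -> str:
    """JSON body setting soft and hard auto-commit intervals (milliseconds)."""
    payload = {
        "set-property": {
            "updateHandler.autoSoftCommit.maxTime": soft_commit_ms,
            "updateHandler.autoCommit.maxTime": hard_commit_ms,
        }
    }
    return json.dumps(payload, indent=2)


# Values from the slide: soft commit every 5 seconds, hard commit every minute.
print(commit_settings(5_000, 60_000))
```

The soft commit interval controls how quickly new documents become searchable, while the hard commit interval controls how often data is flushed durably to disk.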
DocValues – Reduced OOM Errors
•  Struggled with OOME under heavy load
•  Automated restart for nodes trapped in GC cycle
•  Distributed facet/group/sort queries
Solution:
•  docValues="true" – for facet/group/sort fields
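As an illustration of where the docValues attribute goes, the sketch below generates schema.xml field definitions with it enabled only on fields used for faceting, grouping, or sorting. The field names and types are invented for the example and are not from the talk; `field_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def field_xml(name: str, field_type: str, doc_values: bool) -> str:
    """Render one schema.xml <field> element as a string."""
    attrs = {"name": name, "type": field_type,
             "indexed": "true", "stored": "true"}
    if doc_values:
        attrs["docValues"] = "true"
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")


# Hypothetical fields: two used in facet/group/sort queries, one that is not.
print(field_xml("sender_domain", "string", doc_values=True))
print(field_xml("event_time", "tdate", doc_values=True))
print(field_xml("subject", "text_general", doc_values=False))
```

With docValues, the per-field data used by facet, group, and sort queries is stored in column-oriented form on disk rather than being built up on the Java heap at query time, which fits the reduction in out-of-memory errors described above. Changing the attribute on an existing field requires reindexing.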
Caching/Cache Warming – Measure and tune
•  filterCache/queryResultCache/documentCache/etc.
•  Very diverse queries (cache hits were too low)
•  Benefits for our workload did not justify the cost
•  Willing to accept slower queries
Configs - Keep it simple
•  Example configs show off advanced features
•  If you are not using the feature, turn it off
•  Start with a trimmed down config
•  Only add features as needed
Performance Comparison
Present performance
•  Sustained indexing of ~50,000 docs per sec
•  Each replica indexes ~1,000 docs per sec
•  New documents are searchable within 5 seconds
•  10,000 distributed facet/group/sort queries per day
•  1 billion new documents are indexed per day
•  13 billion documents are searchable
•  7TB of data across all indexes
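Some of these figures can be cross-checked against each other. The sketch below assumes the ~50,000 docs per second figure is the per-replica rate summed over the 25 shards x 2 replicas mentioned on the collections slide, and that retention is simply searchable documents divided by daily intake; the slides do not state either relationship explicitly.

```python
SHARDS = 25
REPLICAS_PER_SHARD = 2
DOCS_PER_SEC_PER_REPLICA = 1_000
NEW_DOCS_PER_DAY = 1_000_000_000
SEARCHABLE_DOCS = 13_000_000_000

# Replicas in the collection currently being written.
replicas = SHARDS * REPLICAS_PER_SHARD
# Combined indexing rate across those replicas.
cluster_docs_per_sec = replicas * DOCS_PER_SEC_PER_REPLICA
# Approximate number of daily collections kept searchable.
days_searchable = SEARCHABLE_DOCS // NEW_DOCS_PER_DAY

print(replicas, cluster_docs_per_sec, days_searchable)
```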
Performance Comparison
Step              Performance (2008)  Performance (2015)
Transport         <1 minute           <1 second (NRT)
Index Generation  10 minutes          <5 seconds
Index Merge       10 minutes          N/A
Search            20+ minutes         <5 seconds
•  Faster transport
•  No more batch processing
•  No external index generation
•  NRT indexing with SolrCloud
Environment Comparison
Server Type           Servers (2008)               Servers (2015)
Transport             Physical: 4, Virtual: 15     Physical: 4, Virtual: 20
Storage / processing  Physical: 100+, Virtual: 0   Physical: 0, Virtual: 0
Search                Physical: 12, Virtual: 0     Physical: 10, Virtual: 5
Total                 Physical: 100+, Virtual: 15  Physical: 14, Virtual: 25
•  Flume / Solr handle event storage and processing
•  No more Hadoop footprint
•  Over 80% reduction in servers
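The reduction figure can be checked from the table. This sketch treats "100+" as exactly 100, so the 2008 totals are lower bounds, and it assumes the "over 80%" claim refers to physical servers, which the slide does not say outright.

```python
from typing import Dict, Tuple

# (physical, virtual) per server type; "100+" is taken as 100.
SERVERS_2008: Dict[str, Tuple[int, int]] = {
    "transport": (4, 15), "storage": (100, 0), "search": (12, 0)}
SERVERS_2015: Dict[str, Tuple[int, int]] = {
    "transport": (4, 20), "storage": (0, 0), "search": (10, 5)}


def totals(env: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum physical and virtual server counts across all server types."""
    physical = sum(p for p, _ in env.values())
    virtual = sum(v for _, v in env.values())
    return physical, virtual


def reduction(before: int, after: int) -> float:
    """Fractional reduction going from `before` servers to `after`."""
    return 1 - after / before


phys_2008, virt_2008 = totals(SERVERS_2008)
phys_2015, virt_2015 = totals(SERVERS_2015)
print(f"physical only: {reduction(phys_2008, phys_2015):.0%}")
print(f"physical + virtual: "
      f"{reduction(phys_2008 + virt_2008, phys_2015 + virt_2015):.0%}")
```

On these inputs the physical-only reduction is well above 80%, while counting virtual machines as well gives a smaller figure.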
Future…
•  Dedicated Solr nodes with SSDs for indexing
•  Shard query collections for improved performance
•  Larger JVM size for query nodes
•  Multiple Datacenter SolrCloud (replication/mirroring)
Thank you
