Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented by George Bailey & Cameron Baker, Rackspace

October 13-16, 2016 • Austin, TX

Rackspace Email’s solution for indexing 50k documents per second
George Bailey – Software Developer, Rackspace
Cameron Baker – Linux Engineer, Rackspace
george.bailey@rackspace.com
cameron.baker@rackspace.com

Who we are…
•  “Rackers” dedicated to Fanatical Support!
•  Based out of San Antonio, TX
•  Part of Cloud Office
•  Email infrastructure development and engineering

2008: Original Problem
•  1 million mailboxes
•  Support needs to track message delivery
•  We need event aggregation + search
•  Needed to provide Fanatical Support!
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

Original System Design
•  Scribed: log aggregation, deposit into HDFS
•  Hadoop 0.20: index via MapReduce
•  Solr 1.4: search
•  Custom tools: index loader, MapReduce, scheduler

Original System Architecture

Past performance
Step                          Time
Transport                     < 1 minute
Index Generation (MapReduce)  10 minutes (cron)
Index Merge                   10 minutes (cron)
Searchable Events             20+ minutes

7 years later…
•  4+ million mailboxes
•  Still running Solr 1.4, Hadoop 0.20, Scribed
•  Scaling, maintenance issues
•  Grew to 100+ physical servers, 15 VMs
•  Events need to be used in other contexts
•  20+ minute time-to-search no longer acceptable

Goals
•  Improve customer experience – Fanatical Support!
•  Provide search results faster
•  Reduce technologies
•  Reduce the amount of custom code
•  Reduce the number of physical servers

New System - Components
•  Apache Flume: aggregation + processing
•  Solr 1.4 to 4.x/5.x: NRT indexing, distributed search
•  SolrCloud allowed us to reduce custom code by 75%

Flume: backpressure + hop availability
•  Sinks may be unreachable or slow
•  File Channel = durable buffering
•  capacity: disk / event size
•  transactionCapacity: match source / sink
•  minimumRequiredSpace
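The "capacity: disk / event size" rule of thumb above can be sketched as a small calculation. This is only an estimate under stated assumptions: the disk budget, average event size, and reserve below are hypothetical values, the reserve stands in for the File Channel's minimumRequiredSpace setting, and any per-event on-disk overhead is ignored.

```python
def file_channel_capacity(disk_budget_bytes: int,
                          avg_event_bytes: int,
                          minimum_required_space_bytes: int) -> int:
    """Estimate how many events fit in the disk space given to a File Channel.

    Assumption: the space reserved as minimumRequiredSpace is not available
    for buffering, so it is subtracted from the budget first.
    """
    if avg_event_bytes <= 0:
        raise ValueError("avg_event_bytes must be positive")
    usable = disk_budget_bytes - minimum_required_space_bytes
    if usable <= 0:
        raise ValueError("disk budget does not exceed the reserved free space")
    return usable // avg_event_bytes


# Hypothetical sizing: 100 GiB of disk, 1 KiB events, 500 MiB kept free.
capacity = file_channel_capacity(
    disk_budget_bytes=100 * 1024**3,
    avg_event_bytes=1024,
    minimum_required_space_bytes=500 * 1024**2,
)
print(capacity)
```

The result is an upper bound for the channel's capacity setting; a real deployment would likely leave extra headroom.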

Flume: batching and throughput
•  Batch size is important
•  File channels = slow
•  Memory channels = fast
•  “Loopback” flows

Flume: controlling the flows
•  One event, multiple uses
•  Channel selectors
•  Optional channels
•  Interceptors
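# Route each event by the value of its eventType header
agent.sources.avroSource.selector.type = multiplexing
agent.sources.avroSource.selector.header = eventType
agent.sources.avroSource.selector.default = defaultChannel
# Required channels are declared under selector.mapping.<header value>
agent.sources.avroSource.selector.mapping.authEvent = authEventChannel
agent.sources.avroSource.selector.mapping.mailEvent = mailEventChannel
# Optional channel: a failed write to it is ignored
agent.sources.avroSource.selector.optional.authEvent = optionalChannel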
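To make the routing concrete, here is a simplified Python model of what a multiplexing selector with these settings appears intended to do. It is an illustration only, not Flume code: the dictionaries mirror the channel names from the slide, and the `route` function is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Channel names taken from the selector settings on the slide.
HEADER = "eventType"
MAPPING: Dict[str, List[str]] = {
    "authEvent": ["authEventChannel"],
    "mailEvent": ["mailEventChannel"],
}
OPTIONAL: Dict[str, List[str]] = {"authEvent": ["optionalChannel"]}
DEFAULT: List[str] = ["defaultChannel"]


def route(headers: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (required_channels, optional_channels) for one event."""
    value = headers.get(HEADER)
    # Unmapped or missing header values fall back to the default channel.
    required = MAPPING.get(value, DEFAULT)
    # A channel already required for this value is not also treated as optional.
    optional = [c for c in OPTIONAL.get(value, []) if c not in required]
    return list(required), optional


print(route({"eventType": "authEvent"}))
print(route({"eventType": "mailEvent"}))
print(route({}))
```

This is how one event can serve multiple uses: an auth event is written to its required channel and, on a best-effort basis, to the optional one.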

Flume: Morphlines + Solr
•  Works with SolrCloud
•  Many helpful built-in commands
•  Scripting support for Java
•  Route to multiple collections
•  Validate, modify events in-flight
http://kitesdk.org/docs/current/morphlines/morphlines-reference-guide.html

Requirements for Solr
•  Near real time indexing of 30,000+ docs per sec
•  Few queries (< 10,000 per day)
•  Heavy distributed facet/group/sort queries
•  Support removing documents older than X days
•  Minimize JVM GC impact on indexing performance

Basic Solr install
[Diagram: Servers A, B, C and D each run one Solr instance hosting one replica of a single collection split into Shard 1 and Shard 2]
•  ~2,500 docs per second
•  Goal: 30,000 docs per second (30,000 / 2,500 = 12)
•  12 * 4 servers = 48 total servers
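The sizing estimate on this slide can be reproduced in a few lines. The sketch assumes, as the slide does, that throughput scales linearly when the four-server group is replicated; the function name is hypothetical.

```python
import math


def servers_needed(goal_docs_per_sec: int,
                   measured_docs_per_sec: int,
                   servers_per_group: int) -> int:
    """Scale a measured server group linearly up to a target indexing rate."""
    groups = math.ceil(goal_docs_per_sec / measured_docs_per_sec)
    return groups * servers_per_group


# Figures from the slide: 4 servers index ~2,500 docs/sec; the goal is 30,000.
total = servers_needed(30_000, 2_500, 4)
print(total)
```

With the slide's figures this comes to 12 groups of 4 servers, which is the 48-server total the basic install would have required.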

Consult the experts…
•  Days of talking and hundreds of emails with Rishi Easwaran
•  Recommendations from Shalin Mangar
•  solr-user@lucene.apache.org
Result:
•  Fewer physical servers
•  Faster indexing

Collections – Optimized for additions/deletions
collection-2015-10-11
collection-2015-10-15
collection-2015-10-16
•  Rolling collections by date
•  ~1 billion documents removed
•  Aliases for updates/queries
•  25 shards - 2 replicas per shard
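A rolling-collection scheme like this can be sketched as date arithmetic plus two Collections API calls (CREATEALIAS and DELETE). The sketch only builds the request URLs and sends nothing. The base URL, the alias name `events-search`, and the 3-day retention window are hypothetical; the slides do not give the real values.

```python
from datetime import date, timedelta
from typing import List
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr"  # hypothetical Solr endpoint


def collection_name(day: date, prefix: str = "collection") -> str:
    """Name a daily collection the way the slide does: collection-YYYY-MM-DD."""
    return f"{prefix}-{day.isoformat()}"


def retained_collections(today: date, retention_days: int) -> List[str]:
    """Collections that should stay searchable, newest first."""
    return [collection_name(today - timedelta(days=n))
            for n in range(retention_days)]


def collections_api_url(**params: str) -> str:
    """Build a Collections API request URL from query parameters."""
    return f"{BASE_URL}/admin/collections?{urlencode(params)}"


today = date(2015, 10, 16)
live = retained_collections(today, retention_days=3)
expired = collection_name(today - timedelta(days=3))

# Point the query alias at the live collections, then drop the expired one.
alias_url = collections_api_url(action="CREATEALIAS", name="events-search",
                                collections=",".join(live))
drop_url = collections_api_url(action="DELETE", name=expired)
print(alias_url)
print(drop_url)
```

Dropping a whole collection removes a day of documents in one operation, which is presumably why the design favors it over deleting documents by query.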

JVM – Lean and mean
•  4GB max/min JVM heap size
•  5 Solr JVM processes per server
•  Using Concurrent Mark Sweep GC
•  GC only on very heavy queries
•  GC < 10ms; occurs < 10 times a day
•  No impact on index performance
•  Reads 28 indexes; writes 2 indexes
[Diagram: Server A running one Solr process, then Server A running five Solr processes]

JVM Monitoring – before it’s too late
•  Proactive OOM monitoring
•  Memory not being released
•  Trapped in GC
•  Restart processes
•  Can impact entire cluster
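One way to detect a process "trapped in GC" is to watch the share of wall-clock time spent in garbage collection. The sketch below is an assumption about how such a check could look, not the presenters' tooling: the 50% threshold is hypothetical, and the samples are presumed to come from some external source such as the JVM's cumulative GC time exposed over JMX.

```python
from typing import List, Tuple

# Each sample: (wall_clock_seconds, cumulative_gc_seconds), oldest first.
Sample = Tuple[float, float]


def gc_fraction(samples: List[Sample]) -> float:
    """Fraction of elapsed wall-clock time spent in GC across the window."""
    (t0, gc0), (t1, gc1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    return (gc1 - gc0) / elapsed


def needs_restart(samples: List[Sample], threshold: float = 0.5) -> bool:
    """Flag a process whose GC share meets or exceeds the (hypothetical) threshold."""
    return gc_fraction(samples) >= threshold


healthy = [(0.0, 0.0), (60.0, 0.05)]   # 50 ms of GC in a minute
stuck = [(0.0, 1.0), (60.0, 55.0)]     # 54 s of GC in a minute
print(needs_restart(healthy), needs_restart(stuck))
```

A monitor built this way could restart the flagged process before it runs out of memory and starts affecting the rest of the cluster.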

autoCommit for near real time indexing
Tested autoCommit and autoSoftCommit settings of:
•  autoCommit 5 seconds to 5 minutes
•  autoSoftCommit 1 second to 1 minute
Result:
•  autoSoftCommit of 5 seconds and autoCommit of 1 minute balanced out memory usage and disk IO
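The chosen intervals translate to millisecond `maxTime` values. As a sketch, the payload below expresses them as a `set-property` command for Solr's Config API. This assumes a Solr version that provides the Config API; the slides do not say whether the settings were applied this way or by editing solrconfig.xml directly.

```python
import json
from typing import Dict


def commit_settings_payload(soft_commit_seconds: int,
                            hard_commit_seconds: int) -> Dict[str, Dict[str, int]]:
    """Build a Config API set-property command for the two commit intervals."""
    return {
        "set-property": {
            # Soft commit: makes new documents visible to searches.
            "updateHandler.autoSoftCommit.maxTime": soft_commit_seconds * 1000,
            # Hard commit: flushes index changes to durable storage.
            "updateHandler.autoCommit.maxTime": hard_commit_seconds * 1000,
        }
    }


# Values from the slide: 5-second soft commit, 1-minute hard commit.
payload = commit_settings_payload(soft_commit_seconds=5, hard_commit_seconds=60)
print(json.dumps(payload, indent=2))
```

The 5-second soft commit is what bounds the time before a new document becomes searchable.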

DocValues – Reduced OOM Errors
•  Struggled with OOME under heavy load
•  Automated restart for nodes trapped in GC cycle
•  Distributed facet/group/sort queries
Solution:
•  docValues="true" – for facet/group/sort fields
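As an illustration, a field definition with docValues enabled can be written as a Schema API `replace-field` command. The field name `sender_domain` and its attributes are hypothetical, and the sketch assumes a Solr version with the bulk Schema API. Documents indexed before the change need reindexing to gain docValues, which a daily rolling collection would pick up naturally.

```python
import json
from typing import Any, Dict


def docvalues_field(name: str, field_type: str) -> Dict[str, Dict[str, Any]]:
    """Build a replace-field command that turns docValues on for one field."""
    return {
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": True,
            # Column-oriented on-disk structure used for facet/group/sort,
            # which avoids building large in-heap structures at query time.
            "docValues": True,
        }
    }


# Hypothetical facet field; the slides do not list the real field names.
command = docvalues_field("sender_domain", "string")
print(json.dumps(command, indent=2))
```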

Caching/Cache Warming – Measure and tune
•  filterCache/queryResultCache/documentCache/etc.
•  Very diverse queries (cache hits were too low)
•  Benefits for our workload did not justify the cost
•  Willing to accept slower queries

Configs - Keep it simple
•  Example configs show off advanced features
•  If you are not using the feature, turn it off
•  Start with a trimmed down config
•  Only add features as needed

Present performance
•  Sustained indexing of ~50,000 docs per sec
•  Each replica indexes ~1,000 docs per sec
•  New documents are searchable within 5 seconds
•  10,000 distributed facet/group/sort queries per day
•  1 billion new documents are indexed per day
•  13 billion documents are searchable
•  7TB of data across all indexes
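A quick cross-check of these figures against one another, as a sketch. It assumes decimal terabytes and an even spread of documents. The slides do not say whether the 50,000 docs per second counts each replica's copy separately, so the replica comparison is only a consistency check.

```python
DOCS_PER_SEC = 50_000                # sustained indexing rate
DOCS_PER_REPLICA_PER_SEC = 1_000     # per-replica indexing rate
SHARDS, REPLICAS_PER_SHARD = 25, 2   # from the collections slide
DOCS_PER_DAY = 1_000_000_000
SEARCHABLE_DOCS = 13_000_000_000
INDEX_BYTES = 7 * 10**12             # 7 TB, assuming decimal terabytes

# Replicas needed to sustain the rate, versus replicas in one daily collection.
indexing_replicas = DOCS_PER_SEC // DOCS_PER_REPLICA_PER_SEC
replicas_per_collection = SHARDS * REPLICAS_PER_SHARD

# Daily volume expressed as an average rate, well below the sustained figure.
avg_docs_per_sec = DOCS_PER_DAY / 86_400

# Searchable volume expressed as days of retention and average index bytes per doc.
days_retained = SEARCHABLE_DOCS / DOCS_PER_DAY
bytes_per_doc = INDEX_BYTES / SEARCHABLE_DOCS

print(indexing_replicas, replicas_per_collection)
print(round(avg_docs_per_sec), days_retained, round(bytes_per_doc))
```

The 50 replicas implied by the indexing rate match the 25 shards with 2 replicas each, and 13 billion searchable documents at 1 billion per day corresponds to roughly 13 days of retention.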

Performance Comparison
Step              Performance (2008)  Performance (2015)
Transport         <1 minute           <1 second (NRT)
Index Generation  10 minutes          <5 seconds
Index Merge       10 minutes          N/A
Search            20+ minutes         <5 seconds
•  Faster transport
•  No more batch processing
•  No external index generation
•  NRT indexing with SolrCloud

Environment Comparison
Server Type           Servers (2008)               Servers (2015)
Transport             Physical: 4, Virtual: 15     Physical: 4, Virtual: 20
Storage / processing  Physical: 100+, Virtual: 0   Physical: 0, Virtual: 0
Search                Physical: 12, Virtual: 0     Physical: 10, Virtual: 5
Total                 Physical: 100+, Virtual: 15  Physical: 14, Virtual: 25
•  Flume / Solr handle event storage and processing
•  No more Hadoop footprint
•  Over 80% reduction in servers
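The reduction figure can be checked against the table rows. The sketch treats "100+" as exactly 100, so the 2008 totals are lower bounds and the true reductions are at least the values computed here.

```python
def reduction(before: int, after: int) -> float:
    """Fractional reduction in server count."""
    return 1 - after / before


# Row sums from the table; "100+" is taken as 100, a lower bound.
physical_2008 = 4 + 100 + 12
virtual_2008 = 15 + 0 + 0
physical_2015 = 4 + 0 + 10
virtual_2015 = 20 + 0 + 5

physical_cut = reduction(physical_2008, physical_2015)
overall_cut = reduction(physical_2008 + virtual_2008,
                        physical_2015 + virtual_2015)
print(round(physical_cut, 3), round(overall_cut, 3))
```

On these lower bounds the physical-server count drops by about 88%, which is consistent with the "over 80%" claim. Counting VMs as well gives at least 70%, and whether that figure also exceeds 80% depends on how far above 100 the storage tier actually was.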

Future…
•  Dedicated Solr nodes with SSDs for indexing
•  Shard query collections for improved performance
•  Larger JVM size for query nodes
•  Multiple Datacenter SolrCloud (replication/mirroring)

Thank you
