The document summarizes lessons learned from running software on a 2000-core cluster. Key issues included:
- RabbitMQ began refusing connections once too many clients were connected, which required setting up a RabbitMQ cluster.
- Third-party libraries like Enyim for memcached access failed under high load and were replaced with custom code.
- Tasks were split too finely, and the resulting message volume overwhelmed RabbitMQ; switching to coarser-grained tasks improved performance.
- A global event logging system ("Greg") contained bugs that slowed debugging, which showed how important reliable tools are. Fixing issues such as unbounded buffers and clock synchronization bugs made it more useful.
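
Two of the lessons above, coarser-grained tasks and bounded buffers, can be illustrated with a minimal Python sketch. It is not taken from the original system: the names `batch_tasks` and `BoundedEventBuffer` are hypothetical, and the drop-oldest policy is an assumption, since the summary does not say how the unbounded buffers in "Greg" were actually fixed.

```python
from collections import deque
from itertools import islice
from typing import Deque, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_tasks(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group fine-grained work items into batches.

    Publishing one message per batch instead of one per item reduces
    the number of messages the broker has to handle.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


class BoundedEventBuffer:
    """Event buffer with a hard capacity (hypothetical helper).

    Assumption: when the buffer is full, the oldest event is dropped
    and counted, so memory use stays bounded under heavy logging.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._events: Deque[str] = deque(maxlen=capacity)
        self.dropped = 0

    def append(self, event: str) -> None:
        # A full deque with maxlen discards its oldest entry on append,
        # so record the loss before it happens.
        if len(self._events) == self._events.maxlen:
            self.dropped += 1
        self._events.append(event)

    def drain(self) -> List[str]:
        """Return all buffered events, oldest first, and empty the buffer."""
        events = list(self._events)
        self._events.clear()
        return events
```

For example, `batch_tasks(range(10), 4)` yields three batches rather than ten single items, and a `BoundedEventBuffer(3)` that receives five events keeps the last three and reports `dropped == 2`. Whether dropping old events, dropping new ones, or blocking the producer is the right policy depends on the workload.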