Tom Crayford discusses his experience running hundreds of Apache Kafka clusters on Heroku with a small team. Key points include:

- Using automation to manage clusters and reduce manual work
- Common issues, such as disk growth caused by log compaction bugs, and addressing them by scanning clusters for anomalies
- Kafka's built-in high availability and how it helped during an AWS EBS failure event
- Novel failure cases, such as a JVM memory leak from gzip usage, and the work to fix it
- The importance of taking breaks and not wasting time when operating clusters at scale
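
To make the idea of scanning clusters for anomalies concrete, here is a minimal Python sketch of one way a fleet-wide check for unusual disk growth could look. It is not the tooling described in the talk: the function name `find_fast_growing`, the input shape (two disk-usage samples per cluster), and the 50% growth threshold are all assumptions chosen for illustration.

```python
def find_fast_growing(samples, max_growth=0.5):
    """Return names of clusters whose disk usage grew by more than max_growth.

    samples maps a cluster name to (previous_gb, current_gb), two disk-usage
    readings taken some fixed interval apart. Both the input shape and the
    default threshold are hypothetical, chosen only for this sketch.
    """
    flagged = []
    for name, (previous_gb, current_gb) in samples.items():
        if previous_gb <= 0:
            # No meaningful baseline, so relative growth is undefined; skip.
            continue
        growth = (current_gb - previous_gb) / previous_gb
        if growth > max_growth:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical fleet: only kafka-b has grown sharply between samples.
    fleet = {
        "kafka-a": (100.0, 105.0),
        "kafka-b": (100.0, 190.0),
        "kafka-c": (50.0, 52.0),
    }
    print(find_fast_growing(fleet))
```

A check like this only flags candidates for a person or another automated step to investigate. In a real fleet, the threshold would likely need tuning per workload, since some clusters grow quickly for legitimate reasons.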