"I will be presenting how we do the smart/automated capacity management on Multi-tenant Kafka cluster in Booking.com. It was a long journey. In this end to end story, I will be presenting what the issues were at the beginning, how we came up with a plan, designed, implemented, and applied to our existing clusters smoothly, now how the clients can monitor and even get alerted before their reserved capacity has been reached. What were the challenges and our learnings? What is next? Why? In Booking.com, the infra team manages 60 different Kafka clusters with hundreds of topics in each. There are clusters running with hundred brokers. As there are hundreds of Kafka clients from tens of different departments, it is high likely some of the clients start abusing the cluster. Especially during peak times, when the retention was set as retention.ms, or when the underlying message size changes, it is hard to predict what would be the occupied storage in total. Finding the relevant clients, deciding which data to discard, dealing with so many unknowns in a short period of time can be hassle. Also these are not fun activities but just a toil for the team. What? To avoid such boring issues, the team has chosen the path to build a smart mechanism and have quotas in place. It helped saving time developing new features instead of chasing people to resolve collisions. You can think that as an extension to the built-in throttling producer/consumer rate limits provided by the Apache Kafka, but it is much more than that. There are several components will be explained during the presentation one of them is our control plane (custom built) which manages the communication between clients and servers and does many things automated. Another one is the Custom Policies that we plugged in on the Kafka side to validate the configuration even tried (malicious configuration) on the server side. The talk guarantees learning and shows examples of Kafka at scale problems in Booking.com."