SlideShare a Scribd company logo
Intelligent, Automatic Restarts for
Unhealthy Kafka Consumers on
Kubernetes
Chris Shepherd
● Cloudflare has been using Kafka in production since 2014.
● 14 distinct Kafka clusters, across multiple data centers.
● Running roughly 330 nodes.
● Some consumer groups handling 50m messages per second.
● Over 1 trillion messages processed!
Kafka at Cloudflare
Kafka applications on Kubernetes
Keeping Kubernetes
applications healthy
● Liveness probes: Ensures that an application within a container is
live and operational. The kubelet uses liveness probes to know
when to restart a container.
● Readiness probes: Used to determine when a container is ready to
start accepting traffic. A Pod is considered ready when all of its
containers are ready.
● Startup probes: Replace liveness checks on slow starting
containers to avoid them being killed before they are up & running.
Kubernetes health checks
Kubernetes health checks
● Kafka applications don’t typically accept HTTP traffic, so readiness
probes are redundant. For this reason we configure only liveness
checks on them.
● These checks are usually configured to start after a brief period
(e.g. 10s) and are executed at intervals to check if the application is
healthy (e.g. 15s).
● We also usually tolerate a reasonable amount of failed health
checks before restarting the pod (e.g. 3 times).
Health checks for Kafka applications
A naive approach to Kafka
health checks
A naive approach
Why it doesn’t always work
Noisy Oncall
A first iteration on smart
health checks
● This uses two values:
● Current (latest) offset - last message sent to the
topic
● Committed offset - last message processed by
consumer
● Check consumer is moving forwards by ensuring:
● Latest offset is changing - receiving new
messages
● Offsets are committed - consumer processing
new messages
A new approach!
Smart Health Check
Why it didn’t work for us
Our final approach
● Each replica only keeps track of it’s own
partitions offsets
● Sarama library has the functionality to
observe when a rebalancing happens
● We remove keys from the in-memory
offset map so it only includes the
relevant partition values
for {
select {
case message, ok := <-claim.Messages(): // <-- Message received
// Store latest received offset in-memory
offsetMap[message.Partition] = message.Offset
// Handle message
handleMessage(ctx, message)
// Commit message offset
session.MarkMessage(message, "")
case <-session.Context().Done(): // <-- Rebalance happened
// Remove rebalanced partition from in-memory map
delete(offsetMap, claim.Partition())
}
}
Another Gotcha!
● Helpful debug logging:
● Partition being consumed
● Offsets being committed
● Duration of health checks
● Metrics for visibility on health check
performance is vital
Visibility is important!
● Without proper thought, “dumb” health checks can lead to a false sense of security
that a service is running as expected even when it’s not.
● Good health checks can often be the difference between engineers being
called out to fix trivial issues and a service which is self-healing
● Besides initial teething problems, we have reduced the number of PagerDuty
alerts in our team by 50% since using smart health checks
● It’s important to think about specific behavior of the service and decide what
being unhealthy means in each instance, instead of just ensuring that
dependent services are connected.
Takeaways
Thank you!
chris-shepherd1993
Chris Shepherd Senior Systems Engineer
Check out the Cloudflare blog for more information on our smart health checks -
Intelligent, automatic restarts for unhealthy Kafka consumers
Ad

More Related Content

What's hot (20)

0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
confluent
 
Microservices, DevOps & SRE
Microservices, DevOps & SREMicroservices, DevOps & SRE
Microservices, DevOps & SRE
Araf Karsh Hamid
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
[GitOps] Argo CD on GKE (v0.9.2).pdf
[GitOps] Argo CD on GKE (v0.9.2).pdf[GitOps] Argo CD on GKE (v0.9.2).pdf
[GitOps] Argo CD on GKE (v0.9.2).pdf
Jo Hoon
 
Kubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep DiveKubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep Dive
Michal Rostecki
 
Kubernetes Networking
Kubernetes NetworkingKubernetes Networking
Kubernetes Networking
CJ Cullen
 
오픈스택: 구석구석 파헤쳐보기
오픈스택: 구석구석 파헤쳐보기오픈스택: 구석구석 파헤쳐보기
오픈스택: 구석구석 파헤쳐보기
Jaehwa Park
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Henning Jacobs
 
Building Microservices with Apache Kafka
Building Microservices with Apache KafkaBuilding Microservices with Apache Kafka
Building Microservices with Apache Kafka
confluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
PCF Cloud-Native Workshop Slides
PCF Cloud-Native Workshop SlidesPCF Cloud-Native Workshop Slides
PCF Cloud-Native Workshop Slides
VMware Tanzu
 
Kubernetes Deployment Strategies
Kubernetes Deployment StrategiesKubernetes Deployment Strategies
Kubernetes Deployment Strategies
Abdennour TM
 
[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf
[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf
[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf
Jo Hoon
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
HostedbyConfluent
 
Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips
confluent
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
Redis Labs
 
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and GrafanaModern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
InfluxData
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
HostedbyConfluent
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka
confluent
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
confluent
 
Microservices, DevOps & SRE
Microservices, DevOps & SREMicroservices, DevOps & SRE
Microservices, DevOps & SRE
Araf Karsh Hamid
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
Jiangjie Qin
 
[GitOps] Argo CD on GKE (v0.9.2).pdf
[GitOps] Argo CD on GKE (v0.9.2).pdf[GitOps] Argo CD on GKE (v0.9.2).pdf
[GitOps] Argo CD on GKE (v0.9.2).pdf
Jo Hoon
 
Kubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep DiveKubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep Dive
Michal Rostecki
 
Kubernetes Networking
Kubernetes NetworkingKubernetes Networking
Kubernetes Networking
CJ Cullen
 
오픈스택: 구석구석 파헤쳐보기
오픈스택: 구석구석 파헤쳐보기오픈스택: 구석구석 파헤쳐보기
오픈스택: 구석구석 파헤쳐보기
Jaehwa Park
 
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latenc...
Henning Jacobs
 
Building Microservices with Apache Kafka
Building Microservices with Apache KafkaBuilding Microservices with Apache Kafka
Building Microservices with Apache Kafka
confluent
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
Guido Schmutz
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
PCF Cloud-Native Workshop Slides
PCF Cloud-Native Workshop SlidesPCF Cloud-Native Workshop Slides
PCF Cloud-Native Workshop Slides
VMware Tanzu
 
Kubernetes Deployment Strategies
Kubernetes Deployment StrategiesKubernetes Deployment Strategies
Kubernetes Deployment Strategies
Abdennour TM
 
[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf
[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf
[GuideDoc] Deploy EKS thru eksctl - v1.22_v0.105.0.pdf
Jo Hoon
 
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
Bringing Kafka Without Zookeeper Into Production with Colin McCabe | Kafka Su...
HostedbyConfluent
 
Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips Kafka Security 101 and Real-World Tips
Kafka Security 101 and Real-World Tips
confluent
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
Redis Labs
 
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and GrafanaModern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
InfluxData
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
HostedbyConfluent
 
Securing Kafka
Securing Kafka Securing Kafka
Securing Kafka
confluent
 

Similar to Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes with Chris Shepherd (20)

Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
Dimosthenis Botsaris
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
arconsis
 
A new, efficiet coordinated checkpointing protocol combined with selective se...
A new, efficiet coordinated checkpointing protocol combined with selective se...A new, efficiet coordinated checkpointing protocol combined with selective se...
A new, efficiet coordinated checkpointing protocol combined with selective se...
Mumbai Academisc
 
Kubernetes Interview Question with answer by farshad nick
Kubernetes Interview Question with answer  by farshad nickKubernetes Interview Question with answer  by farshad nick
Kubernetes Interview Question with answer by farshad nick
farshad nickfetrat
 
Kafka Deep Dive
Kafka Deep DiveKafka Deep Dive
Kafka Deep Dive
Knoldus Inc.
 
Event driven-arch
Event driven-archEvent driven-arch
Event driven-arch
Mohammed Shoaib
 
Presentation ON Real time system about Multiprocessor.pdf
Presentation ON Real time system about Multiprocessor.pdfPresentation ON Real time system about Multiprocessor.pdf
Presentation ON Real time system about Multiprocessor.pdf
ichbinrupesh
 
Kafka aws
Kafka awsKafka aws
Kafka aws
Ariel Moskovich
 
Operating System Notes help for interview pripration
Operating System Notes  help for interview priprationOperating System Notes  help for interview pripration
Operating System Notes help for interview pripration
ajaybiradar99999
 
Operating System Notes (1).pdf
Operating System Notes (1).pdfOperating System Notes (1).pdf
Operating System Notes (1).pdf
shriyashpatil7
 
Operating System Notes.pdf
Operating System Notes.pdfOperating System Notes.pdf
Operating System Notes.pdf
AminaArshad42
 
Real Time Operating Systems
Real Time Operating SystemsReal Time Operating Systems
Real Time Operating Systems
Murtadha Alsabbagh
 
seminarembedded-150504150805-conversion-gate02.pdf
seminarembedded-150504150805-conversion-gate02.pdfseminarembedded-150504150805-conversion-gate02.pdf
seminarembedded-150504150805-conversion-gate02.pdf
karunyamittapally
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
Mark Harrison
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafka
Thomas Alex
 
First steps with kubernetes
First steps with kubernetesFirst steps with kubernetes
First steps with kubernetes
Vinícius Kroth
 
Distributed Performance testing by funkload
Distributed Performance testing by funkloadDistributed Performance testing by funkload
Distributed Performance testing by funkload
Akhil Singh
 
Uber: Kafka Consumer Proxy
Uber: Kafka Consumer ProxyUber: Kafka Consumer Proxy
Uber: Kafka Consumer Proxy
confluent
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
Dimosthenis Botsaris
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
arconsis
 
A new, efficiet coordinated checkpointing protocol combined with selective se...
A new, efficiet coordinated checkpointing protocol combined with selective se...A new, efficiet coordinated checkpointing protocol combined with selective se...
A new, efficiet coordinated checkpointing protocol combined with selective se...
Mumbai Academisc
 
Kubernetes Interview Question with answer by farshad nick
Kubernetes Interview Question with answer  by farshad nickKubernetes Interview Question with answer  by farshad nick
Kubernetes Interview Question with answer by farshad nick
farshad nickfetrat
 
Presentation ON Real time system about Multiprocessor.pdf
Presentation ON Real time system about Multiprocessor.pdfPresentation ON Real time system about Multiprocessor.pdf
Presentation ON Real time system about Multiprocessor.pdf
ichbinrupesh
 
Operating System Notes help for interview pripration
Operating System Notes  help for interview priprationOperating System Notes  help for interview pripration
Operating System Notes help for interview pripration
ajaybiradar99999
 
Operating System Notes (1).pdf
Operating System Notes (1).pdfOperating System Notes (1).pdf
Operating System Notes (1).pdf
shriyashpatil7
 
Operating System Notes.pdf
Operating System Notes.pdfOperating System Notes.pdf
Operating System Notes.pdf
AminaArshad42
 
seminarembedded-150504150805-conversion-gate02.pdf
seminarembedded-150504150805-conversion-gate02.pdfseminarembedded-150504150805-conversion-gate02.pdf
seminarembedded-150504150805-conversion-gate02.pdf
karunyamittapally
 
Event driven architectures with Kinesis
Event driven architectures with KinesisEvent driven architectures with Kinesis
Event driven architectures with Kinesis
Mark Harrison
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafka
Thomas Alex
 
First steps with kubernetes
First steps with kubernetesFirst steps with kubernetes
First steps with kubernetes
Vinícius Kroth
 
Distributed Performance testing by funkload
Distributed Performance testing by funkloadDistributed Performance testing by funkload
Distributed Performance testing by funkload
Akhil Singh
 
Uber: Kafka Consumer Proxy
Uber: Kafka Consumer ProxyUber: Kafka Consumer Proxy
Uber: Kafka Consumer Proxy
confluent
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Ad

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 
Ad

Recently uploaded (20)

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 

Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes with Chris Shepherd

  • 1. Intelligent, Automatic Restarts for Unhealthy Kafka Consumers on Kubernetes Chris Shepherd
  • 2. ● Cloudflare has been using Kafka in production since 2014. ● 14 distinct Kafka clusters, across multiple data centers. ● Running roughly 330 nodes. ● Some consumer groups handling 50m messages per second. ● Over 1 trillion messages processed! Kafka at Cloudflare
  • 5. ● Liveness probes: Ensures that an application within a container is live and operational. The kubelet uses liveness probes to know when to restart a container. ● Readiness probes: Used to determine when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. ● Startup probes: Replace liveness checks on slow starting containers to avoid them being killed before they are up & running. Kubernetes health checks
  • 7. ● Kafka applications don’t typically accept HTTP traffic, so readiness probes are redundant. For this reason we configure only liveness checks on them. ● These checks are usually configured to start after a brief period (e.g. 10s) and are executed at intervals to check if the application is healthy (e.g. 15s). ● We also usually tolerate a reasonable amount of failed health checks before restarting the pod (e.g. 3 times). Health checks for Kafka applications
  • 8. A naive approach to Kafka health checks
  • 10. Why it doesn’t always work
  • 12. A first iteration on smart health checks
  • 13. ● This uses two values: ● Current (latest) offset - last message sent to the topic ● Committed offset - last message processed by consumer ● Check consumer is moving forwards by ensuring: ● Latest offset is changing - receiving new messages ● Offsets are committed - consumer processing new messages A new approach!
  • 15. Why it didn’t work for us
  • 16. Our final approach ● Each replica only keeps track of it’s own partitions offsets ● Sarama library has the functionality to observe when a rebalancing happens ● We remove keys from the in-memory offset map so it only includes the relevant partition values for { select { case message, ok := <-claim.Messages(): // <-- Message received // Store latest received offset in-memory offsetMap[message.Partition] = message.Offset // Handle message handleMessage(ctx, message) // Commit message offset session.MarkMessage(message, "") case <-session.Context().Done(): // <-- Rebalance happened // Remove rebalanced partition from in-memory map delete(offsetMap, claim.Partition()) } }
  • 18. ● Helpful debug logging: ● Partition being consumed ● Offsets being committed ● Duration of health checks ● Metrics for visibility on health check performance is vital Visibility is important!
  • 19. ● Without proper thought, “dumb” health checks can lead to a false sense of security that a service is running as expected even when it’s not. ● Good health checks can often be the difference between engineers being called out to fix trivial issues and a service which is self-healing ● Besides initial teething problems, we have reduced the number of PagerDuty alerts in our team by 50% since using smart health checks ● It’s important to think about specific behavior of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected. Takeaways
  • 20. Thank you! chris-shepherd1993 Chris Shepherd Senior Systems Engineer Check out the Cloudflare blog for more information on our smart health checks - Intelligent, automatic restarts for unhealthy Kafka consumers