SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka at Peak Performance
Todd Palino
Who Am I?
Kafka At LinkedIn
 1100+ Kafka brokers
 Over 32,000 topics
 350,000+ Partitions
 875 Billion messages per day
 185 Terabytes In
 675 Terabytes Out
 Peak Load (whole site)
– 10.5 Million messages/sec
– 18.5 Gigabits/sec Inbound
– 70.5 Gigabits/sec Outbound
 1800+ Kafka brokers
 Over 79,000 topics
 1,130,000+ Partitions
 1.3 Trillion messages per day
 330 Terabytes In
 1.2 Petabytes Out
 Peak Load (single cluster)
– 2 Million messages/sec
– 4.7 Gigabits/sec Inbound
– 15 Gigabits/sec Outbound
What Will We Talk About?
 Picking Your Hardware
 Monitoring the Cluster
 Triaging Broker Performance Problems
 Conclusion
Hardware Selection
What’s Important To You?
 Message Retention - Disk size
 Message Throughput - Network capacity
 Producer Performance - Disk I/O
 Consumer Performance - Memory
Go Wide
 Kafka is well-suited to horizontal scaling
 RAIS - Redundant Array of Inexpensive Servers
 Also helps with CPU utilization
– Kafka needs to decompress and recompress every message batch
– KIP-31 will help with this by eliminating recompression
 Don’t co-locate Kafka
Disk Layout
 RAID
– Can survive a single disk failure (not RAID 0)
– Provides the broker with a single log directory
– Eats up disk I/O
 JBOD
– Gives Kafka all the disk I/O available
– Broker is not smart about balancing partitions
– If one disk fails, the entire broker stops
 Amazon EBS performance works!
Operating System Tuning
 Filesystem Options
– EXT or XFS
– Using unsafe mount options
 Virtual Memory
– Swappiness
– Dirty Pages
 Networking
Java
 Only use JDK 8 now
 Keep heap size small
– Even our largest brokers use a 6 GB heap
– Save the rest for page cache
 Garbage Collection - G1 all the way
– Basic tuning only
– Watch for humongous allocations
How Much Do You Need?
Buy The Book!
Early Access available now.
Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting.
Also discusses stream processing and other use cases.
Kafka Cluster Sizing
 How big for your local cluster?
– How much disk space do you have?
– How much network bandwidth do you have?
– CPU, memory, disk I/O
 How big for your aggregate cluster?
– In general, multiply the number of brokers by the number of local clusters
– May have additional concerns with lots of consumers
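The sizing questions above can be turned into rough arithmetic. The sketch below is one possible way to do that and is not a method from the talk. All inputs are hypothetical: daily inbound bytes, retention in days, replication factor, usable disk and network per broker, a consumer fan-out (how many times each byte is read), and a headroom fraction. It takes the larger of the disk-driven and network-driven broker counts, then applies the slide's rule of thumb for the aggregate cluster.

```python
import math


def brokers_needed(
    bytes_in_per_day: float,
    retention_days: float,
    replication_factor: int,
    disk_bytes_per_broker: float,
    peak_bytes_in_per_sec: float,
    consumer_fanout: float,
    net_bytes_per_sec_per_broker: float,
    headroom: float = 0.6,
) -> int:
    """Estimate a local cluster's broker count (illustrative assumptions only)."""
    # Disk: every retained byte is stored once per replica.
    stored = bytes_in_per_day * retention_days * replication_factor
    by_disk = stored / (disk_bytes_per_broker * headroom)

    # Network out (assumed model): followers fetch replication_factor - 1
    # copies, and each consumer reads one more copy of the inbound stream.
    out_per_sec = peak_bytes_in_per_sec * (replication_factor - 1 + consumer_fanout)
    by_net = out_per_sec / (net_bytes_per_sec_per_broker * headroom)

    return max(1, math.ceil(max(by_disk, by_net)))


def aggregate_brokers(local_brokers: int, local_clusters: int) -> int:
    """Rule of thumb from the slide: local broker count times local clusters."""
    return local_brokers * local_clusters


if __name__ == "__main__":
    local = brokers_needed(1e12, 4, 2, 10e12, 100e6, 3, 1.25e9, headroom=0.5)
    print("local brokers:", local)
    print("aggregate brokers:", aggregate_brokers(local, 4))
```

CPU, memory, and disk I/O are deliberately left out of this estimate; they still need to be checked against real measurements.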
Topic Configuration
 Partition Counts for Local
– Many theories on how to do this correctly, but the answer is “it depends”
– How many consumers do you have?
– Do you have specific partition requirements?
– Keeping partition sizes manageable
 Partition Counts for Aggregate
– Multiply the number of partitions in a local cluster by the number of local clusters
– Periodically review partition counts in all clusters
 Message Retention
– If aggregate is where you really need the messages, retain them in local only long enough to cover mirror maker problems
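Since the honest answer for local partition counts is "it depends", the following is only a minimal sketch of how the listed questions might be combined. It assumes a hypothetical target for the maximum on-disk size of one partition and takes the largest of the consumer count, the broker count, and the count needed to stay under that size. The aggregate count follows the multiplication rule from the slide.

```python
import math


def local_partitions(
    consumers: int,
    brokers: int,
    retained_bytes: float,
    max_partition_bytes: float,
) -> int:
    """Pick a partition count for a topic in a local cluster (heuristic sketch)."""
    # Enough partitions that no single partition exceeds the size target.
    by_size = math.ceil(retained_bytes / max_partition_bytes)
    # At least one partition per consumer and per broker, so work can spread out.
    return max(consumers, brokers, by_size, 1)


def aggregate_partitions(local_count: int, local_clusters: int) -> int:
    """Aggregate topic: local partition count times the number of local clusters."""
    return local_count * local_clusters


if __name__ == "__main__":
    local = local_partitions(
        consumers=8, brokers=6, retained_bytes=500e9, max_partition_bytes=50e9
    )
    print(local, aggregate_partitions(local, 4))
```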
Possible Broker Improvements
 Namespaces
– Namespace topics by datacenter
– Eliminate local clusters and just have aggregate
– Significant hardware savings
 JBOD Fixes
– Intelligent partition assignment
– Admin tools to move partitions between mount points
– Broker should not fail completely with a single disk failure
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
Keeping An Eye On Things
Monitoring The Foundation
 CPU Load
 Network inbound and outbound
 Filehandle usage for Kafka
 Disk
– Free space - where you write logs, and where Kafka stores messages
– Free inodes
– I/O performance - at least average wait and percent utilization
 Garbage Collection
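As an illustration of the disk checks listed above, here is a small sketch that reports free space and free inodes for a mount point. It assumes a Unix-like host (it relies on os.statvfs), and the path passed in is a placeholder for wherever Kafka stores its log segments or application logs.

```python
import os


def disk_headroom(path: str) -> dict:
    """Return the percentage of free bytes and free inodes for a filesystem."""
    st = os.statvfs(path)
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize  # space available to unprivileged users
    return {
        "free_bytes_pct": 100.0 * free_bytes / total_bytes if total_bytes else 0.0,
        # Some filesystems report zero total inodes; treat that as "not limited".
        "free_inodes_pct": 100.0 * st.f_favail / st.f_files if st.f_files else 100.0,
    }


if __name__ == "__main__":
    # "/" is a placeholder; point this at the Kafka data and log directories.
    print(disk_headroom("/"))
```

I/O wait and percent utilization are not covered here; those usually come from tools such as iostat rather than from filesystem metadata.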
Broker Ground Rules
 Tuning
– Stick (mostly) with the defaults
– Set default cluster retention as appropriate
– Default partition count should be at least the number of brokers
 Monitoring
– Watch the right things
– Don’t try to alert on everything
 Triage and Resolution
– Solve problems, don’t mask them
Too Much Information!
 Monitoring teams hate Kafka
– Per-Topic metrics
– Per-Partition metrics
– Per-Client metrics
 Capture as much as you can
– Many metrics are useful while triaging an issue
 Clients want metrics on their own topics
 Only alert on what is needed to signal a problem
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
Topic Monitoring
 Bytes In, Bytes Out
 Messages In, Produce Rate, Produce Failure Rate
 Fetch Rate, Fetch Failure Rate
 Partition Bytes
 Log End Offset
– Why bother?
– KIP-32 will make this unnecessary
 Quota Throttling
 Provide this to your customers for them to alert on
Client Monitoring
 For consumers, use Burrow
– Monitor all partitions for all consumers
– Provides an easy-to-digest “good, warning, bad” state, with detail available
– Fast and free
 Producers are a little harder
– Several internal implementations of message auditing
– The community needs a good open source standard
 Cluster availability monitoring
– kafka-monitoring is coming soon from LinkedIn!
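To make the consumer side concrete, the sketch below computes per-partition lag as the broker's log end offset minus the consumer's committed offset. This is only the basic quantity that consumer monitoring builds on; it is not Burrow's evaluation logic, and the dictionary layout keyed by (topic, partition) is an assumption made for the example.

```python
def partition_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Return lag per (topic, partition); None where nothing has been committed."""
    lag = {}
    for tp, end in log_end_offsets.items():
        committed = committed_offsets.get(tp)
        # A missing commit is reported as None rather than guessed at.
        lag[tp] = None if committed is None else max(0, end - committed)
    return lag


if __name__ == "__main__":
    ends = {("clicks", 0): 1500, ("clicks", 1): 900, ("clicks", 2): 40}
    commits = {("clicks", 0): 1450, ("clicks", 1): 900}
    for tp, value in sorted(partition_lag(ends, commits).items()):
        print(tp, value)
```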
It’s Broken! Now What?
All The Best Ops People…
 Know more of what is happening than their customers
 Are proactive
 Fix bugs rather than work around them
 This applies to our developers too!
Anticipating Trouble
 Trend cluster utilization and growth over time
 Use default configurations for quotas and retention to require customers to talk to you
 Monitor request times
– If you are able to develop a consistent baseline, this is early warning
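One simple way to use a request-time baseline as early warning is sketched below. It assumes you already collect a request-time metric (for example a total produce time in milliseconds) at regular intervals, and it flags a sample that sits more than a chosen number of standard deviations above the baseline mean. The three-sigma default is an arbitrary illustration, not a recommendation from the talk.

```python
from statistics import mean, stdev


def is_early_warning(baseline_ms, current_ms, sigmas=3.0):
    """Return True if current_ms is well above the baseline samples."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples")
    threshold = mean(baseline_ms) + sigmas * stdev(baseline_ms)
    return current_ms > threshold


if __name__ == "__main__":
    baseline = [10, 11, 9, 10, 12, 10, 11, 9]
    print(is_early_warning(baseline, 12))  # within normal variation
    print(is_early_warning(baseline, 25))  # well above the baseline
```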
Under Replicated Partitions
 Count of partitions that are not fully replicated within the cluster
 Also referred to as “replica lag”
 Primary indicator of problems within the cluster
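As a sketch of what this count means: a partition is under replicated when its in-sync replica set (ISR) is smaller than its assigned replica set. The example assumes partition metadata has already been fetched into plain dictionaries; the field names used here are hypothetical.

```python
def under_replicated(partitions):
    """Return (topic, partition) pairs whose ISR is smaller than the replica set."""
    return [
        (p["topic"], p["partition"])
        for p in partitions
        if len(p["isr"]) < len(p["replicas"])
    ]


SAMPLE = [
    {"topic": "clicks", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
    {"topic": "clicks", "partition": 1, "replicas": [2, 3, 1], "isr": [2]},
    {"topic": "views", "partition": 0, "replicas": [3, 1], "isr": [3, 1]},
]

if __name__ == "__main__":
    lagging = under_replicated(SAMPLE)
    print(len(lagging), lagging)
```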
Broker Performance Checks
 Are you still running 0.8?
 Are all the brokers in the cluster working?
 Are the network interfaces saturated?
– Reelect partition leaders
– Rebalance partitions in the cluster
– Spread out traffic more (increase partitions or brokers)
 Is the CPU utilization high? (especially iowait)
– Is another process competing for resources?
– Look for a bad disk
 Do you have really big messages?
Kafka’s OK, Now What?
 If Kafka is working properly, it’s probably a client issue
– Don’t throw it over the fence. Help your customers understand
 Common producer issues
– Batch size and linger time
– Receive and send buffers
– Sync vs. async, and acknowledgements
 Common consumer issues
– Garbage collection problems
– Min fetch bytes and max wait time
– Not enough partitions
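The client settings named above correspond to configuration keys in the Java producer and consumer. The sketch below lists those keys and renders them in properties-file form. The key names are the ones used by the newer Java clients; the values are placeholders for illustration only and should not be read as tuning advice.

```python
# Producer knobs: batching, socket buffers, and acknowledgements.
PRODUCER_SETTINGS = {
    "batch.size": 16384,            # bytes per partition batch
    "linger.ms": 5,                 # how long to wait for a batch to fill
    "send.buffer.bytes": 131072,    # socket send buffer
    "receive.buffer.bytes": 32768,  # socket receive buffer
    "acks": "all",                  # wait for all in-sync replicas
}

# Consumer knobs: trade latency against fetch efficiency.
CONSUMER_SETTINGS = {
    "fetch.min.bytes": 1,
    "fetch.max.wait.ms": 500,
}


def to_properties(settings: dict) -> str:
    """Render settings as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    print(to_properties(PRODUCER_SETTINGS))
    print(to_properties(CONSUMER_SETTINGS))
```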
Conclusion
One Ecosystem
 Kafka can scale to millions of messages per second, and more
– Operations must scale the cluster appropriately
– Developers must use the right tuning and go parallel
 Few problems are owned by only one side
– Expanding partitions often requires coordination
– Applications that need higher reliability drive cluster configurations
 Either we work together, or we fail separately
Would You Like To Know More?
 Presentations: http://www.slideshare.net/toddpalino
– More Datacenters, More Problems
– Kafka As A Service
– Always download the originals for slide notes!
 Blog Posts: https://engineering.linkedin.com/blog
– Development and SRE blogs on Kafka and other topics
 LinkedIn Open Source: https://github.com/linkedin/streaming
– Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
– Kafka Admin Tools - https://github.com/linkedin/kafka-tools
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Apache Kafka - http://www.meetup.com/http-kafka-apache-org
– Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
 Contribute code
Data @ LinkedIn is Hiring!
 Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next Generation change capture technology (incubating)
 LinkedIn
– Strong commitment to open source
– Do cool things and work with awesome people
 Join us in working on cutting edge stream processing infrastructures
– Please contact kparamasivam@linkedin.com
– Software developers and Site Reliability Engineers at all levels
Appendix
JDK Options
Heap Size: -Xmx6g -Xms6g
Metaspace: -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
G1 Tuning: -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
GC Logging: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc
Error Handling: -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log
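The region size above matters for the earlier advice to watch for humongous allocations: G1 treats an object of roughly half a region or more as humongous and allocates it outside the normal young-generation path. The sketch below works out that threshold for a given -XX:G1HeapRegionSize value. The helper names are made up for this example, and the exact boundary behavior may differ slightly between JVM versions.

```python
def parse_size(text: str) -> int:
    """Parse a JVM-style size such as '16M' or '6g' into bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def humongous_threshold(region_size: str) -> int:
    """Objects at or above half a region are treated here as humongous."""
    return parse_size(region_size) // 2


def is_humongous(object_bytes: int, region_size: str = "16M") -> bool:
    return object_bytes >= humongous_threshold(region_size)


if __name__ == "__main__":
    print(humongous_threshold("16M"))        # threshold in bytes
    print(is_humongous(10 * 1024 * 1024))    # a 10 MiB buffer
    print(is_humongous(1024 * 1024))         # a 1 MiB buffer
```

With 16M regions, a single buffer of about 8 MiB or more would fall into this category, which is one reason very large message batches deserve attention.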
OS Tuning Parameters
 Networking:
net.core.rmem_default = 124928
net.core.rmem_max = 2048000
net.core.wmem_default = 124928
net.core.wmem_max = 2048000
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 1024
OS Tuning Parameters (cont.)
 Virtual Memory
vm.oom_kill_allocating_task = 1
vm.max_map_count = 200000
vm.swappiness = 1
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5
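A quick way to audit a host against values like these is to read them back from /proc/sys. The sketch below is a minimal example that assumes a Linux host; the EXPECTED table copies a few of the values listed above, and the function names are made up for this example. It only reports differences and changes nothing.

```python
from pathlib import Path

# A subset of the values from the slides; extend as needed.
EXPECTED = {
    "vm.swappiness": "1",
    "vm.dirty_ratio": "60",
    "vm.dirty_background_ratio": "5",
    "net.core.rmem_max": "2048000",
    "net.core.wmem_max": "2048000",
}


def sysctl_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root) / key.replace(".", "/")


def read_sysctl(key: str, root: str = "/proc/sys"):
    """Return the current value with whitespace normalized, or None if unreadable."""
    try:
        return " ".join(sysctl_path(key, root).read_text().split())
    except OSError:
        return None


def diff_sysctls(expected: dict, root: str = "/proc/sys") -> dict:
    """Return {key: (current, wanted)} for every setting that differs."""
    mismatches = {}
    for key, want in expected.items():
        have = read_sysctl(key, root)
        if have != want:
            mismatches[key] = (have, want)
    return mismatches


if __name__ == "__main__":
    for key, (have, want) in diff_sysctls(EXPECTED).items():
        print(f"{key}: current={have} wanted={want}")
```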
Kafka Broker Sensors
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
kafka.server:name=PartitionCount,type=ReplicaManager
kafka.server:name=LeaderCount,type=ReplicaManager
kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
kafka.controller:name=ActiveControllerCount,type=KafkaController
kafka.controller:name=OfflinePartitionsCount,type=KafkaController
kafka.log:name=max-dirty-percent,type=LogCleanerManager
kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
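The per-stage request timers above can help narrow down where a slow request spends its time. The sketch below takes one sample of those timers for a request type and reports the largest stage with a short hint. The hints are a simplified reading of what each stage measures, and the input dictionary is an assumed format rather than anything Kafka emits directly.

```python
STAGES = (
    "RequestQueueTimeMs",
    "LocalTimeMs",
    "RemoteTimeMs",
    "ResponseQueueTimeMs",
    "ResponseSendTimeMs",
)

HINTS = {
    "RequestQueueTimeMs": "waiting for a request handler thread",
    "LocalTimeMs": "processing on the leader, often disk I/O",
    "RemoteTimeMs": "waiting on followers or on fetch wait conditions",
    "ResponseQueueTimeMs": "waiting for a network thread",
    "ResponseSendTimeMs": "sending the response to the client",
}


def slowest_stage(times_ms: dict):
    """Return (stage, hint) for the stage with the largest time."""
    present = {stage: times_ms[stage] for stage in STAGES if stage in times_ms}
    if not present:
        raise ValueError("no known request stages in sample")
    stage = max(present, key=present.get)
    return stage, HINTS[stage]


if __name__ == "__main__":
    sample = {"RequestQueueTimeMs": 2, "LocalTimeMs": 40, "RemoteTimeMs": 5}
    print(slowest_stage(sample))
```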
Kafka Broker Sensors - Topics
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*
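When feeding sensor names like these into a metrics pipeline, it can help to split them into a domain and key properties so that per-topic and per-partition series can be routed or filtered. The sketch below is a deliberately simple parser for names in the form shown above; it does not handle quoted values that contain commas.

```python
def parse_object_name(name: str):
    """Split 'domain:key=value,key=value' into (domain, {key: value})."""
    domain, _, props = name.partition(":")
    keys = dict(item.split("=", 1) for item in props.split(","))
    return domain, keys


def is_per_topic(name: str) -> bool:
    """True if the sensor name carries a topic property."""
    return "topic" in parse_object_name(name)[1]


if __name__ == "__main__":
    sensor = "kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*"
    print(parse_object_name(sensor))
    print(is_per_topic(sensor))
```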
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB World 2018: Managing a Mission Critical eCommerce Application on Mong...
MongoDB
 
Ad

Kafka at Peak Performance

  • 1. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka at Peak Performance
  • 2. Todd Palino
  • 3. Who Am I?
  • 4. Kafka At LinkedIn
      • 1100+ Kafka brokers
      • Over 32,000 topics
      • 350,000+ Partitions
      • 875 Billion messages per day
      • 185 Terabytes In
      • 675 Terabytes Out
      • Peak Load (whole site)
          – 10.5 Million messages/sec
          – 18.5 Gigabits/sec Inbound
          – 70.5 Gigabits/sec Outbound
      • 1800+ Kafka brokers
      • Over 79,000 topics
      • 1,130,000+ Partitions
      • 1.3 Trillion messages per day
      • 330 Terabytes In
      • 1.2 Petabytes Out
      • Peak Load (single cluster)
          – 2 Million messages/sec
          – 4.7 Gigabits/sec Inbound
          – 15 Gigabits/sec Outbound
  • 5. What Will We Talk About?
      • Picking Your Hardware
      • Monitoring the Cluster
      • Triaging Broker Performance Problems
      • Conclusion
  • 6. Hardware Selection
  • 7. What’s Important To You?
      • Message Retention - Disk size
      • Message Throughput - Network capacity
      • Producer Performance - Disk I/O
      • Consumer Performance - Memory
  • 8. Go Wide
      • Kafka is well-suited to horizontal scaling
      • RAIS - Redundant Array of Inexpensive Servers
      • Also helps with CPU utilization
          – Kafka needs to decompress and recompress every message batch
          – KIP-31 will help with this by eliminating recompression
      • Don’t co-locate Kafka
  • 9. Disk Layout
      • RAID
          – Can survive a single disk failure (not RAID 0)
          – Provides the broker with a single log directory
          – Eats up disk I/O
      • JBOD
          – Gives Kafka all the disk I/O available
          – Broker is not smart about balancing partitions
          – If one disk fails, the entire broker stops
      • Amazon EBS performance works!
  • 10. Operating System Tuning
      • Filesystem Options
          – EXT or XFS
          – Using unsafe mount options
      • Virtual Memory
          – Swappiness
          – Dirty Pages
      • Networking
  • 11. Java
      • Only use JDK 8 now
      • Keep heap size small
          – Even our largest brokers use a 6 GB heap
          – Save the rest for page cache
      • Garbage Collection - G1 all the way
          – Basic tuning only
          – Watch for humongous allocations
  • 12. How Much Do You Need?
  • 13. Buy The Book!
      Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
  • 14. Kafka Cluster Sizing
      • How big for your local cluster?
          – How much disk space do you have?
          – How much network bandwidth do you have?
          – CPU, memory, disk I/O
      • How big for your aggregate cluster?
          – In general, multiply the number of brokers by the number of local clusters
          – May have additional concerns with lots of consumers
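The sizing questions on this slide can be turned into rough arithmetic. The sketch below is one possible reading, not the speaker's method: it assumes the broker count is driven by whichever of disk or network is tighter, and the function names, the 75% utilization ceiling, and all figures in the usage note are hypothetical.

```python
import math


def brokers_needed(retained_bytes, disk_per_broker_bytes,
                   peak_bits_per_sec, nic_bits_per_sec,
                   max_utilization=0.75):
    """Estimate the broker count for a local cluster.

    retained_bytes should already include replication (assumption).
    max_utilization is a hypothetical headroom ceiling applied to
    both disk and network capacity.
    """
    usable_disk = disk_per_broker_bytes * max_utilization
    usable_net = nic_bits_per_sec * max_utilization
    by_disk = math.ceil(retained_bytes / usable_disk)
    by_network = math.ceil(peak_bits_per_sec / usable_net)
    # The tighter constraint decides; never return fewer than one broker.
    return max(by_disk, by_network, 1)


def aggregate_brokers(local_brokers, local_clusters):
    """Slide rule of thumb: brokers per local cluster times local clusters."""
    return local_brokers * local_clusters
```

For example, 100 TB of retained data on brokers with 10 TB of disk each, and a 15 Gbit/s peak on 10 Gbit/s NICs, is disk-bound under these assumptions. CPU, memory, and disk I/O, which the slide also lists, are not modeled here.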
  • 15. Topic Configuration
      • Partition Counts for Local
          – Many theories on how to do this correctly, but the answer is “it depends”
          – How many consumers do you have?
          – Do you have specific partition requirements?
          – Keeping partition sizes manageable
      • Partition Counts for Aggregate
          – Multiply the number of partitions in a local cluster by the number of local clusters
          – Periodically review partition counts in all clusters
      • Message Retention
          – If aggregate is where you really need the messages, retain them in local only long enough to cover mirror maker problems
  • 16. Possible Broker Improvements
      • Namespaces
          – Namespace topics by datacenter
          – Eliminate local clusters and just have aggregate
          – Significant hardware savings
      • JBOD Fixes
          – Intelligent partition assignment
          – Admin tools to move partitions between mount points
          – Broker should not fail completely with a single disk failure
  • 17. Administrative Improvements
      • Multiple cluster management
          – Topic management across clusters
          – Visualization of mirror maker paths
      • Better client monitoring
          – Burrow for consumer monitoring
          – No open source solution for producer monitoring (audit)
      • End-to-end availability monitoring
  • 18. Keeping An Eye On Things
  • 19. Monitoring The Foundation
      • CPU Load
      • Network inbound and outbound
      • Filehandle usage for Kafka
      • Disk
          – Free space - where you write logs, and where Kafka stores messages
          – Free inodes
          – I/O performance - at least average wait and percent utilization
      • Garbage Collection
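The free-space and free-inode checks above can be sketched with the Python standard library. This is a minimal illustration, assuming a Unix host (`os.statvfs` is not available on Windows). The function names and thresholds are hypothetical, and the paths passed in would be the application log directory and the Kafka data directories.

```python
import os


def disk_headroom(path):
    """Return (free_space_fraction, free_inode_fraction) for path's filesystem."""
    st = os.statvfs(path)
    free_space = st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems report zero total inodes; treat that as unconstrained.
    free_inodes = st.f_favail / st.f_files if st.f_files else 1.0
    return free_space, free_inodes


def low_disk_paths(paths, min_space=0.20, min_inodes=0.10):
    """Return the paths whose filesystem is below either threshold.

    The default thresholds are placeholders, not recommendations
    from the talk.
    """
    flagged = []
    for path in paths:
        space, inodes = disk_headroom(path)
        if space < min_space or inodes < min_inodes:
            flagged.append(path)
    return flagged
```

I/O wait and percent utilization, which the slide also calls for, need a sampling tool such as `iostat` and are outside this sketch.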
  • 20. Broker Ground Rules
      • Tuning
          – Stick (mostly) with the defaults
          – Set default cluster retention as appropriate
          – Default partition count should be at least the number of brokers
      • Monitoring
          – Watch the right things
          – Don’t try to alert on everything
      • Triage and Resolution
          – Solve problems, don’t mask them
  • 21. Too Much Information!
      • Monitoring teams hate Kafka
          – Per-Topic metrics
          – Per-Partition metrics
          – Per-Client metrics
      • Capture as much as you can
          – Many metrics are useful while triaging an issue
      • Clients want metrics on their own topics
      • Only alert on what is needed to signal a problem
  • 22. Broker Monitoring
      • Bytes In and Out, Messages In
          – Why not messages out?
      • Partitions
          – Count and Leader Count
          – Under Replicated and Offline
      • Threads
          – Network pool, Request pool
          – Max Dirty Percent
      • Requests
          – Rates and times - total, queue, local, and send
  • 23. Topic Monitoring
      • Bytes In, Bytes Out
      • Messages In, Produce Rate, Produce Failure Rate
      • Fetch Rate, Fetch Failure Rate
      • Partition Bytes
      • Log End Offset
          – Why bother?
          – KIP-32 will make this unnecessary
      • Quota Throttling
      • Provide this to your customers for them to alert on
  • 24. Client Monitoring
      • For consumers, use Burrow
          – Monitor all partitions for all consumers
          – Provides an easy to digest “good, warning, bad” state, with detail available
          – Fast and free
      • Producers are a little harder
          – Several internal implementations of message auditing
          – The community needs a good open source standard
      • Cluster availability monitoring
          – kafka-monitoring is coming soon from LinkedIn!
  • 25. It’s Broken! Now What?
  • 26. All The Best Ops People…
      • Know more of what is happening than their customers
      • Are proactive
      • Fix bugs rather than work around them
      • This applies to our developers too!
  • 27. Anticipating Trouble
      • Trend cluster utilization and growth over time
      • Use default configurations for quotas and retention to require customers to talk to you
      • Monitor request times
          – If you are able to develop a consistent baseline, this is early warning
  • 28. Under Replicated Partitions
      • Count of partitions that are not fully replicated within the cluster
      • Also referred to as “replica lag”
      • Primary indicator of problems within the cluster
  • 29. Broker Performance Checks
      • Are you still running 0.8?
      • Are all the brokers in the cluster working?
      • Are the network interfaces saturated?
          – Reelect partition leaders
          – Rebalance partitions in the cluster
          – Spread out traffic more (increase partitions or brokers)
      • Is the CPU utilization high? (especially iowait)
          – Is another process competing for resources?
          – Look for a bad disk
      • Do you have really big messages?
  • 30. Kafka’s OK, Now What?
      • If Kafka is working properly, it’s probably a client issue
          – Don’t throw it over the fence. Help your customers understand
      • Common producer issues
          – Batch size and linger time
          – Receive and send buffers
          – Sync vs. async, and acknowledgements
      • Common consumer issues
          – Garbage collection problems
          – Min fetch bytes and max wait time
          – Not enough partitions
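When helping a customer review a client, it can be useful to map the producer and consumer bullets above onto configuration keys. The sketch below uses the key names of the Java clients (`batch.size`, `linger.ms`, `receive.buffer.bytes`, `send.buffer.bytes`, `acks`, `fetch.min.bytes`, `fetch.max.wait.ms`). The checklist structure and the `missing_settings` helper are hypothetical, and the talk does not prescribe values for any of these keys.

```python
# Slide bullets mapped to Java client configuration keys (illustrative grouping).
PRODUCER_CHECKLIST = {
    "Batch size and linger time": ["batch.size", "linger.ms"],
    "Receive and send buffers": ["receive.buffer.bytes", "send.buffer.bytes"],
    "Acknowledgements": ["acks"],
}

CONSUMER_CHECKLIST = {
    "Min fetch bytes and max wait time": ["fetch.min.bytes", "fetch.max.wait.ms"],
}


def missing_settings(client_config, checklist):
    """Return checklist keys the client never set, i.e. still at defaults."""
    return sorted(
        key
        for keys in checklist.values()
        for key in keys
        if key not in client_config
    )
```

A key left at its default is not necessarily a problem. The list only shows which settings have not yet been considered for that client.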
  • 31. Conclusion
  • 32. One Ecosystem
      • Kafka can scale to millions of messages per second, and more
          – Operations must scale the cluster appropriately
          – Developers must use the right tuning and go parallel
      • Few problems are owned by only one side
          – Expanding partitions often requires coordination
          – Applications that need higher reliability drive cluster configurations
      • Either we work together, or we fail separately
  • 33. Would You Like To Know More?
      • Presentations: http://www.slideshare.net/toddpalino
          – More Datacenters, More Problems
          – Kafka As A Service
          – Always download the originals for slide notes!
      • Blog Posts: https://engineering.linkedin.com/blog
          – Development and SRE blogs on Kafka and other topics
      • LinkedIn Open Source: https://github.com/linkedin/streaming
          – Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
          – Kafka Admin Tools - https://github.com/linkedin/kafka-tools
  • 34. Getting Involved With Kafka
      • http://kafka.apache.org
      • Join the mailing lists
          – [email protected]
          – [email protected]
      • irc.freenode.net - #apache-kafka
      • Meetups
          – Apache Kafka - http://www.meetup.com/http-kafka-apache-org
          – Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
      • Contribute code
  • 35. Data @ LinkedIn is Hiring!
      • Streams Infrastructure
          – Kafka pub/sub ecosystem
          – Stream Processing Platform built on Apache Samza
          – Next Generation change capture technology (incubating)
      • LinkedIn
          – Strong commitment to open source
          – Do cool things and work with awesome people
      • Join us in working on cutting edge stream processing infrastructures
          – Please contact [email protected]
          – Software developers and Site Reliability Engineers at all levels
  • 37. Appendix
  • 38. JDK Options
      • Heap Size
          -Xmx6g -Xms6g
      • Metaspace
          -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80
      • G1 Tuning
          -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M
      • GC Logging
          -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc
      • Error Handling
          -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log
  • 39. OS Tuning Parameters
      • Networking:
          net.core.rmem_default = 124928
          net.core.rmem_max = 2048000
          net.core.wmem_default = 124928
          net.core.wmem_max = 2048000
          net.ipv4.tcp_rmem = 4096 87380 4194304
          net.ipv4.tcp_wmem = 4096 16384 4194304
          net.ipv4.tcp_max_tw_buckets = 262144
          net.ipv4.tcp_max_syn_backlog = 1024
  • 40. OS Tuning Parameters (cont.)
      • Virtual Memory
          vm.oom_kill_allocating_task = 1
          vm.max_map_count = 200000
          vm.swappiness = 1
          vm.dirty_writeback_centisecs = 500
          vm.dirty_expire_centisecs = 500
          vm.dirty_ratio = 60
          vm.dirty_background_ratio = 5
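One way to confirm that a broker host actually carries these kernel settings is to compare them against the live values under `/proc/sys`. The sketch below is an illustration for Linux hosts, not tooling from the talk. The function names are hypothetical, and the key-to-path mapping assumes no dots inside individual key components, which holds for the parameters listed above.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key = value' lines (sysctl.conf style) into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalize runs of whitespace so multi-value settings compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def read_live(key, root="/proc/sys"):
    """Read the current value of a kernel parameter, or None if unreadable."""
    path = Path(root) / key.replace(".", "/")
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None


def drift(expected, root="/proc/sys"):
    """Return {key: (expected, live)} for every setting that differs."""
    result = {}
    for key, want in expected.items():
        have = read_live(key, root)
        if have != want:
            result[key] = (want, have)
    return result
```

Calling `drift(parse_sysctl(text))` with the parameter lines from these two slides returns an empty dict on a host that matches them. A `None` live value means the parameter could not be read on that kernel.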
  • 41. Kafka Broker Sensors
      kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
      kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
      kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
      kafka.server:name=PartitionCount,type=ReplicaManager
      kafka.server:name=LeaderCount,type=ReplicaManager
      kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
      kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
      kafka.controller:name=ActiveControllerCount,type=KafkaController
      kafka.controller:name=OfflinePartitionsCount,type=KafkaController
      kafka.log:name=max-dirty-percent,type=LogCleanerManager
      kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
      kafka.network:name=RequestsPerSec,request=*,type=RequestMetrics
      kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
      kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
      kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
      kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
      kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
      kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
  • 42. Kafka Broker Sensors - Topics
      kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
      kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*
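The sensor names on the last two slides are JMX object names: a domain, a colon, then comma-separated key=value properties. When loading them into a metrics collector, it can help to split them apart, for instance to separate broker-wide sensors from per-topic ones. The sketch below is a simplified parser written for these names only. The helper names are hypothetical, and quoted property values containing commas are not handled.

```python
def parse_object_name(name):
    """Split 'domain:key=value,key=value' into (domain, properties dict)."""
    domain, _, props = name.partition(":")
    properties = dict(item.split("=", 1) for item in props.split(","))
    return domain, properties


def is_per_topic(name):
    """True if the sensor is scoped to a topic (carries a 'topic' property)."""
    _, properties = parse_object_name(name)
    return "topic" in properties


def is_pattern(name):
    """True if any property value is the '*' wildcard."""
    _, properties = parse_object_name(name)
    return any(value == "*" for value in properties.values())
```

For example, `kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*` parses to the domain `kafka.log` with four properties, and is both per-topic and a wildcard pattern.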