How is Kafka so Fast?

Apr 9, 2019

I explain the key points of Kafka that make it a very fast and reliable distributed message bus, even on mechanical disks.

How is Kafka so fast?
Understanding the design of Kafka and how it handles Criteo workload
Ricardo Paiva and Hervé Rivière

What is Kafka?
• Apache Kafka is a distributed message queue
• Open-sourced by LinkedIn in 2011
• High-throughput
• Highly distributed
• Fault-tolerant
• Low-latency

Kafka @ Criteo
• Use cases:
  • GLUP pipeline (aka Kafka Local)
  • Streaming event processing platform (aka Kafka Stream)
• Some figures:
  • 14 clusters / 200 servers / 7 datacenters
  • Up to 7 million messages / sec
  • Up to 150 TB processed per day

Topics, Partitions and Offsets
[Figure: two topics (Topic A, Topic B), each split into numbered partitions (Partition 0, Partition 1, ...). Every partition is an ordered sequence of records numbered by offset, from 0 at the old end to the highest offset at the new end, where writes are appended.]
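
The relationship between topics, partitions and offsets can be sketched in a few lines of Python. This is only an in-memory illustration, not how Kafka stores data: the class names `Partition` and `Topic`, the `create` helper and the sample records are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only, ordered sequence of records; the list index is the offset."""
    records: list = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Append at the 'new' end and return the offset assigned to the record."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at the given offset."""
        return self.records[offset:offset + max_records]


@dataclass
class Topic:
    """A named set of partitions, each with its own independent offsets."""
    name: str
    partitions: list

    @classmethod
    def create(cls, name: str, num_partitions: int) -> "Topic":
        return cls(name, [Partition() for _ in range(num_partitions)])


# Hypothetical usage: offsets are per partition, not per topic.
topic_a = Topic.create("Topic A", num_partitions=2)
first = topic_a.partitions[0].append(b"click-1")
second = topic_a.partitions[0].append(b"click-2")
other = topic_a.partitions[1].append(b"view-1")
```

Offsets only order records inside one partition; two partitions of the same topic both start at 0 and grow independently.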

Brokers
• Manage partitions
  • Receive records from producers for a (topic, partition)
  • Answer consumers asking for records at a (topic, partition, offset)
• Manage replicas
• Manage consumer coordination
  • Assign the right partitions to the right consumers
[Figure: a Producer, Broker 1, Broker 2 and two Consumers. The consumers issue Fetch (Topic A, Partition 4, Offset 10) and Fetch (Topic B, Partition 1, Offset 10); the brokers reply with bytes.]
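
To make the consumer coordination point concrete, here is a minimal sketch of handing out partitions so that each one has exactly one owner. The round-robin strategy and the function name `assign_partitions` are assumptions chosen for illustration; this is not presented as the assignment logic Kafka itself runs.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    ordered = sorted(consumers)
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# Hypothetical group of two consumers reading a four-partition topic.
partitions = [("Topic B", p) for p in range(4)]
assignment = assign_partitions(partitions, ["consumer-1", "consumer-2"])
```

Because a partition is never shared inside a group, adding consumers beyond the number of partitions would leave the extra ones idle in this sketch.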

Producers
• Producers decide which partition to send to;
• Producers can send a batch of messages;
• Producers can compress a batch;
• Producers wait for an acknowledgement from the partition leader (acks=1) or from the leader and its in-sync replicas (acks=all).
[Figure: a Producer, a Broker (partition leader) and two Brokers (replica), with the ack flow.]
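
The producer-side ideas (choosing a partition, batching, compressing the batch as a whole) can be sketched with the standard library alone. Everything here is an assumption for illustration: `zlib.crc32` stands in for a real partitioner hash, the length-prefixed framing is invented for the example and is not Kafka's wire format, and the sample messages are made up.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land on the same partition."""
    return zlib.crc32(key) % num_partitions


def encode_batch(messages: list) -> bytes:
    """Length-prefix each message, concatenate, and compress the whole batch once."""
    raw = b"".join(len(m).to_bytes(4, "big") + m for m in messages)
    return zlib.compress(raw)


def decode_batch(payload: bytes) -> list:
    """Inverse of encode_batch, as a consumer would run it."""
    raw = zlib.decompress(payload)
    messages, pos = [], 0
    while pos < len(raw):
        size = int.from_bytes(raw[pos:pos + 4], "big")
        messages.append(raw[pos + 4:pos + 4 + size])
        pos += 4 + size
    return messages


# Hypothetical events: similar messages compress far better together than alone.
messages = [b'{"event": "display", "campaign": %d}' % i for i in range(100)]
batch = encode_batch(messages)
one_by_one = sum(len(zlib.compress(m)) for m in messages)
```

Comparing `len(batch)` with `one_by_one` shows why compression is applied per batch: the redundancy between messages is only visible to the compressor when they are grouped.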

Consumers
• Consumers control which offset to consume from;
• Consumers commit offsets to Kafka, which stores them in just another Kafka topic;
• Consumers can receive batched and/or compressed data;
• Kafka coordinates which partitions each consumer consumes from.
[Figure: exchange between a Consumer and a Broker for Partition 2: (1) fetch request (labels "offset=7" and "Partition 2, offset 6"), (2) records 7 8 9 returned, (3) Commit offset=9.]
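
A toy model of consumer-controlled offsets may help here. It is a sketch under stated assumptions: the `Consumer` class, the plain list standing in for a partition and the dictionary standing in for the offsets topic are all hypothetical, and the committed value is taken to be the next offset to read.

```python
class Consumer:
    """Toy consumer: it, not the broker, decides which offset to read next."""

    def __init__(self, log: list, committed: dict, group: str):
        self.log = log              # stands in for one partition on a broker
        self.committed = committed  # stands in for the offsets topic
        self.group = group
        self.position = committed.get(group, 0)

    def poll(self, max_records: int = 3) -> list:
        """Fetch a batch starting at the current position and advance past it."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self) -> None:
        """Record the next offset to read so a restart resumes from there."""
        self.committed[self.group] = self.position

    def seek(self, offset: int) -> None:
        """Rewind or skip ahead: the consumer is free to pick any offset."""
        self.position = offset


log = [f"record-{i}".encode() for i in range(10)]
offsets = {}

c1 = Consumer(log, offsets, "group-1")
first = c1.poll()      # records 0 to 2
c1.commit()
c1.poll()              # records 3 to 5, read but never committed

c2 = Consumer(log, offsets, "group-1")   # simulated restart
resumed = c2.poll()    # starts again from the last committed position
```

The restarted consumer re-reads records 3 to 5, which is the usual at-least-once trade-off when commits lag behind processing.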

Only sequential I/O
• Each Kafka partition is mapped to segment files
• Segment file: append-only log structure
• Records are immutable
• The broker does very few random disk seeks
[Figure: Kafka, the active segment file and the old segment files.]
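
The segment-file idea can be sketched as an append-only log that rolls to a new file once the active one is large enough. This is a simplified illustration, not Kafka's on-disk format: the class `SegmentedLog`, the 4-byte length prefix and the tiny `max_segment_bytes` value are assumptions; naming each file after the first offset it contains mirrors the convention Kafka uses for its segment files.

```python
import os
import tempfile


class SegmentedLog:
    """Append-only log split into segment files named after their first offset."""

    def __init__(self, directory: str, max_segment_bytes: int):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.next_offset = 0
        self._roll()

    def _roll(self) -> None:
        """Start a new active segment; older segments are never written again."""
        name = f"{self.next_offset:020d}.log"
        self.active_path = os.path.join(self.directory, name)
        self.active = open(self.active_path, "ab")

    def append(self, record: bytes) -> int:
        """Write a length-prefixed record at the end of the active segment."""
        if self.active.tell() >= self.max_segment_bytes:
            self.active.close()
            self._roll()
        self.active.write(len(record).to_bytes(4, "big") + record)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def close(self) -> None:
        self.active.close()


directory = tempfile.mkdtemp()
log = SegmentedLog(directory, max_segment_bytes=64)
offsets = [log.append(b"x" * 20) for _ in range(10)]
log.close()
segments = sorted(os.listdir(directory))
```

Every write lands at the end of the single active file, so the disk head (on a mechanical disk) never has to jump around to serve appends.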

Caching data for free
• Kafka relies on the native Linux page cache (read-ahead and write-behind)
  • JVM off-heap cache for free
• Kafka records aren't deserialized in the Kafka JVM
  • No Java object memory overhead
  • No OutOfMemory issue
  • No big GC pauses
[Figure: Kafka, the active segment file and the old segment files, spread across the OS and the disk.]

Reliability with replication
• Kafka disk writes are asynchronous
• Synchronisation of Kafka replicas (over the network) is synchronous
• Replicas are trusted in case of data corruption / server crash
[Figure: Broker (partition leader), Broker (replica), Broker (previous leader).]
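
The trade-off on this slide (acknowledge once the replicas have the record, flush to disk later) can be modelled roughly as follows. It is a deliberately simplified sketch: the classes `Replica` and `Leader` are hypothetical, and the leader pushes records to its followers for brevity, whereas in Kafka the followers fetch from the leader.

```python
class Replica:
    def __init__(self):
        self.memory = []   # appended records, as if sitting in the OS page cache
        self.flushed = 0   # how many records have reached the disk so far

    def append(self, record: bytes) -> None:
        self.memory.append(record)

    def flush(self) -> None:
        """Deferred disk write, run later and off the produce path."""
        self.flushed = len(self.memory)


class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def produce(self, record: bytes) -> int:
        """Append, copy to every follower, then ack; never wait for the disk."""
        self.append(record)
        for follower in self.followers:
            follower.append(record)    # synchronous replication step
        return len(self.memory) - 1    # the ack carries the record's offset


followers = [Replica(), Replica()]
leader = Leader(followers)
acked_offset = leader.produce(b"payment-1")
```

At the moment of the ack nothing has been flushed, yet the record exists on three machines, which is the sense in which the design relies on replicas rather than on disks.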

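Sending data from file to network (traditional approach)
read(file, tmp_buf, len);      // copy from the file (page cache) into a user-space buffer
write(socket, tmp_buf, len);   // copy from the user-space buffer into the socket buffer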

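Sending data from file to network (zero-copy approach)
transferTo(position, count, writableChannel);   // Java NIO FileChannel.transferTo: the kernel moves the bytes, avoiding the user-space copy where the OS supports it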
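
The two approaches can be compared side by side in Python, as a sketch rather than a benchmark. `socket.sendfile()` is the standard-library call that uses `os.sendfile()` where the platform provides it and otherwise falls back to plain sends; the helper names, the payload and the local socket pair standing in for a consumer connection are assumptions for the example.

```python
import socket
import tempfile


def send_with_copy(file, sock, chunk_size=4096):
    """Traditional path: read() into a user-space buffer, then write it to the socket."""
    sent = 0
    while True:
        tmp_buf = file.read(chunk_size)
        if not tmp_buf:
            return sent
        sock.sendall(tmp_buf)
        sent += len(tmp_buf)


def send_zero_copy(file, sock):
    """Zero-copy path: hand the file to the kernel instead of buffering it here."""
    return sock.sendfile(file)


payload = b"kafka-record " * 256   # about 3 KB, small enough for the socket buffer

with tempfile.TemporaryFile() as segment:
    segment.write(payload)
    segment.flush()
    results = {}
    for name, send in (("copy", send_with_copy), ("zero_copy", send_zero_copy)):
        segment.seek(0)
        left, right = socket.socketpair()
        with left, right:
            sent = send(segment, left)
            left.shutdown(socket.SHUT_WR)   # signal end of stream to the reader
            received = b""
            while chunk := right.recv(65536):
                received += chunk
            results[name] = (sent, received)
```

Both paths deliver identical bytes; the difference is that the second never brings the data into the sending process, which matters when a broker serves the same segment bytes to many consumers.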

Key takeaways
• Parallelism based on topic partitions;
• Data compressed/uncompressed on the client;
• Producers send batches of messages;
• No serialization/deserialization costs on the brokers;
• Writing directly to file:
  • Append only (cheaper disks);
  • No complex data structure (no B-tree or LSM tree);
• Uses OS memory management;
• Relies on replicas, not on disks;
• Zero-copy.

Kibana + timelion: time series with the elastic stackSylvain Wallez

The document discusses Kibana and Timelion, which are tools for visualizing and analyzing time series data in the Elastic Stack. It provides an overview of Kibana's evolution and capabilities for creating dashboards. Timelion is introduced as a scripting language that allows users to transform, aggregate, and calculate on time series data from multiple sources to create visualizations. The document demonstrates Timelion's expression language, which includes functions, combinations, filtering, and attributes to process and render time series graphs.

Microservices in the Apache Kafka Ecosystemconfluent

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah

Monitoring Apache Kafkaconfluent

Monitoring Apache Kafka When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean? Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines. In this presentation, we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.

Stream Processing with Apache Kafka and .NETconfluent

Presentation from South Bay.NET meetup on 3/30. Speaker: Matt Howlett, Software Engineer at Confluent Apache Kafka is a scalable streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix, Walmart, Airbnb, Goldman Sachs and LinkedIn. In this talk Matt will give a technical overview of Kafka, discuss some typical use cases (from surge pricing to fraud detection to web analytics) and show you how to use Kafka from within your C#/.NET applications.

Effectively-once semantics in Apache PulsarMatteo Merli

Apache Pulsar @SplunkKarthik Ramasamy

Kafka presentationMohammed Fazuluddin

Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.

Kafka 101Clement Demonchy

Apache Kafka IntroductionAmita Mirajkar

A New Way of Thinking | NATS 2.0 & ConnectivityNATS

NATS 2.0 is the largest feature release since the original code base for the server was released. NATS 2.0 was created to allow a new way of thinking about NATS as a shared utility, solving problems at scale through distributed security, multi-tenancy, larger networks, and secure sharing of data. In this presentation, Derek discusses the motives behind the newest features of NATS and how to leverage them to reduce total cost of ownership, decrease time to value, support extremely large scale deployments, and decentralize security to create secure and easy to manage modern distributed systems.

Introduction to Apache KafkaJeff Holoman

The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData

Did you like it? Check out our E-book: Apache NiFi - A Complete Guide https://ebook.getindata.com/apache-nifi-complete-guide Apache NiFi is one of the most popular services for running ETL pipelines otherwise it’s not the youngest technology. During the talk, there are described all details about migrating pipelines from the old Hadoop platform to the Kubernetes, managing everything as the code, monitoring all corner cases of NiFi and making it a robust solution that is user-friendly even for non-programmers. Author: Albert Lewandowski Linkedin: https://www.linkedin.com/in/albert-lewandowski/ ___ Getindata is a company founded in 2014 by ex-Spotify data engineers. From day one our focus has been on Big Data projects. We bring together a group of best and most experienced experts in Poland, working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets. Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including i.a. Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries. https://getindata.com

Apache Kafka Best PracticesDataWorks Summit/Hadoop Summit

Apache Kafka becoming the message bus to transfer huge volumes of data from various sources into Hadoop. It's also enabling many real-time system frameworks and use cases. Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices in deploying Apache Kafka in production. How to Secure a Kafka Cluster, How to pick topic-partitions and upgrading to newer versions. Migrating to new Kafka Producer and Consumer API. Also talk about the best practices involved in running a producer/consumer. In Kafka 0.9 release, we’ve added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Now Kafka allows authentication of users, access control on who can read and write to a Kafka topic. Apache Ranger also uses pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects. We will showcase open sourced Kafka REST API and an Admin UI that will help users in creating topics, re-assign partitions, Issuing Kafka ACLs and monitoring Consumer offsets.

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters

Salvatore Sanfilippo – How Redis Cluster works, and why In this talk the algorithmic details of Redis Cluster will be exposed in order to show what were the design tensions in the clustered version of an high performance database supporting complex data type, the selected tradeoffs, and their effect on the availability and consistency of the resulting solution.Other non-chosen solutions in the design space will be illustrated for completeness.

How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit

Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.

Apache Kafka - Messaging System OverviewDmitry Tolpeko

Apache Kafka Fundamentals for Architects, Admins and Developersconfluent

This document summarizes a presentation about Apache Kafka. It introduces Apache Kafka as a modern, distributed platform for data streams made up of distributed, immutable, append-only commit logs. It describes Kafka's scalability similar to a filesystem and guarantees similar to a database, with the ability to rewind and replay data. The document discusses Kafka topics and partitions, partition leadership and replication, and provides resources for further information.

Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar

Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. This slide deck covers Kafka Architecture with some small examples from the command line. Then we expand on this with a multi-server example to demonstrate failover of brokers as well as consumers. Then it goes through some simple Java client examples for a Kafka Producer and a Kafka Consumer. We have also expanded on the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advance Kafka Producers.

Deep Dive into Apache Kafkaconfluent

In the last few years, Apache Kafka has been used extensively in enterprises for real-time data collecting, delivering, and processing. In this presentation, Jun Rao, Co-founder, Confluent, gives a deep dive on some of the key internals that help make Kafka popular. - Companies like LinkedIn are now sending more than 1 trillion messages per day to Kafka. Learn about the underlying design in Kafka that leads to such high throughput. - Many companies (e.g., financial institutions) are now storing mission critical data in Kafka. Learn how Kafka supports high availability and durability through its built-in replication mechanism. - One common use case of Kafka is for propagating updatable database records. Learn how a unique feature called compaction in Apache Kafka is designed to solve this kind of problem more naturally.

Handle Large Messages In Apache KafkaJiangjie Qin

Like many other messaging systems, Kafka has put limit on the maximum message size. User will fail to produce a message if it is too large. This limit makes a lot of sense and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios, it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such feature. This talk covers our solution to send large message through Kafka without additional storage.

Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...Vietnam Open Infrastructure User Group

This document discusses upgrading an Openstack network to SDN with Tungsten Fabric. It evaluates three solutions: 1) using the same database across regions, 2) hot-swapping Open vSwitch and virtual routers, and 3) using an ML2 plugin. The recommended solution is #3 as it provides minimum downtime. Key steps include installing the OpenContrail driver, synchronizing network resources between Openstack and Tungsten, and live migrating VMs. Topology 2 is also recommended as it requires minimum changes. The upgrade migrated 80 VMs and 16 compute nodes to the SDN network without downtime. Issues discussed include synchronizing resources and migrating VMs between Open vSwitch and virtual routers.

Hello, kafka! (an introduction to apache kafka)Timothy Spann

Diving into the Deep End - Kafka Connectconfluent

Dennis Wittekind, Confluent, Senior Customer Success Engineer Perhaps you have heard of Kafka Connect and think it would be a great fit in your application's architecture, but you like to know how things work before you propose them to your team? Perhaps you know enough Connect to be dangerous, but you haven't had the time to really understand all the moving pieces? This meetup talk is for you! We'll briefly introduce Connect to the uninitiated, and then jump in to underlying concepts and considerations you should make when running Connect in production! We'll even run a live demo! What could go wrong!? https://www.meetup.com/Saint-Louis-Kafka-meetup-group/events/272687113/

OpenTelemetry For ArchitectsKevin Brockhoff

Getting Started Monitoring with Prometheus and GrafanaSyah Dwi Prihatmoko

Scalability, Availability & Stability PatternsJonas Bonér

This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.

Fundamentals of Apache KafkaChhavi Parasher

Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent

Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operation teams and bank tellers to assist with assessing risk and protect customers in a myriad of ways. Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches, or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on their visions faster and in a more scalable way through the Kafka solution, helping establish Capital One as an innovator in the banking space. Join us for this online talk on lessons learned, best practices and technical patterns of Capital One’s deployment of Apache Kafka. -Find out how Kafka delivers on a 5-second service-level agreement (SLA) for inside branch tellers. -Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions. -Understand how Capital One manages Kafka Docker containers using Kubernetes. Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69?.

More Related Content

What's hot (20)

Kafka presentationMohammed Fazuluddin

Kafka 101Clement Demonchy

Apache Kafka IntroductionAmita Mirajkar

A New Way of Thinking | NATS 2.0 & ConnectivityNATS

Introduction to Apache KafkaJeff Holoman

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData

Apache Kafka Best PracticesDataWorks Summit/Hadoop Summit

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters

How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit

Apache Kafka - Messaging System OverviewDmitry Tolpeko

Apache Kafka Fundamentals for Architects, Admins and Developersconfluent

Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar

Deep Dive into Apache Kafkaconfluent

Handle Large Messages In Apache KafkaJiangjie Qin

Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...Vietnam Open Infrastructure User Group

Hello, kafka! (an introduction to apache kafka)Timothy Spann

Diving into the Deep End - Kafka Connectconfluent

OpenTelemetry For ArchitectsKevin Brockhoff

Getting Started Monitoring with Prometheus and GrafanaSyah Dwi Prihatmoko

Scalability, Availability & Stability PatternsJonas Bonér

Kafka presentationMohammed Fazuluddin

Kafka 101Clement Demonchy

Apache Kafka IntroductionAmita Mirajkar

A New Way of Thinking | NATS 2.0 & ConnectivityNATS

Introduction to Apache KafkaJeff Holoman

Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...GetInData

Apache Kafka Best PracticesDataWorks Summit/Hadoop Summit

Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters

How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit

Apache Kafka - Messaging System OverviewDmitry Tolpeko

Apache Kafka Fundamentals for Architects, Admins and Developersconfluent

Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar

Deep Dive into Apache Kafkaconfluent

Handle Large Messages In Apache KafkaJiangjie Qin

Room 1 - 7 - Lê Quốc Đạt - Upgrading network of Openstack to SDN with Tungste...Vietnam Open Infrastructure User Group

Hello, kafka! (an introduction to apache kafka)Timothy Spann

Diving into the Deep End - Kafka Connectconfluent

OpenTelemetry For ArchitectsKevin Brockhoff

Getting Started Monitoring with Prometheus and GrafanaSyah Dwi Prihatmoko

Scalability, Availability & Stability PatternsJonas Bonér

Similar to How is Kafka so Fast? (20)

Fundamentals of Apache KafkaChhavi Parasher

Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent

Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang

To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.

Stream Processing @ LyftJamie Grier

Lyft's streaming platform uses Apache Flink for stream processing and Apache Kafka for messaging. Flink was chosen for its capabilities around state management, exactly-once processing, and flexible APIs. Kafka was chosen for its durability, low latency, and consumer fanout. However, open problems remain around rescaling Kafka while preserving per-key ordering, enabling dynamic stream computations, long-term event storage, and zero downtime deployments. Lyft is working to solve these challenges as it builds out its next generation streaming platform.

Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent

RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.

Building High-Throughput, Low-Latency Pipelines in Kafkaconfluent

William Hill is one of the UK’s largest, most well-established gaming companies with a global presence across 9 countries with over 16,000 employees. In recent years the gaming industry and in particular sports betting, has been revolutionised by technology. Customers now demand a wide range of events and markets to bet on both pre-game and in-play 24/7. This has driven out a business need to process more data, provide more updates and offer more markets and prices in real time. At William Hill, we have invested in a completely new trading platform using Apache Kafka. We process vast quantities of data from a variety of feeds, this data is fed through a variety of odds compilation models, before being piped out to UI apps for use by our trading teams to provide events, markets and pricing data out to various end points across the whole of William Hill. We deal with thousands of sporting events, each with sometimes hundreds of betting markets, each market receiving hundreds of updates. This scales up to vast numbers of messages flowing through our system. We have to process, transform and route that data in real time. Using Apache Kafka, we have built a high throughput, low latency pipeline, based on Cloud hosted Microservices. When we started, we were on a steep learning curve with Kafka, Microservices and associated technologies. This led to fast learnings and fast failings. In this session, we will tell the story of what we built, what went well, what didn’t go so well and what we learnt. This is a story of how a team of developers learnt (and are still learning) how to use Kafka. We hope that you will be able to take away lessons and learnings of how to build a data processing pipeline with Apache Kafka.

Consensus in Apache Kafka: From Theory to Production.pdfGuozhang Wang

Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!ScyllaDB

Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent

Real time data pipline with kafka streamsYoni Farin

Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...Erik Onnen

The document discusses Urban Airship's use of Apache Kafka for processing continuous data streams. It describes how Urban Airship uses Kafka for analytics, operational data, and presence data. Producers write device data to Kafka topics, and consumers create indexes from the data in databases like HBase and write to operational data warehouses. The document also covers Kafka concepts, best use cases, limitations, and examples of data structures for storing device metadata in Kafka streams.

Kafka ExplainatonNguyenChiHoangMinh

Making Apache Kafka Even Faster And More ScalablePaulBrebner2

Distributed messaging through KafkaDileep Kalidindi

This document discusses new age distributed messaging using Apache Kafka. It begins with an introduction to Kafka concepts like topics, partitions, producers and consumers. It then explains how Kafka uses commit log architecture and an append-only log structure to provide high throughput performance. The document also covers how Zookeeper is used to coordinate Kafka brokers and keep metadata. It evaluates Kafka's performance based on LinkedIn benchmarks, finding that its lack of acknowledgements, batching and storage format allow for very fast publishing and consumption of messages. In conclusion, the document suggests Kafka could be introduced in some parts of Responsys' architecture to handle big data workloads.

Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent

This document introduces Kafka Streams and provides an example of using it to process streaming data from Apache Kafka. It summarizes some key limitations of using Apache Spark for streaming use cases with Kafka before demonstrating how to build a simple text processing pipeline with Kafka Streams. The document also discusses parallelism, state stores, aggregations, joins and deployment considerations when using Kafka Streams. It provides an example of how Kafka Streams was used to aggregate metrics from multiple instances of an application into a single stream.

Web Analytics using Kafka - August talk w/ Women Who CodePurnima Kamath

Purnima Kamath's presentation discusses using Apache Kafka for web analytics. It introduces Kafka as a distributed commit log that can throttle high volumes of event data from web servers to prevent request drop-offs. The presentation covers Kafka's publish-subscribe model using topics and partitions, how it guarantees ordering and allows for replays. It also demonstrates how Kafka Streams enables real-time extract, transform, load operations on streaming data and maintains application state in local stores. The demo shows a sample web analytics pipeline using Kafka to capture device, gender, browser and preference change events.

Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement VMware Tanzu

This document provides an agenda for a hands-on introduction and hackathon kickoff for Apache Geode. The agenda includes details about the hackathon, an introduction to Apache Geode including its history and key features, a hands-on lab to build, run, and use Geode, and a Q&A session. It also outlines how to contribute to the Geode project through code, documentation, issue tracking, and mailing lists.

Apache Performance Tuning: Scaling OutSander Temme

World of Tanks Experience of Using KafkaLevon Avakyan

Tuning kafka pipelinesSumant Tambe

Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high-performance. Select configuration parameters and deployment topologies essential to achieve higher throughput and low latency across the pipeline are discussed. Lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100GB data under 25 minutes is discussed.

Fundamentals of Apache KafkaChhavi Parasher

Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent

Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang

Stream Processing @ LyftJamie Grier

Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...confluent

Building High-Throughput, Low-Latency Pipelines in Kafkaconfluent

Consensus in Apache Kafka: From Theory to Production.pdfGuozhang Wang

Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!ScyllaDB

Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent

Real time data pipline with kafka streamsYoni Farin

Data Models and Consumer Idioms Using Apache Kafka for Continuous Data Stream...Erik Onnen

Kafka ExplainatonNguyenChiHoangMinh

Making Apache Kafka Even Faster And More ScalablePaulBrebner2

Distributed messaging through KafkaDileep Kalidindi

Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent

Web Analytics using Kafka - August talk w/ Women Who CodePurnima Kamath

Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement VMware Tanzu

Apache Performance Tuning: Scaling OutSander Temme

World of Tanks Experience of Using KafkaLevon Avakyan

Tuning kafka pipelinesSumant Tambe

How is Kafka so Fast?

1. How is Kafka so fast? Understanding the design of Kafka and how it handles the Criteo workload. Ricardo Paiva and Hervé Rivière

2. What is Kafka?

3. What is Kafka?
• Apache Kafka is a distributed message queue
• Open-sourced by LinkedIn in 2011
• High-throughput
• Highly distributed
• Fault-tolerant
• Low-latency

4. Kafka @ Criteo?
• Use cases:
  • GLUP pipeline (aka Kafka Local)
  • Streaming event processing platform (aka Kafka Stream)
• Some figures:
  • 14 clusters / 200 servers / 7 DC
  • Up to 7 million messages / sec
  • Up to 150 TB processed per day

5. Topics, Partitions and Offsets
[Figure: Topic A with Partition 0 and Partition 1; Topic B with Partitions 0 to 3. Each partition is a sequence of numbered offsets starting at 0, ordered from old to new, with writes arriving at the new end.]
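The topic / partition / offset model on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's implementation; the `Topic` class, its methods, and the record contents are hypothetical names chosen for the example.

```python
class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an independent, ordered list of raw byte records.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append a record to one partition and return its offset."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1  # offsets start at 0 and grow by 1 per record

    def read(self, partition, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self.partitions[partition][offset:offset + max_records]


# Hypothetical topic with two partitions, like Topic A on the slide.
topic_a = Topic("topic-a", num_partitions=2)
first = topic_a.append(0, b"bid-1")
second = topic_a.append(0, b"bid-2")
other = topic_a.append(1, b"bid-3")
```

Offsets are per partition: `other` starts again at 0 because Partition 1 has its own sequence, which is why ordering only holds inside a partition.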

6. Complexity inside the clients

7. Brokers
• Manage partitions
  • Receive records from producers for a (topic, partition)
  • Answer consumers asking for records at (topic, partition, offset)
• Manage replicas
• Manage consumer coordination
  • Assign the right partitions to the right consumers
[Figure: a producer and two consumers talking to Broker 1 and Broker 2. The consumers send "Fetch (Topic A, Partition 4, Offset 10)" and "Fetch (Topic B, Partition 1, Offset 10)"; the brokers reply with bytes.]

8. Producers
• Producers decide which partitions to send to;
• Producers can send a batch of messages;
• Producers can compress a batch;
• Producers wait for acknowledgement from the broker (acks=1) or from the broker and its replicas (acks=all).
[Figure: a producer sends to the broker that is partition leader, which replicates to two replica brokers and returns an ack.]
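To make the producer-side work concrete, here is a small sketch of choosing a partition from a key and compressing a whole batch at once. It is an illustration under stated assumptions: the length-prefixed batch layout, the function names, and the CRC32 hash are all hypothetical stand-ins and are not Kafka's wire format or its partitioner.

```python
import zlib


def choose_partition(key, num_partitions):
    """Pick a partition from the record key (stand-in hash, not Kafka's partitioner)."""
    return zlib.crc32(key) % num_partitions


def encode_batch(records):
    """Length-prefix each record, join them, and compress the whole batch once."""
    payload = b"".join(len(r).to_bytes(4, "big") + r for r in records)
    return zlib.compress(payload)


def decode_batch(blob):
    """Reverse encode_batch: decompress once, then split on the length prefixes."""
    payload = zlib.decompress(blob)
    records, pos = [], 0
    while pos < len(payload):
        size = int.from_bytes(payload[pos:pos + 4], "big")
        records.append(payload[pos + 4:pos + 4 + size])
        pos += 4 + size
    return records


# Hypothetical, highly repetitive events, as event streams often are.
records = [b'{"event":"click","campaign":%d}' % i for i in range(1000)]
batch = encode_batch(records)
# Total size if every record were compressed on its own instead.
individually = sum(len(zlib.compress(r)) for r in records)
```

Comparing `len(batch)` with `individually` shows why batching and compression go together: the compressor can exploit repetition across records, and the broker can store and forward the compressed batch as opaque bytes.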

9. Consumers
• Consumers control which offset to consume from;
• Consumers commit offsets to Kafka, but the offset store is just another Kafka topic;
• Consumers can receive batched and/or compressed data;
• Kafka coordinates which partitions each consumer will consume from.
[Figure: a consumer and a broker holding Partition 2 (offsets 6 to 12). (1) The consumer sends a fetch for Partition 2 at a given offset, (2) the broker returns records 7, 8 and 9, (3) the consumer commits offset=9.]
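The consumer-side bookkeeping can be sketched as follows. This is a toy model, not the Kafka client API: the `Consumer` class and its `poll`, `commit` and `seek` methods are hypothetical, and the partition is just a Python list.

```python
class Consumer:
    """Toy consumer that tracks its own position in one partition."""

    def __init__(self, log, committed_offset=0):
        self.log = log                      # the partition: a list of records
        self.position = committed_offset    # next offset to fetch
        self.committed = committed_offset   # last position saved durably

    def poll(self, max_records=3):
        """Fetch a batch starting at the current position."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        """Record progress; in Kafka this would be written to an internal topic."""
        self.committed = self.position

    def seek(self, offset):
        """Rewind or skip: the consumer, not the broker, chooses the offset."""
        self.position = offset


# Hypothetical partition holding offsets 0 to 12.
partition_2 = [f"record-{i}".encode() for i in range(13)]
consumer = Consumer(partition_2, committed_offset=7)
batch = consumer.poll()
consumer.commit()
```

In this sketch the committed value is the next offset to read (10 after consuming 7, 8 and 9), whereas the slide's diagram labels the commit with the last consumed offset (9). Either way, the broker keeps no per-message delivery state; a consumer that restarts simply resumes from its committed offset.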

10. Did you say SSD is better than HDD?

11. Faster, but not by that much

12. Only sequential I/O
• Each Kafka partition is mapped to segment files
• Segment file: append-only log structure
• Records are immutable
• The broker does very few random disk seeks
[Figure: Kafka writing to the active segment file, next to the old segment files.]
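The segment-file idea can be illustrated with a toy append-only log that rolls to a new file once the active one reaches a size limit. This is a sketch under assumptions: the `SegmentedLog` class, the length-prefixed record layout, and the tiny 64-byte limit are hypothetical; only the habit of naming a segment after the offset of its first record mirrors what Kafka does on disk.

```python
import os
import tempfile


class SegmentedLog:
    """Toy append-only log that rolls to a new segment file at a size limit."""

    def __init__(self, directory, max_segment_bytes=64):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        self.next_offset = 0
        self._roll()

    def _roll(self):
        # Name each segment after the offset of its first record.
        name = f"{self.next_offset:020d}.log"
        self.active = open(os.path.join(self.directory, name), "ab")

    def append(self, record):
        # Roll before writing if the active segment is already full.
        if self.active.tell() >= self.max_segment_bytes:
            self.active.close()
            self._roll()
        # Only ever write at the end of the active file: sequential I/O.
        self.active.write(len(record).to_bytes(4, "big") + record)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def close(self):
        self.active.close()


with tempfile.TemporaryDirectory() as tmp:
    log = SegmentedLog(tmp, max_segment_bytes=64)
    offsets = [log.append(b"x" * 20) for _ in range(6)]
    log.close()
    segments = sorted(os.listdir(tmp))
```

Old segments are never modified after a roll, so retention can be handled by deleting whole files rather than rewriting data in place.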

13. Caching data for free
• Kafka relies on the native Linux page cache (read-ahead and write-behind)
  • An off-heap cache for the JVM, for free
• Kafka records aren't deserialized in the Kafka JVM
  • No Java object memory overhead
  • No OutOfMemory issue
  • No big GC pauses
[Figure: Kafka, the OS and the disk, with the active segment file and the old segment files.]

14. Reliability with replication
• Kafka disk writes are asynchronous
• Kafka replica synchronisation (over the network) is synchronous
• Replicas are trusted in case of data corruption or a server crash
[Figure: a broker that is partition leader, a replica broker, and a broker that was the previous leader.]

15. Zero Copy

16. Sending data from file to network (traditional approach)
    read(file, tmp_buf, len);
    write(socket, tmp_buf, len);

17. Sending data from file to network (zero-copy approach)
    transferTo(position, count, writableChannel);
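The two approaches on slides 16 and 17 can be compared side by side in a runnable form. The sketch below uses Python rather than the Java `transferTo` call shown on the slide, purely for illustration: `socket.sendfile()` delegates to `os.sendfile()` where the platform supports it and falls back to ordinary sends otherwise. The local socket pair, the 4 KB payload and the `recv_exact` helper are assumptions made to keep the example self-contained.

```python
import os
import socket
import tempfile


def recv_exact(sock, size):
    """Read exactly size bytes from a socket (hypothetical helper)."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            break
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


payload = os.urandom(4096)  # stands in for the bytes of a segment file

with tempfile.TemporaryFile() as segment:
    segment.write(payload)
    segment.flush()

    sender, receiver = socket.socketpair()
    with sender, receiver:
        # Traditional path: file -> user-space buffer -> socket.
        segment.seek(0)
        tmp_buf = segment.read(len(payload))
        sender.sendall(tmp_buf)
        copied = recv_exact(receiver, len(payload))

        # Zero-copy path: no tmp_buf in the application; where os.sendfile()
        # is available the kernel moves the bytes from file to socket itself.
        sent = sender.sendfile(segment, offset=0, count=len(payload))
        zero_copied = recv_exact(receiver, len(payload))
```

Both paths deliver the same bytes. The difference is that the second never materialises the data in the application's memory, which fits a broker that stores and forwards records without deserializing them.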

18. Make things simple

19. Key takeaways
• Parallelism based on topic partitions;
• Data compressed/uncompressed on the client;
• Producers send a batch of messages;
• No serialization/deserialization costs on the brokers;
• Writing directly to file:
  • Append only (cheaper disks);
  • No complex data structure (no B-tree or LSM tree);
• Uses OS memory management;
• Relies on replicas, not on disks;
• Zero-copy.

20. Thank you! #rivers

Editor's Notes

#2: Quick introduction of each speaker, then a short agenda: first Kafka basics, then the design choices that made it a great tool at our scale.
#3: Why this name: simply because the initial creator (Jay Kreps) liked the author, liked the fact that he was a writer, and thought it was a good name for an open-source project.
#6: A topic is like a table in a database, but for a queue; in a queue we call it a topic. You send messages to the bid request topic and receive messages from the billable click topic. A partition is a section of a topic: here Topic A has two partitions and Topic B has four. Partitions are spread over different servers, but a single partition always fits on one server. For example, the bid request topic has 1000 partitions. Ordering is guaranteed only within a partition, and each message has a monotonically increasing offset. Focus on: Kafka just stores bytes, with no schema, so you can send an image through Kafka if you want (not a wonderful idea, but it works).
#7: The first thing we want to explain is that the complexity is not in the server but in the client.
#8: Producer and consumer. The broker is dumb. Differences from RabbitMQ or other queues: you can have a huge queue if you want (cf. an event sourcing store), the limit being disk; the broker does not track the status of a message (whether it was received or not); and consumers pull, the broker does not push. You can group consumers together into a consumer group and so build a distributed application. The broker manages the coordination of consumers to assign the right partitions to the right consumers.
#9: Focus on: no SPOF and no broker acting as a gateway for the cluster; the producer maintains the mapping (topic, partition) -> broker. A batch is only logical: one physical message (one send request / ack) contains several messages. Batch advantages: compression is efficient and network acks are efficient, for instance one ack per 1,000 messages.
#10: Warning: consumers receive compressed batches only if the producer sent them that way.
#12: Cost efficiency + highest performance. The advantage here is being able to use JBOD or RAID. SSDs would cost more for equal or even lower performance.
#13, #14: Same caching approach as Varnish (the HTTP cache server). Designed to work with Linux only. A 4 GB heap is enough because it holds no data (only metadata and client connections).
#15: Disk writes are asynchronous (and that's OK because replication over the network is synchronous).
