Apache Kafka
Introduction
Kumar Shivam
A distributed streaming platform
History
• Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache
Software Foundation, written in Scala and Java.
• Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java
stream processing library.
• Kafka uses a binary TCP-based protocol
Use cases
• Messaging system
• Activity Tracking
• Gather metrics from many different locations
• Application logs gathering
• Stream processing (with the Kafka Streams API or Spark, for example)
• Decoupling of system dependencies.
• Integration with Spark, Flink, Storm, Hadoop and many other big data technologies.
Application data flow (without using Kafka)
Application data flow (using Kafka)
Apache kafka
Companies Use cases
• Netflix - uses Kafka to apply recommendations in real time while users watch TV shows.
• Uber - uses Kafka to gather user, taxi and trip data in real time to compute and forecast demand and to compute surge pricing in real time.
• LinkedIn - uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time.
• Spotify - Kafka is used at Spotify as part of their log delivery system.
• Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for realtime learning
analytics/dashboards.
• Oracle - Oracle provides native connectivity to Kafka from its Enterprise Service Bus product called OSB (Oracle Service
Bus) which allows developers to leverage OSB built-in mediation capabilities to implement staged data pipelines.
• Trivago - Trivago uses Kafka for stream processing in Storm as well as processing of application logs.
• Zalando - As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps the company transition from a monolithic to a microservices architecture. Using Kafka for processing event streams enables its technical team to do near-real-time business intelligence.
Kafka in ERP
Jargons
• Topics (category)
• Partition
• Offset
• Replicas
• Broker
• Cluster
• Producers
• Consumer
• Leader
• Follower
Topic (Category)
A stream of messages belonging to a particular
category is called a topic. Data is stored in
topics.
Partition
• Topics are split into partitions.
• A partition contains messages in an immutable, ordered sequence.
• A partition is implemented as a set of segment files of equal size.
• Data, once written to a partition, is immutable.
Offset
Each message stored in a partition is assigned an incremental ID (a unique sequence ID within that partition) called the "offset".
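To make the offset idea concrete, here is a minimal in-memory sketch. It assumes a partition can be modeled as an append-only Python list; the class name `PartitionLog` and its methods are hypothetical illustrations, not part of any Kafka client API.

```python
class PartitionLog:
    """Toy model of one partition: an append-only, ordered sequence of messages."""

    def __init__(self):
        self._messages = []  # existing entries are never modified or removed

    def append(self, message):
        """Store a message and return the offset assigned to it."""
        offset = len(self._messages)  # next sequential ID within this partition
        self._messages.append(message)
        return offset

    def read(self, offset):
        """Return the message stored at the given offset."""
        return self._messages[offset]


log = PartitionLog()
first = log.append("order-created")
second = log.append("order-paid")
print(first, second)    # offsets are handed out in increasing order
print(log.read(first))  # a message is looked up by its offset
```

Offsets are only meaningful inside one partition; two partitions of the same topic each start their own sequence.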
Replicas
• A replica is a backup of a partition.
• Replication factor: the number of copies of the data kept across multiple brokers.
Replicas
• Partition 0 of Topic X is available on Broker 0, and Partition 1 is placed on a broker in the same way.
• Problem:
• Broker 2 holds both actual data (Topic X, Partition 1) and replicated data (Topic X, Partition 0), so it is unclear which copy of a partition should serve requests.
• Solution:
• For each partition, choose one broker's copy as the leader and the rest as followers.
Brokers (containers)
• A broker is the system responsible for maintaining the published data.
• Each broker holds multiple topics with multiple partitions.
• Brokers are described as stateless: they rely on ZooKeeper to maintain cluster state.
• 1 Kafka broker = ~1 million reads/writes per second, depending on hardware and message size.
• Handles TBs of messages without a performance hit.
• Each broker in the cluster is identified by an ID.
• Kafka brokers are also known as bootstrap brokers, because a connection to any one broker means a connection to the entire cluster.
Kafka Clusters
• A Kafka deployment with more than one broker is called a Kafka cluster.
• A Kafka cluster can be expanded without downtime.
• Clusters are used to manage the persistence and replication of message data.
• A cluster typically consists of multiple brokers to balance load.
Kafka Ecosystem
Producer
• The publisher of messages to one or
more Kafka topics
Consumer
• Consumers read data from brokers.
• Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
Leaders
• Node responsible for all reads and writes
for the given partition.
Follower
• A node that follows the leader's instructions is called a follower.
• If the leader fails, one of the followers automatically becomes the new leader.
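The failover rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function `elect_leader` and its arguments are hypothetical, and the sketch simply promotes the first reachable replica. Kafka itself restricts the choice to replicas that are in sync with the failed leader, which this toy version ignores.

```python
def elect_leader(replicas, alive, current_leader):
    """Return the leader for a partition, promoting a follower if the leader is down.

    replicas: broker IDs holding a copy of the partition, in preference order.
    alive: set of broker IDs that are currently reachable.
    """
    if current_leader in alive:
        return current_leader  # the leader is healthy, nothing changes
    for broker_id in replicas:
        if broker_id in alive:
            return broker_id  # first live follower takes over
    return None  # no replica is reachable; the partition is unavailable


replicas = [0, 1, 2]
leader = elect_leader(replicas, alive={0, 1, 2}, current_leader=0)
after_failure = elect_leader(replicas, alive={1, 2}, current_leader=0)
print(leader, after_failure)
```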
ZooKeeper
• It manages and coordinates Kafka brokers.
• It is used to notify producers and consumers about the presence or failure of any broker in the Kafka system.
• On a failure, producers and consumers can then decide to start coordinating their work with another broker.
Kafka Producers
• How does the producer write data to the cluster?
• Message keys
• Acknowledgment
• Message keys let the producer send messages in a specific order. The key gives the producer two choices:
• Send the data across all partitions
• If the key is NULL, the data is sent without a key and is distributed across the partitions in a round-robin manner.
• Send the data to a specific partition
• If the key is not NULL, the key is attached to the data, and all messages with the same key are always delivered to the same partition.
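The two choices above can be illustrated with a small partition chooser. This is a sketch, not Kafka's partitioner: the class `SketchPartitioner` is hypothetical, and `zlib.crc32` is used only as a convenient stable hash (Kafka's Java client hashes keys with murmur2).

```python
import itertools
import zlib


class SketchPartitioner:
    """Toy partition chooser: round-robin without a key, hash-based with a key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key):
        if key is None:
            # No key: spread messages evenly over all partitions.
            return next(self._round_robin)
        # Key present: a stable hash maps the same key to the same partition.
        # zlib.crc32 is a stand-in here, not the hash Kafka itself uses.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)
keyless = [partitioner.partition_for(None) for _ in range(6)]
keyed = [partitioner.partition_for("Prod_id_1") for _ in range(3)]
print(keyless)  # cycles through every partition in turn
print(keyed)    # the same partition every time
```

Because the partition is derived from the key modulo the partition count, the same-key guarantee holds only while the number of partitions stays unchanged.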
Without key
• Scenario in which a producer writes data to the Kafka cluster without a key.
With key
• Scenario in which a producer specifies a key, Prod_id (for example, Prod_id_1 and Prod_id_2).
Acknowledgment
• When writing data to the Kafka cluster, the producer can also choose the level of acknowledgment it waits for after a message is sent and received.
Case 1
• The producer sends data to the broker but does not receive any acknowledgment.
• acks = 0: the producer sends the data to the broker but does not wait for an acknowledgment, so data can be lost without the producer noticing.
Case 2 (half-duplex)
• The producer sends data to the broker and receives an acknowledgment from the leader only.
• acks = 1: the producer waits for the leader's acknowledgment. The leader acknowledges once it has written the data to its own log, without waiting for the followers.
• Example: the producer sends data to the brokers, and Broker 1 holds the leader of the partition. Once Broker 1 has successfully stored the data, it sends the acknowledgment back to the producer. If the leader fails before the followers have copied the data, the data can still be lost.
Case 3 (full-duplex)
• The producer sends data to the broker and receives an acknowledgment only after both the leader and its followers have the data.
• acks = all: the leader and all of its in-sync followers must have the data before the acknowledgment is sent.
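The three cases can be summarized as a small decision function. This is a simplified model rather than producer code: `is_acknowledged` and its parameters are hypothetical names, and the followers list stands for the in-sync followers only.

```python
def is_acknowledged(acks, leader_wrote, followers_replicated):
    """Return True if the producer would treat the send as successful.

    acks: "0", "1", or "all", mirroring the three cases above.
    leader_wrote: whether the partition leader stored the message.
    followers_replicated: one boolean per in-sync follower.
    """
    if acks == "0":
        return True  # fire and forget: the producer never waits for a reply
    if acks == "1":
        return leader_wrote  # only the leader has to confirm
    if acks == "all":
        return leader_wrote and all(followers_replicated)
    raise ValueError(f"unsupported acks value: {acks!r}")


# The leader stored the message, but one follower has not replicated it yet.
for level in ("0", "1", "all"):
    result = is_acknowledged(level, leader_wrote=True, followers_replicated=[True, False])
    print(level, result)
```

The trade-off runs in one direction: each step from 0 to 1 to all adds waiting time and reduces the chance of losing an acknowledged message.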
Kafka Core APIs
Producer Consumer
Comparison: Apache Kafka (Kafka Streams) vs. Apache Spark (Spark Streaming)
• Developers
  Kafka: Originally developed by LinkedIn and later donated to the Apache Software Foundation.
  Spark: Originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation.
• Infrastructure
  Kafka: It is a Java client library, so it can run wherever Java is supported.
  Spark: It runs on top of the Spark stack, which can be Spark standalone, YARN, or container-based.
• Data sources
  Kafka: It processes data from Kafka itself via topics and streams.
  Spark: It ingests data from various sources such as files, Kafka, and sockets.
• Processing model
  Kafka: It processes events as they arrive, using an event-at-a-time (continuous) processing model.
  Spark: It uses a micro-batch processing model, splitting the incoming streams into small batches for further processing.
• Latency
  Kafka: It has lower latency than Apache Spark.
  Spark: It has higher latency.
• ETL transformation
  Kafka: It is not supported in Apache Kafka.
  Spark: This transformation is supported in Spark.
• Fault tolerance
  Kafka: Fault tolerance is complex in Kafka.
  Spark: Fault tolerance is easy in Spark.
• Language support
  Kafka: It mainly supports Java.
  Spark: It supports multiple languages such as Java, Scala, R, and Python.
• Use cases
  Kafka: The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data.
  Spark: Booking.com and Yelp (ad platform) use Spark Streaming to handle millions of ad requests per day.
Apache kafka
Interact with Apache Kafka clusters in Azure
HDInsight using a REST proxy
How can we use Spark, Kafka and Cassandra
to build a robust analytical platform?
• Concerns?
1. High data flow
Concern 1: Many orders are placed on the Walmart website every second, and item availability also changes frequently. Keeping the data up to date (which can mean 100 MB per second) requires streaming information to the analytics platform in real time.
Solution: Kafka is a distributed, scalable, fault-tolerant messaging system that provides streaming support by default.
How can we use Spark, Kafka and Cassandra
to build a robust analytical platform?
• Concerns?
2. Storing terabytes of data with frequent updates
Concern 2: To store item availability data, we needed a datastore that can process a huge number of upserts without compromising on performance. Reports also had to be generated from data processed every few hours, so reads had to be fast too.
Solution: Although an RDBMS can store a large amount of data, it cannot provide reliable upsert and read performance at this scale. We had good experience with Cassandra in the past, so it was the first choice. Apache Cassandra offers strong write and read performance. Like Kafka, it is distributed, highly scalable and fault-tolerant.
How can we use Spark, Kafka and Cassandra
to build a robust analytical platform?
• Concerns?
3. Processing a huge amount of data
Concern 3: Data processing had to be carried out at two places in the pipeline.
1. During writes, where we have to stream data from Kafka, process it and save it to Cassandra.
2. While generating business reports, where we have to read the complete Cassandra table, join it with other data sources and aggregate it on multiple columns.
Solution: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
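The two processing steps can be sketched with plain Python stand-ins. Nothing here uses Kafka, Spark or Cassandra: a list plays the role of the topic, a dictionary plays the role of the Cassandra table, and the item IDs and quantities are invented sample data. The helper names `micro_batches` and `upsert` are hypothetical.

```python
def micro_batches(events, batch_size):
    """Group a stream of events into small batches, as a micro-batch engine would."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def upsert(table, batch):
    """Write each event into the table, overwriting any earlier value for the same item."""
    for item_id, available in batch:
        table[item_id] = available


# Invented availability updates, standing in for messages read from a Kafka topic.
events = [("item-1", 10), ("item-2", 4), ("item-1", 9), ("item-3", 0), ("item-2", 3)]

table = {}  # stands in for a Cassandra table keyed by item ID

# Step 1: stream the updates, process them batch by batch, and save them.
for batch in micro_batches(events, batch_size=2):
    upsert(table, batch)

# Step 2: report generation reads the whole table and aggregates it.
in_stock = sum(1 for quantity in table.values() if quantity > 0)
print(table)
print(in_stock)
```

The upsert keeps only the latest value per item, which is why write performance under frequent updates mattered when choosing the datastore.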
How can we use Spark, Kafka and Cassandra
to build a robust analytical platform?
Spark batch job
Security
• Data encryption among brokers and between clients and brokers
• Using SSL
• Authentication modes between clients and brokers
• Using SSL (mutual authentication)
• Using SASL (e.g. Kerberos or SCRAM-SHA)
• Authorization of read/write operations by clients
• ACLs on topics.
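As an illustration of the client side of these options, the sketch below builds client settings for SSL and for SASL/SCRAM. The property names are standard Kafka client configuration keys; the helper functions, file paths, user name and passwords are placeholders chosen for this example, and the choice of SCRAM-SHA-256 is an assumption.

```python
def client_security_config(mode):
    """Return Kafka client settings for the chosen security mode.

    Every path, user name and password below is a placeholder.
    """
    if mode == "ssl":
        return {
            "security.protocol": "SSL",
            "ssl.truststore.location": "/path/to/client.truststore.jks",
            "ssl.truststore.password": "<truststore-password>",
            # The keystore is needed only when brokers require mutual authentication.
            "ssl.keystore.location": "/path/to/client.keystore.jks",
            "ssl.keystore.password": "<keystore-password>",
        }
    if mode == "sasl_scram":
        return {
            "security.protocol": "SASL_SSL",
            "sasl.mechanism": "SCRAM-SHA-256",
            "sasl.jaas.config": (
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                'username="<user>" password="<password>";'
            ),
            "ssl.truststore.location": "/path/to/client.truststore.jks",
            "ssl.truststore.password": "<truststore-password>",
        }
    raise ValueError(f"unsupported mode: {mode!r}")


def to_properties(config):
    """Render the settings in the key=value form used by .properties files."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print(to_properties(client_security_config("sasl_scram")))
```

Authorization is not part of the client settings: ACLs on topics are defined on the cluster side.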
Thank you!
Keep in touch.
https://www.linkedin.com/in/kumar-shivam-3a07807b/
Kshivam@firstam.com
https://github.com/ThirstyBrain
Apache kafka

  • 1. Apache Kafka Introduction Kumar Shivam A distributed streaming platform
  • 2. History • Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. • Kafka can connect to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. • Kafka uses a binary TCP-based protocol
  • 3. Use cases • Messaging system • Activity tracking • Gathering metrics from many different locations • Gathering application logs • Stream processing (with the Kafka Streams API or Spark, for example) • Decoupling of system dependencies • Integration with Spark, Flink, Storm, Hadoop, and many other big data technologies.
  • 7. Companies' use cases • Netflix - uses Kafka to apply recommendations in real time while users watch TV shows. • Uber - uses Kafka to gather user, taxi, and trip data in real time to compute and forecast demand and to compute surge pricing. • LinkedIn - uses Kafka to prevent spam and to collect user interactions to make better connection recommendations in real time. • Spotify - Kafka is used at Spotify as part of their log delivery system. • Coursera - At Coursera, Kafka powers education at scale, serving as the data pipeline for real-time learning analytics and dashboards. • Oracle - Oracle provides native connectivity to Kafka from its Enterprise Service Bus product, OSB (Oracle Service Bus), which allows developers to leverage OSB's built-in mediation capabilities to implement staged data pipelines. • Trivago - Trivago uses Kafka for stream processing in Storm as well as for processing application logs. • Zalando - As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps it transition from a monolithic to a microservices architecture. Using Kafka for processing event streams enables its technical team to do near-real-time business intelligence.
  • 9. Jargon • Topic (category) • Partition • Offset • Replica • Broker • Cluster • Producer • Consumer • Leader • Follower
  • 10. Topic (category) • A stream of messages belonging to a particular category is called a topic. Data is stored in topics.
  • 11. Partition • Topics are split into partitions. • A partition contains messages in an immutable, ordered sequence. • A partition is implemented as a set of segment files of equal size. • Data, once written to a partition, is immutable.
  • 12. Offset • Each message stored in a partition is assigned an incremental ID (i.e., a unique sequence ID) called an offset.
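
The partition and offset ideas above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Partition` class is hypothetical, it keeps records in a Python list rather than in segment files, and it is not Kafka's storage code.

```python
class Partition:
    """Hypothetical append-only log: each record gets the next offset."""

    def __init__(self):
        self._records = []  # position in this list doubles as the offset

    def append(self, value):
        """Store a record at the end of the log and return its offset."""
        offset = len(self._records)
        self._records.append(value)
        return offset

    def read(self, offset):
        """Return the record stored at the given offset."""
        return self._records[offset]


partition = Partition()
first = partition.append("order-created")
second = partition.append("order-paid")
print(first, second)
print(partition.read(first))
```

Records are only ever appended, never rewritten, which is why an offset keeps identifying the same message for as long as that message is retained.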
  • 13. Replicas • A replica is a backup of a partition. • Replication factor: the number of copies of the data kept across multiple brokers.
  • 14. Replicas • Topic X, Partition 0 is available on Broker 0, and similarly for Partition 1. • Problem: Broker 2 holds both actual data (Topic X, Partition 1) and replicated data (Topic X, Partition 0). • Solution: Choose one broker's copy of a partition as the leader and the rest as followers.
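
As a rough illustration of how a replication factor spreads copies of each partition over brokers, the sketch below assigns replicas in round-robin order and treats the first replica in each list as the leader. The function `assign_replicas` and the placement rule are assumptions made for this example; Kafka's real assignment logic is more involved.

```python
def assign_replicas(num_partitions, broker_ids, replication_factor):
    """Hypothetical placement: map each partition to a list of broker IDs.

    The first broker in each list plays the leader; the rest are followers.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the broker count")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            broker_ids[(partition + r) % len(broker_ids)]
            for r in range(replication_factor)
        ]
    return assignment


# Two partitions of one topic, three brokers, two copies of each partition.
assignment = assign_replicas(2, [0, 1, 2], 2)
print(assignment)
```

In this toy layout one broker holds the follower copy of partition 0 and the leader copy of partition 1, which mirrors the situation described on the slide where a single broker stores both actual and replicated data.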
  • 15. Brokers (containers) • The system responsible for maintaining the published data. • Holds multiple topics with multiple partitions. • Brokers are stateless. • 1 Kafka broker = ~1 million reads/writes per second. • Handles terabytes of messages without a performance hit. • Each broker in the cluster is identified by an ID. • A Kafka broker is also known as a bootstrap broker, because a connection to any one broker means a connection to the entire cluster.
  • 16. Kafka Clusters • A Kafka deployment with more than one broker is called a Kafka cluster. • A Kafka cluster can be expanded without downtime. • Clusters are used to manage the persistence and replication of message data. • A cluster typically consists of multiple brokers to balance load.
  • 17. Producer • The publisher of messages to one or more Kafka topics.
  • 18. Consumer • Reads data from brokers. • A consumer subscribes to one or more topics and consumes published messages by pulling data from the brokers.
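
The pull model can be sketched as a consumer that remembers the next offset it wants and asks for records from that point. The `Consumer` class below is hypothetical and reads from a plain Python list standing in for one partition; it does not use the Kafka client API.

```python
class Consumer:
    """Hypothetical consumer that pulls records and tracks its position."""

    def __init__(self, log):
        self._log = log      # a list standing in for one partition
        self.position = 0    # offset of the next record to fetch

    def poll(self, max_records=2):
        """Pull up to max_records starting at the current position."""
        batch = self._log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch


log = ["m0", "m1", "m2"]
consumer = Consumer(log)
batch1 = consumer.poll()
batch2 = consumer.poll()
log.append("m3")          # a producer appends a new record later
batch3 = consumer.poll()
print(batch1, batch2, batch3)
```

Because the consumer decides when to call `poll`, it reads at its own pace, and a slow consumer does not hold up the producer.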
  • 19. Leader • The node responsible for all reads and writes for a given partition.
  • 20. Follower • A node that follows the leader's instructions is called a follower. • If the leader fails, one of the followers automatically becomes the new leader.
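
Leader failover can be pictured with a toy election rule: pick the first replica, in assignment order, whose broker is still alive. The function `elect_leader` is a hypothetical simplification; in a real cluster the new leader is chosen from the in-sync replicas by the cluster's controller.

```python
def elect_leader(replicas, alive):
    """Return the first replica (by assignment order) that is still alive."""
    for broker_id in replicas:
        if broker_id in alive:
            return broker_id
    return None  # no replica available: the partition is offline


replicas = [0, 1, 2]  # broker IDs holding copies of one partition
leader = elect_leader(replicas, alive={0, 1, 2})
after_failure = elect_leader(replicas, alive={1, 2})   # broker 0 went down
print(leader, after_failure)
```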
  • 21. ZooKeeper • Manages and coordinates Kafka brokers. • Notifies producers and consumers about the presence or failure of any broker in the Kafka system, so that on a failure producers and consumers can decide to start coordinating their work with another broker.
  • 22. Kafka Producers • How does the producer write data to the cluster? • Message keys • Acknowledgment • A message key lets the producer send messages in a specific order. The key gives the producer two choices: • Send the data across all partitions: if the key is NULL, the data is sent without a key and is distributed in a round-robin manner (i.e., spread across the partitions). • Send the data to a specific partition: if the key is not NULL, the key is attached to the data, and all messages with the same key are always delivered to the same partition.
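
The two key choices above can be sketched as a small partitioner. This is an illustration under stated assumptions: `SimplePartitioner` is a hypothetical class, and it hashes keys with CRC32 from Python's standard library as a stand-in, whereas Kafka's default partitioner uses a murmur2 hash and newer client versions may group unkeyed records differently.

```python
import itertools
import zlib


class SimplePartitioner:
    """Hypothetical partitioner: round-robin without a key, hash with one."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key):
        if key is None:
            # No key: spread records evenly over the partitions.
            return next(self._round_robin)
        # Key present: the same key always hashes to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = SimplePartitioner(3)
unkeyed = [partitioner.partition(None) for _ in range(4)]
keyed = {partitioner.partition("Prod_id_1") for _ in range(5)}
print(unkeyed)
print(len(keyed))
```

Since every record with the key `Prod_id_1` lands in one partition, and a partition is an ordered log, records for that key are read back in the order they were written.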
  • 23. Without a key • Scenario where a producer writes data to the Kafka cluster without specifying a key.
  • 24. With a key • Scenario where a producer specifies a key such as Prod_id (e.g., Prod_id_1, Prod_id_2).
  • 25. Acknowledgment • When writing data to the Kafka cluster, the producer can also choose the level of acknowledgment it waits for.
  • 26. Case 1 • The producer sends data to the broker but does not receive any acknowledgment. • acks = 0: the producer sends the data to the broker and does not wait for an acknowledgment.
  • 27. Case 2 (half-duplex) • The producer sends data to the broker and receives an acknowledgment from the leader only. • acks = 1: the producer waits for the leader's acknowledgment. For example, if Broker 1 holds the leader replica of the partition, the producer's data goes to Broker 1; once the leader has successfully written the data, it sends the acknowledgment back to the producer, without waiting for the followers.
  • 28. Case 3 (full-duplex) • The producer sends data to the broker and receives an acknowledgment only after both the leader and its followers have the data. • acks = all: the write is acknowledged by both the leader and its followers.
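
The three acknowledgment cases can be compared side by side with a small decision function. It is a hypothetical model, not client code: `is_acknowledged`, `leader_wrote`, and `followers_wrote` are names invented for this sketch, and only the `acks` values 0, 1, and all come from the slides.

```python
def is_acknowledged(acks, leader_wrote, followers_wrote):
    """Would the producer treat this send as acknowledged?

    acks            -- 0, 1, or "all"
    leader_wrote    -- True if the leader replica stored the record
    followers_wrote -- one bool per follower replica
    """
    if acks == 0:
        return True  # fire-and-forget: the producer never waits
    if acks == 1:
        return leader_wrote  # only the leader's write matters
    if acks == "all":
        return leader_wrote and all(followers_wrote)
    raise ValueError(f"unsupported acks value: {acks!r}")


# The leader stored the record, but one of two followers has not yet.
results = {
    acks: is_acknowledged(acks, leader_wrote=True, followers_wrote=[True, False])
    for acks in (0, 1, "all")
}
print(results)
```

The trade-off is durability against latency: lower `acks` settings return sooner but can lose data if the leader fails before followers copy the record.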
  • 30. Comparison
      Parameter | Apache Kafka | Apache Spark
      Developers | Originally developed by LinkedIn; later donated to the Apache Software Foundation. | Originally developed at the University of California, Berkeley; later donated to the Apache Software Foundation.
      Infrastructure | It is a Java client library, so it can run wherever Java is supported. | It runs on top of the Spark stack, which can be Spark standalone, YARN, or container-based.
      Data sources | It processes data from Kafka itself via topics and streams. | Spark ingests data from various files, Kafka, socket sources, etc.
      Processing model | It processes events as they arrive, using an event-at-a-time (continuous) processing model. | It has a micro-batch processing model: it splits the incoming streams into small batches for further processing.
      Latency | Lower latency than Apache Spark. | Higher latency.
      ETL transformation | Not supported in Apache Kafka. | Supported in Spark.
      Fault tolerance | Fault tolerance is complex in Kafka. | Fault tolerance is easy in Spark.
      Language support | It supports mainly Java. | It supports multiple languages, such as Java, Scala, R, and Python.
      Use cases | The New York Times, Zalando, Trivago, etc. use Kafka Streams to store and distribute data. | Booking.com and Yelp (ad platform) use Spark streams for handling millions of ad requests per day.
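
The difference between the two processing models in the comparison can be shown with plain Python generators. This is a conceptual sketch only: the function names are hypothetical, the "processing" is just doubling a number, and batches are cut by record count for simplicity, whereas a micro-batch engine would typically cut them by time interval.

```python
def event_at_a_time(events):
    """Emit one result per event as soon as that event arrives."""
    for event in events:
        yield event * 2


def micro_batch(events, batch_size):
    """Collect events into small batches, then process each batch at once."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield [e * 2 for e in batch]
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield [e * 2 for e in batch]


per_event = list(event_at_a_time([1, 2, 3]))
batched = list(micro_batch([1, 2, 3], batch_size=2))
print(per_event)
print(batched)
```

In the micro-batch version the first event's result is not available until its batch fills, which is the source of the higher latency noted in the comparison.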
  • 32. Interact with Apache Kafka clusters in Azure HDInsight using a REST proxy
  • 33. How can we use Spark, Kafka, and Cassandra to build a robust analytical platform? • Concern 1: High data flow. A lot of orders are placed on the Walmart website every second, and item availability also changes frequently. Keeping up with these updates (which can reach 100 MB per second) means streaming information to the analytics platform in real time. • Solution: Kafka is a distributed, scalable, fault-tolerant messaging system that provides streaming support by default.
  • 34. How can we use Spark, Kafka, and Cassandra to build a robust analytical platform? • Concern 2: Storing terabytes of data with frequent updates. To store item availability data, we needed a datastore that can process a huge number of upserts without compromising on performance. To generate reports, the data had to be processed every few hours, so reads had to be fast too. • Solution: Although an RDBMS can store a large amount of data, it cannot provide reliable upsert and read performance. We had good experience with Cassandra in the past, so it was the first choice. Apache Cassandra offers strong write and read performance. Like Kafka, it is distributed, highly scalable, and fault-tolerant.
  • 35. How can we use Spark, Kafka, and Cassandra to build a robust analytical platform? • Concern 3: Processing a huge amount of data. Data processing had to be carried out at two places in the pipeline: 1. During writes, where we have to stream data from Kafka, process it, and save it to Cassandra. 2. While generating business reports, where we have to read the complete Cassandra table, join it with other data sources, and aggregate it on multiple columns. • Solution: Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
  • 36. How can we use Spark, Kafka, and Cassandra to build a robust analytical platform? • Diagram: Spark batch job
  • 37. Security • Data encryption among brokers and between clients and brokers • Using SSL • Authentication modes between clients and brokers • Using SSL (mutual authentication) • Using SASL (e.g., Kerberos or SCRAM-SHA) • Authorization of read/write operations by clients • ACLs on topics.
  • 38. Thank you! Keep in touch. https://www.linkedin.com/in/kumar-shivam-3a07807b/ [email protected] https://github.com/ThirstyBrain