© 2013 Impetus Technologies - Confidential1
Kafka/Camus Project
Phase I
Mountain View, CA
March 2013
(photos courtesy of LinkedIn)
Agenda
• Objective
• What tool to use?
• Kafka & Camus overview
• Infrastructure
• Architecture
• Performance benchmarks
Objective
• The customer has events (Data, UI) that happen in real time and need to be analyzed
• Immediate need for a batch-oriented mechanism
• Events need to be ETL'ed and analyzed in Hadoop
• Future need for more real-time stream analysis
• Potential bursts of streaming data
What tool to use?
• JMS:
• just an API
• Not cross language
• Painful
• Doesn’t scale
• ActiveMQ
• Didn't work for LinkedIn:
• http://sites.computer.org/debull/A12june/pipeline.pdf
• Apache Flume
Kafka overview
• Distributed, scalable pub/sub system for big data
• Producer -> Broker -> Consumer of message topics
• Multiple clients can consume at different velocities (synchronous/asynchronous)
• Consumer groups parallelize the consumption of messages
• Persists messages, so consumers can rewind and re-read them
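The consumer-group and rewind ideas can be illustrated with a toy in-memory model. This is only a sketch, not the Kafka client API: the class names `ToyTopic` and `ToyConsumerGroup`, the CRC-based key routing, and the round-robin partition assignment are all assumptions made for illustration.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a topic: one append-only log per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, message):
        # Hypothetical routing rule: a stable hash of the key picks the partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(message)
        return index


class ToyConsumerGroup:
    """Keeps one read offset per partition, shared by the group's members."""

    def __init__(self, topic, members):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)
        # Round-robin assignment: every partition is owned by exactly one member.
        self.assignment = {member: [] for member in members}
        for p in range(len(topic.partitions)):
            self.assignment[members[p % len(members)]].append(p)

    def poll(self, member):
        """Return the unread messages from the partitions owned by `member`."""
        batch = []
        for p in self.assignment[member]:
            log = self.topic.partitions[p]
            batch.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return batch

    def rewind(self, partition, offset=0):
        """Move the read position back; the stored log is never modified."""
        self.offsets[partition] = offset
```

Because each group keeps its own offsets, a slow batch consumer and a fast streaming consumer can read the same persisted log independently, and rewinding one group does not affect the other.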
Kafka overview
• More overview pictures:
Camus overview
• Pipeline out of Kafka to HDFS
• Automatic discovery of topics and partitions
• Finds the latest offsets from the Kafka nodes
• Uses Avro by default; option to use your own Decoder
• Allocates topic pulls among a set number of Hadoop job tasks
• Moves data files to HDFS directories according to timestamp
• Remembers the last offset per topic
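Two of the steps above, spreading topic pulls over a fixed number of tasks and filing output by timestamp, can be sketched as follows. This is not Camus code: the function names, the `base/topic/hourly/YYYY/MM/DD/HH` directory layout, and the greedy "largest pull first" balancing rule are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def hourly_directory(base_dir, topic, timestamp_ms):
    """Build an hourly output directory (UTC) from an event timestamp.

    The directory layout is a hypothetical example, not a documented one.
    """
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{base_dir}/{topic}/hourly/{ts:%Y/%m/%d/%H}"


def allocate_pulls(pending, num_tasks):
    """Spread (topic, partition) pulls over a fixed number of tasks.

    `pending` maps (topic, partition) -> number of unread messages.
    Pulls are placed largest first, each on the currently lightest task.
    """
    tasks = [{"load": 0, "pulls": []} for _ in range(num_tasks)]
    ordered = sorted(pending.items(), key=lambda item: item[1], reverse=True)
    for pull, size in ordered:
        lightest = min(tasks, key=lambda task: task["load"])
        lightest["pulls"].append(pull)
        lightest["load"] += size
    return tasks
```

For instance, an event stamped 1362139200000 ms (2013-03-01 12:00 UTC) on a topic named `ui_events` would be filed under `/data/ui_events/hourly/2013/03/01/12` when the base directory is `/data`.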
Infrastructure
• Kafka 0.7.2
• 3 nodes
• Benchmark tool that takes message size, number of threads, number of messages, topic name, and data encoding
• CDH 4.2
• 1 NN, 1 SNN, 3 slaves for Hadoop
• Camus
• JSON or Avro decoder
• Zookeeper
• Hive
Infrastructure
• 8 Amazon EC2 large instances
• Dual-core 2.0 GHz
• One 7200 rpm SATA drive
• 8 GB memory
• 200-byte messages
• 1 producer, 1 consumer
Customer architecture
[Diagram: event sources (Gaming, Shopping, Invite friends) publish to two Kafka topics]
• Kafka topic: Data events (i.e. user profile registrations)
• Kafka topic: UI events (i.e. game interaction)
• Consume topics via Camus every hour
• Use Hive to analyze the data
Performance summary
• Producer:
• Avg 20,000 messages / sec
• 3.81 MB per sec
• Consumer:
• 16,600 messages / sec
• 3.17 MB per sec (about 190 MB/min, roughly 11 GB/hr)
• Customer goal: "want to scale to 5000 events per second at peak."
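As a sanity check, the throughput figures can be reproduced from the message rates. The sketch below assumes the 200-byte message size from the infrastructure slide and binary megabytes (1 MB = 1024 * 1024 bytes), an assumption that happens to match the 3.81 and 3.17 MB per sec figures. Under those assumptions the consumer rate works out to roughly 190 MB per minute, or about 11 GB per hour, and the measured consumer rate is a little over three times the customer's peak goal.

```python
MESSAGE_BYTES = 200         # message size taken from the infrastructure slide
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def mb_per_sec(messages_per_sec, message_bytes=MESSAGE_BYTES):
    """Convert a message rate into megabytes per second."""
    return messages_per_sec * message_bytes / BYTES_PER_MB


producer = mb_per_sec(20_000)                  # about 3.81 MB per sec
consumer = mb_per_sec(16_600)                  # about 3.17 MB per sec
consumer_mb_per_min = consumer * 60            # about 190 MB per minute
consumer_gb_per_hour = consumer * 3600 / 1024  # about 11.1 GB per hour
headroom = 16_600 / 5_000                      # consumer rate vs. 5000 events/sec goal
```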
Performance benchmark
Data size input     Data type       Storage size on HDFS (bytes)  Hive count (sec)  Hive max (sec)  Camus run time  Kafka
500000 records      JSON text data  103779151                     38.3              59              46 sec          34.2
500000 records      JSON Serde      103779151                     46.3              48.2            46 sec          34.2
500000 records      Avro data       60962022                      25.2              29.3            54 sec          15.9
1 Million records   JSON text data  416556931                     27.582            50.889          1 min           40.56
1 Million records   JSON Serde      416556931                     39.428            32.305          -               40.56
1 Million records   Avro data       122041553                     35.806            26.328          1 min           22.36
7 Million records   JSON text data  1456636071                    57.895            111.598         3 min 50 sec    388
7 Million records   JSON Serde      1456636071                    83.225            83.776          3 min 50 sec    388
7 Million records   Avro data       866962131                     60.63             62.896          4 min 50 sec    181
10 Million records  JSON text data  1919381181                    78.337            144.667         5 min 1 sec     558
10 Million records  JSON Serde      1919381181                    103.4             110             5 min 1 sec     558
10 Million records  Avro data       1239446765                    87.042            90.958          7 min 23 sec    230
15 Million records  JSON text data  3157886975                    107.325           201.125         6 min 24 sec    851
15 Million records  JSON Serde      3157886975                    141.345           153.365         -               851
15 Million records  Avro data       1865267728                    96.9              98.9            8 min 26 sec    377
20 Million records  JSON text data  -                             -                 -               -               1159
20 Million records  JSON Serde      -                             -                 -               -               1159
20 Million records  Avro data       2476833359                    133.606           153.464         11 min 2 sec    234
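The HDFS storage sizes can be compared directly. The sketch below copies the JSON and Avro byte counts from the benchmark figures for the runs where both are reported, and computes how large the Avro output is relative to JSON. The dictionary keys and the function name are illustrative only. In every run listed, the Avro output is smaller than the JSON output.

```python
# HDFS storage sizes in bytes, copied from the benchmark: (JSON, Avro).
STORAGE_BYTES = {
    "500000": (103779151, 60962022),
    "1M": (416556931, 122041553),
    "7M": (1456636071, 866962131),
    "10M": (1919381181, 1239446765),
    "15M": (3157886975, 1865267728),
}


def avro_to_json_ratio(json_bytes, avro_bytes):
    """Fraction of the JSON size that the Avro output occupies."""
    return avro_bytes / json_bytes


ratios = {
    run: avro_to_json_ratio(json_bytes, avro_bytes)
    for run, (json_bytes, avro_bytes) in STORAGE_BYTES.items()
}

for run, ratio in ratios.items():
    print(f"{run}: Avro is {ratio:.0%} of the JSON size")
```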
Kafka Speed Performance benchmark
Kafka           500000      1 Million   7 Million   10 Million  15 Million  20 Million
JSON text data  34.2        40.56       388         558         851         1159
JSON Serde      34.2        40.56       388         558         851         1159
Avro data       15.9        22.36       181         230         377         534
[Chart: Kafka comparison by record count for JSON text data, JSON Serde, and Avro data]
Camus Speed Performance benchmark (run time in seconds)
Camus           500000      1 Million   7 Million   10 Million  15 Million  20 Million
JSON text data  46          60          230         301         384         -
JSON Serde      46          60          230         301         384         -
Avro data       54          85          290         443         506         662
[Chart: Camus comparison by record count for JSON text data, JSON Serde, and Avro data]
Count Speed Performance (Hive count, in sec)
Count           500000      1 Million   7 Million   10 Million  15 Million  20 Million
JSON text data  38.3        27.58       57.89       78.337      107.325     -
JSON Serde      46.3        39.42       83.2        103.4       141.345     -
Avro data       25.2        35.8        60.6        87.042      96.9        133.606
[Chart: Select Count(*) comparison by record count for JSON text data, JSON Serde, and Avro data]
Max Speed Performance (Hive max, in sec)
Max             500000      1 Million   7 Million   10 Million  15 Million  20 Million
JSON text data  59          50.889      111.598     144.667     201.125     -
JSON Serde      48.2        32.305      83.776      110         153.365     -
Avro data       29.3        26.328      62.896      90.958      98.9        153.464
[Chart: Max(field) comparison by record count for JSON text data, JSON Serde, and Avro data]
Q&A
Thank You
Architecture of a Kafka camus infrastructure

  • 1. © 2013 Impetus Technologies - Confidential. Kafka/Camus Project, Phase I. Mountain View, CA, March 2013 (photos courtesy of LinkedIn)
  • 2. Agenda
      • Objective
      • What tool to use?
      • Kafka & Camus overview
      • Infrastructure
      • Architecture
      • Performance benchmarks
  • 3. Objective
      • Customer has events (Data, UI) that happen in real time and need to be analyzed
      • Immediate need for a batch-oriented mechanism
      • Events need to be ETL'ed and analyzed in Hadoop
      • Future need for more real-time stream analysis
      • Potential bursts of streaming data
  • 4. What tool to use?
      • JMS:
      • just an API
      • Not cross language
      • Painful
      • Doesn't scale
      • ActiveMQ
      • Didn't work for LinkedIn:
      • http://sites.computer.org/debull/A12june/pipeline.pdf
      • Apache Flume
  • 5. Kafka overview
      • Distributed, scalable pub/sub system for big data
      • Producer -> Broker -> Consumer of message topics
      • Can have multiple clients consuming at different velocities (synchronous/asynchronous)
      • Notion of consumer group to parallelize consumption of messages
      • Persists messages, so consumers can rewind
  • 6. Kafka overview
      • More overview pictures:
  • 7. Camus overview
      • Pipeline out of Kafka to HDFS
      • Automatic discovery of topics and partitions
      • Finds latest offsets from Kafka nodes
      • Uses Avro by default; option to use your own Decoder
      • Allocates topic pulls among a set # of Hadoop job tasks
      • Moves data files to HDFS directories according to timestamp
      • Remembers last offset / topic
  • 8. Infrastructure
      • Kafka 0.7.2
      • 3 nodes
      • Benchmark tool to issue message size, # of threads, # of messages, topic name, data encoding
      • CDH 4.2
      • 1 NN, 1 SNN, 3 slaves for Hadoop
      • Camus
      • JSON or Avro decoder
      • ZooKeeper
      • Hive
  • 9. Infrastructure
      • 8 Amazon EC2 large instances
      • Dual core 2.0 GHz
      • 1 7200 rpm SATA drive
      • 8 GB memory
      • 200-byte messages
      • 1 producer, 1 consumer
  • 10. Customer architecture (diagram labels)
      • Gaming
      • Shopping
      • Invite friends
      • Kafka topic: Data events (e.g. user profile registrations)
      • Kafka topic: UI events (e.g. game interaction)
      • Consume topics via Camus every hour
      • Use Hive to analyze the data
  • 11. Performance summary
      • Producer: avg 20,000 messages/sec, 3.81 MB per sec
      • Consumer: 16,600 messages/sec, 3.17 MB per sec -> 190 MB/min (about 11 GB/hr)
      • Customer goal: “want to scale to 5000 events per second at peak.”
  • 12. Performance benchmark
      | data size input | Data type | Storage size on HDFS (in bytes) | Hive Count (in sec) | Hive max (in sec) | Camus run time | Kafka |
      |---|---|---|---|---|---|---|
      | 500000 records | JSON text data | 103779151 | 38.3 | 59 | 46 seconds | 34.2 |
      | 500000 records | JSON Serde | 103779151 | 46.3 | 48.2 | 46 seconds | 34.2 |
      | 500000 records | Avro data | 60962022 | 25.2 | 29.3 | 54 seconds | 15.9 |
      | 1 Million records | JSON text data | 416556931 | 27.582 | 50.889 | 1 minute | 40.56 |
      | 1 Million records | JSON Serde | 416556931 | 39.428 | 32.305 | | 40.56 |
      | 1 Million records | Avro data | 122041553 | 35.806 | 26.328 | 1 minute | 22.36 |
      | 7 Million records | JSON text data | 1456636071 | 57.895 | 111.598 | 3 minutes 50 seconds | 388 |
      | 7 Million records | JSON Serde | 1456636071 | 83.225 | 83.776 | 3 minutes 50 seconds | 388 |
      | 7 Million records | Avro data | 866962131 | 60.63 | 62.896 | 4 minutes 50 seconds | 181 |
      | 10 Million records | JSON text data | 1919381181 | 78.337 | 144.667 | 5 minutes 1 second | 558 |
      | 10 Million records | JSON Serde | 1919381181 | 103.4 | 110 | 5 minutes 1 second | 558 |
      | 10 Million records | Avro data | 1239446765 | 87.042 | 90.958 | 7 minutes 23 seconds | 230 |
      | 15 Million records | JSON text data | 3157886975 | 107.325 | 201.125 | 6 minutes 24 seconds | 851 |
      | 15 Million records | JSON Serde | 3157886975 | 141.345 | 153.365 | | 851 |
      | 15 Million records | Avro data | 1865267728 | 96.9 | 98.9 | 8 minutes 26 seconds | 377 |
      | 20 Million records | JSON text data | | | | | 1159 |
      | 20 Million records | JSON Serde | | | | | 1159 |
      | 20 Million records | Avro data | 2476833359 | 133.606 | 153.464 | 11 minutes 2 seconds | 234 |
  • 14. Kafka Speed Performance benchmark
      | Kafka | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
      |---|---|---|---|---|---|---|
      | JSON text data | 34.2 | 40.56 | 388 | 558 | 851 | 1159 |
      | JSON Serde | 34.2 | 40.56 | 388 | 558 | 851 | 1159 |
      | Avro data | 15.9 | 22.36 | 181 | 230 | 377 | 534 |
      Chart: Kafka comparison (series: JSON text data, JSON Serde, Avro data; x-axis: number of records)
  • 15. Camus Speed Performance benchmark
      | Camus (in sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
      |---|---|---|---|---|---|---|
      | JSON text data | 46 | 60 | 230 | 301 | 384 | |
      | JSON Serde | 46 | 60 | 230 | 301 | 384 | |
      | Avro data | 54 | 85 | 290 | 443 | 506 | 662 |
      Chart: Camus comparison (series: JSON text data, JSON Serde, Avro data; x-axis: number of records)
  • 16. Count Speed Performance
      | Count (in sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
      |---|---|---|---|---|---|---|
      | JSON text data | 38.3 | 27.58 | 57.89 | 78.337 | 107.325 | |
      | JSON Serde | 46.3 | 39.42 | 83.2 | 103.4 | 141.345 | |
      | Avro data | 25.2 | 35.8 | 60.6 | 87.042 | 96.9 | 133.606 |
      Chart: Select Count(*) comparison (series: JSON text data, JSON Serde, Avro data; x-axis: number of records)
  • 17. Max Speed Performance
      Chart: Max(field) comparison (series: JSON text data, JSON Serde, Avro data; x-axis: number of records)
      | Max (in sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
      |---|---|---|---|---|---|---|
      | JSON text data | 59 | 50.889 | 111.598 | 144.667 | 201.125 | |
      | JSON Serde | 48.2 | 32.305 | 83.776 | 110 | 153.365 | |
      | Avro data | 29.3 | 26.328 | 62.896 | 90.958 | 98.9 | 153.464 |
  • 18. Q&A. Thank You
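
The throughput figures on the performance summary slide can be cross-checked against the 200-byte message size listed on the infrastructure slide. The following is a minimal sketch, assuming that "MB" on the slides means 2^20 bytes (an assumption; it is the reading under which the per-second figures follow from the message rates). The helper name is illustrative and not part of the deck.

```python
# Cross-check of the performance summary (slides 9 and 11).
# Assumption: "MB" on the slides means 2**20 bytes; the helper name is illustrative.

MESSAGE_SIZE_BYTES = 200   # slide 9: 200-byte messages
MIB = 2 ** 20


def throughput_mb_per_sec(messages_per_sec, message_size=MESSAGE_SIZE_BYTES):
    """Convert a message rate into MB/s for fixed-size messages."""
    return messages_per_sec * message_size / MIB


producer = throughput_mb_per_sec(20_000)   # slide 11: avg 20,000 messages/sec
consumer = throughput_mb_per_sec(16_600)   # slide 11: 16,600 messages/sec

print(f"producer: {producer:.2f} MB/s")
print(f"consumer: {consumer:.2f} MB/s")
print(f"consumer: {consumer * 60:.0f} MB/min, {consumer * 3600 / 1024:.1f} GB/hr")

# Headroom over the customer's stated peak of 5,000 events per second.
customer_peak = 5_000
print(f"producer headroom: {20_000 / customer_peak:.1f}x")
print(f"consumer headroom: {16_600 / customer_peak:.2f}x")
```

Under these assumptions the consumer rate works out to roughly 190 MB per minute, or about 11 GB per hour, and both the producer and the consumer rates are more than three times the customer's stated peak of 5,000 events per second.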

Editor's Notes

  • #5: ActiveMQ: if the queue backed up beyond what could be kept in memory, performance would severely degrade due to heavy amounts of random I/O. Flume: Flume is a distributed, reliable, and available service for moving large amounts of log data. It accepts streaming data flows and can store the data temporarily. Very fast.
  • #6: Assumes everything (all the layers) is distributed and can be started at any given time, with no master node (scalability). It's all synced and coordinated by ZooKeeper. Kafka acts as a buffer between live activity and asynchronous processing. It was built for high throughput by LinkedIn. It provides a single pipeline of data for both online and offline consumers, and is well suited for situations where you need to process data in real time while still being able to analyze it in bulk via MapReduce later on.
  • #8: Camus is LinkedIn's Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. The setup stage fetches available topics and partitions from ZooKeeper and the latest offsets from the Kafka nodes. At startup time the job reads its current offset for each partition from a file in HDFS and queries Kafka to discover any new topics and read the current log offset for each partition. It then loads all data from the last load offset to the current Kafka offset and writes it out to Hadoop,
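
The load cycle described in the Camus note (read the last stored offset per partition, ask Kafka for the current offset, pull the range in between, then remember the new offset) can be sketched in a few lines. This is a toy in-memory model, not Camus code: `plan_pulls`, `assign_to_tasks`, the topic names, and the dictionaries standing in for the HDFS offset file and the Kafka brokers are all hypothetical, and the greedy balancing across tasks is one simple choice made for this sketch rather than a description of how Camus actually allocates pulls.

```python
# Toy model of one Camus-style load cycle. Not Camus code: every name here is
# hypothetical, and plain dicts stand in for the HDFS offset file and the brokers.


def plan_pulls(last_offsets, current_offsets):
    """Return {(topic, partition): (start, end)} for partitions with new data.

    Partitions missing from last_offsets are treated as newly discovered and
    start from offset 0 (an assumption made for this sketch).
    """
    pulls = {}
    for key, end in current_offsets.items():
        start = last_offsets.get(key, 0)
        if end > start:
            pulls[key] = (start, end)
    return pulls


def assign_to_tasks(pulls, num_tasks):
    """Spread pulls over a fixed number of tasks: largest pull first, least-loaded task next."""
    tasks = [[] for _ in range(num_tasks)]
    load = [0] * num_tasks
    by_size = sorted(pulls.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True)
    for key, (start, end) in by_size:
        i = load.index(min(load))   # least-loaded task so far
        tasks[i].append((key, start, end))
        load[i] += end - start
    return tasks


# Offsets remembered from the previous run (stand-in for the file in HDFS).
last_offsets = {("data_events", 0): 100, ("ui_events", 0): 500}
# Offsets reported by the brokers now; ("ui_events", 1) is a new partition.
current_offsets = {("data_events", 0): 100, ("ui_events", 0): 900, ("ui_events", 1): 250}

pulls = plan_pulls(last_offsets, current_offsets)
tasks = assign_to_tasks(pulls, num_tasks=2)
for i, task in enumerate(tasks):
    print(f"task {i}: {task}")

# After a successful load, the end offsets become the next run's starting point.
last_offsets.update({key: end for key, (start, end) in pulls.items()})
print(last_offsets)
```

Because the stored offsets only advance after a load completes, a failed run in this model can simply be repeated from the same starting offsets, which is the practical benefit of Kafka persisting messages instead of discarding them on delivery.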