Introduction to
Apache Kafka
Thessaloniki not-only-Java Meetup
Dimitris Kontokostas @Diffbot
About me...
Data geek,
Software engineer
Open source enthusiast,
Remote Worker
Currently working @ Diffbot
Summary
● The Log
● Kafka 101
● KStream
● KSQL
● Connectors
● Schemas
Let’s keep this interactive...
Kafka is
A distributed streaming platform
A distributed publish-subscribe messaging system/queue
A distributed, immutable event logging data store (*)
The Log
What every software engineer should know about real-time
data's unifying abstraction
A great post by Jay Kreps (creator of Kafka)
Parts of the post are used in this presentation
What is a log
- Records events
- What happened & when
- Append only, left-to-right
- Timestamps
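The properties above can be sketched in a few lines of plain Java. This is only an illustration, not how Kafka stores data; the class name AppendOnlyLog and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class AppendOnlyLog {

    /** One immutable log entry: its position, when it was recorded, and what happened. */
    static final class Entry {
        final long offset;
        final long timestampMillis;
        final String event;

        Entry(long offset, long timestampMillis, String event) {
            this.offset = offset;
            this.timestampMillis = timestampMillis;
            this.event = event;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Appends an event at the end of the log and returns its offset. */
    long append(long timestampMillis, String event) {
        long offset = entries.size();
        entries.add(new Entry(offset, timestampMillis, event));
        return offset;
    }

    /** Returns a read-only view of all entries from the given offset onwards. */
    List<Entry> readFrom(long offset) {
        return Collections.unmodifiableList(entries.subList((int) offset, entries.size()));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        log.append(1000L, "user 1 signed up");
        log.append(2000L, "user 1 changed email");
        for (Entry e : log.readFrom(0)) {
            System.out.println(e.offset + " @" + e.timestampMillis + ": " + e.event);
        }
    }
}
```

There is no update or delete method: existing entries never change, and readers only choose the offset they start from.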
Logs in databases
- As internal structures to facilitate ACID
- Gradually getting more exposed
- As a data subscription mechanism
Logs in distributed systems
Ordering
Replication
Active-active vs Active-passive
- Active-active: log the operations, e.g. “+5”, “-2”, “*4”
- Active-passive: log the resulting states, e.g. “0”, “5”, “3”, “12”
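The two lists on this slide are the same history written in two ways. Following Jay Kreps's post, the first is a log of operations that every replica applies itself, and the second is a log of states that a leader computes and the others copy. A small sketch, with hypothetical class and method names, that derives one from the other:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReplicationLogs {

    /** Applies one logged operation such as "+5" or "*4" to the current value. */
    static long apply(long value, String op) {
        long operand = Long.parseLong(op.substring(1));
        switch (op.charAt(0)) {
            case '+': return value + operand;
            case '-': return value - operand;
            case '*': return value * operand;
            default: throw new IllegalArgumentException("Unknown operation: " + op);
        }
    }

    /** Replays an operation log from an initial value and records every state along the way. */
    static List<Long> statesFrom(long initial, List<String> operations) {
        List<Long> states = new ArrayList<>();
        long value = initial;
        states.add(value);
        for (String op : operations) {
            value = apply(value, op);
            states.add(value);
        }
        return states;
    }

    public static void main(String[] args) {
        // Log of operations: every replica applies them in the same order.
        List<String> operations = Arrays.asList("+5", "-2", "*4");
        // Log of states: what a leader would compute and ship to its followers.
        System.out.println(statesFrom(0, operations)); // prints [0, 5, 3, 12]
    }
}
```

Either way, replicas agree only if they see the entries in the same order, which is why ordering comes first on this slide.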
Events & Tables
Two sides of the same coin
events & tables
1. User_add {id:1, email: “joh@doe.com”}
2. User_change {id:1, email: “john@doe.com”}
3. User_add {id:2, email: “mary@doe.com”}
4. User_change {id:2, email: “mary@doe.gr”}
5. User_add {id:3, email: “peter@doe.com”}
Table User
id email
1 ?
2 ?
3 ?
events & tables
1. User_add {id:1, email: “joh@doe.com”}
2. User_change {id:1, email: “john@doe.com”}
3. User_add {id:2, email: “mary@doe.com”}
4. User_change {id:2, email: “mary@doe.gr”}
5. User_add {id:3, email: “peter@doe.com”}
Table User
id email
1 john@doe.com
2 mary@doe.gr
3 peter@doe.com
events & tables
1. User_add {id:1, email: “joh@doe.com”}
2. User_change {id:1, email: “john@doe.com”}
3. User_add {id:2, email: “mary@doe.com”}
4. User_change {id:2, email: “mary@doe.gr”}
5. User_add {id:3, email: “peter@doe.com”}
6. User_remove {id:1}
Table User
id email
1 john@doe.com (removed by event 6)
2 mary@doe.gr
3 peter@doe.com
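The table on these slides is what you get by replaying the events in order. A minimal sketch of that replay in plain Java; the class EventLogReplay and its Event type are hypothetical and only mirror the six events listed above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EventLogReplay {

    /** One log entry: the event type, the user id, and (optionally) an email. */
    static final class Event {
        final String type;
        final int id;
        final String email; // null for User_remove

        Event(String type, int id, String email) {
            this.type = type;
            this.id = id;
            this.email = email;
        }
    }

    /** Folds the log, in order, into the current state of the User table (id -> email). */
    static Map<Integer, String> replay(List<Event> log) {
        Map<Integer, String> table = new LinkedHashMap<>();
        for (Event e : log) {
            switch (e.type) {
                case "User_add":
                case "User_change":
                    table.put(e.id, e.email);
                    break;
                case "User_remove":
                    table.remove(e.id);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown event type: " + e.type);
            }
        }
        return table;
    }

    /** The six events from the slides. */
    static List<Event> slideEvents() {
        return Arrays.asList(
            new Event("User_add", 1, "joh@doe.com"),
            new Event("User_change", 1, "john@doe.com"),
            new Event("User_add", 2, "mary@doe.com"),
            new Event("User_change", 2, "mary@doe.gr"),
            new Event("User_add", 3, "peter@doe.com"),
            new Event("User_remove", 1, null));
    }

    public static void main(String[] args) {
        System.out.println(replay(slideEvents().subList(0, 5)));
        // prints {1=john@doe.com, 2=mary@doe.gr, 3=peter@doe.com}
        System.out.println(replay(slideEvents()));
        // prints {2=mary@doe.gr, 3=peter@doe.com}
    }
}
```

The table can always be rebuilt from the log, but not the other way round: the typo in the first email and the number of edits per user exist only in the events.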
events & tables
DB
?
I have a cool idea!!!
Report users with more than 2
email edits per day as suspicious
What if we also did X...
Decoupling from the table
Tables are opinionated views of your data
The later you can defer your opinion, the better...
LinkedIn before Kafka
LinkedIn after Kafka
Kafka Basics
Basic terminology
Kafka maintains feeds of messages in categories called topics
Kafka producers are processes that publish messages to a Kafka topic
Kafka consumers are processes that subscribe to topics and process the feed of
published messages
Messages can be in any form: text, JSON, binary, etc.
Kafka brokers are the servers that comprise the Kafka cluster
Topics & Logs
1 Topic ≈ 1 Partitioned Log
(with configurable retention period)
What is the difference from “The Log”?
Ordering: guaranteed only within a partition, not across the whole topic
Kafka Messages
Anything can be a message
=> Strings, JSON, XML, binary formats
Messages are preferably small (< a few KB)
=> Big messages affect brokers & throughput
=> Reference big messages as external resources
Message IDs (keys) determine the partition in which a message is stored
=> Default behavior / can be manually overridden
[Diagram: messages 1 to 5 in sequence, each made of a Msg ID and a Body]
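How a message ID can pick a partition is easy to sketch: hash the ID and take the result modulo the number of partitions. The version below is a simplification under stated assumptions: it hashes with Arrays.hashCode, whereas Kafka's default partitioner uses a murmur2 hash of the serialized key, and the class name KeyPartitioner is made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyPartitioner {

    /**
     * Maps a message key to a partition index in [0, numPartitions).
     * Simplification: Arrays.hashCode stands in for Kafka's murmur2 hash.
     */
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3; // hypothetical topic with three partitions
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

The useful property is that the same ID always lands in the same partition, so all messages for one user stay in order relative to each other.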
Producers & Consumers
Message Delivery Semantics
=> At least once
=> At most once
=> Exactly once
Multiple (independent) Consumers can read from the same topic
=> Consumers manage their own offset (stored on Kafka)
=> Messages remain on Kafka
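The difference between the first two delivery semantics comes down to whether a consumer commits its offset before or after processing a message. The simulation below is a sketch in plain Java with no Kafka client involved; the class DeliverySemantics and the single simulated crash are assumptions made for illustration. Exactly-once needs additional mechanisms and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeliverySemantics {

    /**
     * Consumes the log with one simulated crash at index crashAt and returns what was processed.
     * commitFirst = true  -> commit, then process: at most once (the crashed message is lost).
     * commitFirst = false -> process, then commit: at least once (the crashed message is repeated).
     */
    static List<String> consume(List<String> log, boolean commitFirst, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;       // next offset to read, as remembered outside the consumer
        boolean crashed = false; // the consumer crashes only once
        while (committed < log.size()) {
            int offset = committed; // a (re)started consumer resumes from the committed offset
            boolean crashNow = !crashed && offset == crashAt;
            if (commitFirst) {
                committed = offset + 1;
                if (crashNow) { crashed = true; continue; } // crash before processing
                processed.add(log.get(offset));
            } else {
                processed.add(log.get(offset));
                if (crashNow) { crashed = true; continue; } // crash before committing
                committed = offset + 1;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("m0", "m1", "m2");
        System.out.println(consume(log, true, 1));  // prints [m0, m2]
        System.out.println(consume(log, false, 1)); // prints [m0, m1, m1, m2]
    }
}
```

Because messages remain on Kafka after being read, the at-least-once consumer can simply re-read from its last committed offset after a restart.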
Consumer Scaling
Every Consumer has a Consumer ID (the group.id setting)
=> Kafka keeps track of the offsets
=> a rejoining consumer continues from the last committed offset
Consumers with the same ID
belong to the same Consumer Group
Consumers from the same group
read messages from different partitions
=> The number of topic partitions is the hard limit on parallelism
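Why the partition count caps consumer scaling can be seen by spreading partitions over the members of one group. The round-robin scheme below is a simplified illustration, not Kafka's actual assignment strategy, and the class name GroupAssignment is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class GroupAssignment {

    /**
     * Spreads partitions 0..numPartitions-1 over the consumers of one group, round-robin.
     * Returns one list of partition numbers per consumer; each partition has exactly one owner.
     */
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        if (numConsumers <= 0) {
            throw new IllegalArgumentException("A group needs at least one consumer");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignment.add(new ArrayList<Integer>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(p % numConsumers).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(4, 2)); // prints [[0, 2], [1, 3]]
        System.out.println(assign(2, 3)); // prints [[0], [1], []]
    }
}
```

In the second call the third consumer gets nothing to read: adding consumers beyond the number of partitions does not add throughput.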
KStreams/KSQL
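import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;

public class KafkaConsumerExample {
    ...
    static void runConsumer() throws InterruptedException {
        // createConsumer() is elided on the slide; see the linked article
        final Consumer<Long, String> consumer = createConsumer();
        try {
            while (true) {
                // Wait up to one second for new records
                final ConsumerRecords<Long, String> consumerRecords = consumer.poll(Duration.ofSeconds(1));
                consumerRecords.forEach(record ->
                    System.out.printf("Consumer Record:(%d, %s, %d, %d)%n",
                        record.key(), record.value(), record.partition(), record.offset()));
                // Commit the offsets returned by the last poll
                consumer.commitAsync();
            }
        } finally {
            // Reached when poll() or the processing throws
            consumer.close();
        }
    }
}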
https://dzone.com/articles/writing-a-kafka-consumer-in-java
Print topic messages (Consumer API)
static void createWordCountStream(final StreamsBuilder builder) {
    final KStream<String, String> textLines = builder.stream(inputTopic);
    // KStream#foreach is a terminal operation that receives both key and value
    textLines.foreach((key, value) -> System.out.println(value));
}
Print topic messages (KStream API)
KStream API
Abstraction on top of classic Kafka API
Provide a Java Stream-like API on Kafka
=> has a source, sink & stream processors
=> aggregations, windowing, KTables / state
Source processor
=> consumes one or more topics, forwards downstream
Stream processors
=> apply transformations, aggregations, etc
Sink Processors
=> final processors, store results in a topic or KTable
KTables
Read-only & fault-tolerant state store
Can be in memory (on consumer) or global (persistent on the cluster)
RocksDB is a default implementation
Example: keep an up-to-date table of [user id, email]
Aggregations & Windowing
Example: Users with more than 2 email changes in 24 hours
// Assumption: views is a KStream of email-change events keyed by user id
final KTable<Windowed<String>, Long> anomalousUsers = views
    .groupByKey()
    // One-day tumbling windows
    .windowedBy(TimeWindows.of(Duration.ofDays(1)))
    .count()
    // "More than 2" changes, so strictly greater than 2
    .filter((windowedUserId, count) -> count > 2);
final KStream<String, Long> anomalousUsersForConsole = anomalousUsers
    .toStream()
    // Rows dropped by the KTable filter arrive as null values (tombstones)
    .filter((windowedUserId, count) -> count != null)
    .map((windowedUserId, count) -> new KeyValue<>(windowedUserId.toString(), count));
anomalousUsersForConsole.to("AnomalousUsers", Produced.with(stringSerde, longSerde));
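What the windowed count computes can be reproduced without Kafka. The sketch below groups email-change events by user and by one-day tumbling window, then keeps the groups with more than 2 changes. It is a plain-Java illustration only: the class WindowedCount, the Change type, and the "userId@windowStart" key format are made up for this example, and timestamps are assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

class WindowedCount {

    static final long WINDOW_MILLIS = TimeUnit.DAYS.toMillis(1);

    /** A single email-change event: who changed their email, and when. */
    static final class Change {
        final String userId;
        final long timestampMillis;

        Change(String userId, long timestampMillis) {
            this.userId = userId;
            this.timestampMillis = timestampMillis;
        }
    }

    /**
     * Counts changes per (user, one-day tumbling window) and keeps only the
     * entries with more than threshold changes. Keys look like "userId@windowStart".
     */
    static Map<String, Long> suspicious(List<Change> changes, long threshold) {
        Map<String, Long> counts = new TreeMap<>();
        for (Change c : changes) {
            // Every timestamp falls into exactly one window, identified by its start time
            long windowStart = (c.timestampMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
            counts.merge(c.userId + "@" + windowStart, 1L, Long::sum);
        }
        counts.values().removeIf(count -> count <= threshold);
        return counts;
    }

    public static void main(String[] args) {
        long hour = TimeUnit.HOURS.toMillis(1);
        List<Change> changes = Arrays.asList(
            new Change("1", 1 * hour), new Change("1", 2 * hour), new Change("1", 3 * hour),
            new Change("2", 1 * hour), new Change("2", 25 * hour), new Change("2", 26 * hour));
        System.out.println(suspicious(changes, 2)); // prints {1@0=3}
    }
}
```

User 2 also made three changes, but they fall into two different windows, so neither window crosses the threshold.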
KSQL
CREATE TABLE users_view AS
SELECT * FROM USERS;
CREATE TABLE suspicious_users AS
SELECT id, count(*)
FROM USERS
WINDOW TUMBLING (SIZE 1 DAY)
GROUP BY id
HAVING count(*) > 2;
Disclaimer: most probably has typos !!!
Connectors & Schema Registries
Big library of source & sink Kafka connectors
=> S3, most DBs, ES, HDFS, other streaming frameworks
=> e.g. write this topic to MySQL table X
=> Kafka Connect API for writing a custom one
Schemas always change, not only in RDBMS :)
=> Message body freedom has a cost
=> Schema registries come to the… hmm... help…
=> data validation, de/serialization, backwards compatibility
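Backwards compatibility usually means that a reader with a newer schema can still read messages written with an older one, typically by falling back to default values for fields that did not exist yet. The sketch below illustrates the idea with plain maps. It is not the API of any schema registry or serialization format; the class SchemaEvolution and the "country" field are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaEvolution {

    /**
     * Reads a message using a reader schema. The schema maps each field name to its
     * default value, or to null when the field is required and has no default.
     */
    static Map<String, String> read(Map<String, String> message, Map<String, String> readerSchema) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : readerSchema.entrySet()) {
            String value = message.get(field.getKey());
            if (value == null) {
                value = field.getValue(); // fall back to the schema default
            }
            if (value == null) {
                throw new IllegalArgumentException("Missing required field: " + field.getKey());
            }
            result.put(field.getKey(), value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical v2 schema: "email" is required, "country" was added later with a default
        Map<String, String> schemaV2 = new LinkedHashMap<>();
        schemaV2.put("email", null);
        schemaV2.put("country", "unknown");

        // A message written before "country" existed is still readable
        Map<String, String> oldMessage = Collections.singletonMap("email", "john@doe.com");
        System.out.println(read(oldMessage, schemaV2)); // prints {email=john@doe.com, country=unknown}
    }
}
```

A message without an email is rejected, which is the data-validation half of the same check.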
Summary
Logs are everywhere
Logs help us defer data schema decisions
Kafka is an event logging data store
Has a great ecosystem of libraries & integrations
Thank you for your attention
Questions?
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
"Rebranding for Growth", Anna Velykoivanenko
"Rebranding for Growth", Anna Velykoivanenko"Rebranding for Growth", Anna Velykoivanenko
"Rebranding for Growth", Anna Velykoivanenko
Fwdays
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Big Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur MorganBig Data Analytics Quick Research Guide by Arthur Morgan
Big Data Analytics Quick Research Guide by Arthur Morgan
Arthur Morgan
 

Introduction to Apache Kafka

  • 1. Introduction to Apache Kafka Thessaloniki not-only-Java Meetup Dimitris Kontokostas @Diffbot
  • 2. About me... Data geek, Software engineer Open source enthusiast, Remote Worker Currently working @ Diffbot
  • 3. Summary ● The Log ● Kafka 101 ● KStream ● KSQL ● Connectors ● Schemas Let’s keep this interactive...
  • 4. Kafka is A distributed streaming platform A distributed publish-subscribe messaging system/queue A distributed, immutable event logging data store (*)
  • 5. The Log What every software engineer should know about real-time data's unifying abstraction A great post by Jay Kreps (creator of Kafka) Parts of the post are used in this presentation
  • 6. What is a log - Records events - What happened & when - Append only, left-to-right - Timestamps
  • 7. Logs in databases - As internal structures to facilitate ACID - Gradually getting more exposed - As a data subscription mechanism
  • 8. Logs in distributed systems Ordering Replication Active-active vs Active-passive - E.g. “+5”, “-2”, “*4” - Vs “0”, “5”, “3”, “12”
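The two example logs on this slide can be related with a small plain-Java sketch (no Kafka involved; the class and method names are made up for illustration). A log of operations ("+5", "-2", "*4") must be applied in the same order by every replica, which is the active-active style; a log of resulting states ("0", "5", "3", "12") only needs to be copied, which is the active-passive style. Replaying the first log, assuming an initial value of 0, yields the second:

```java
import java.util.ArrayList;
import java.util.List;

final class LogStyles {
    // Apply one operation such as "+5", "-2" or "*4" to the current value.
    static long apply(long state, String op) {
        long operand = Long.parseLong(op.substring(1));
        switch (op.charAt(0)) {
            case '+': return state + operand;
            case '-': return state - operand;
            case '*': return state * operand;
            default: throw new IllegalArgumentException("unknown operation: " + op);
        }
    }

    // Replay an operation log from an initial state and record every state
    // along the way; the result is the equivalent log of states.
    static List<Long> toStateLog(long initial, List<String> operations) {
        List<Long> states = new ArrayList<>();
        long state = initial;
        states.add(state);
        for (String op : operations) {
            state = apply(state, op);
            states.add(state);
        }
        return states;
    }

    public static void main(String[] args) {
        // prints [0, 5, 3, 12]
        System.out.println(toStateLog(0, List.of("+5", "-2", "*4")));
    }
}
```

Reordering the operations changes the result, which is why ordering matters so much for the operation-style log.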
  • 9. Events & Tables Two sides of the same coin
  • 10. events & tables
      1. User_add {id:1, email: “joh@doe.com”}
      2. User_change {id:1, email: “john@doe.com”}
      3. User_add {id:2, email: “mary@doe.com”}
      4. User_change {id:2, email: “mary@doe.gr”}
      5. User_add {id:3, email: “peter@doe.com”}
      Table User
      id  email
      1   ?
      2   ?
      3   ?
  • 11. events & tables
      1. User_add {id:1, email: “joh@doe.com”}
      2. User_change {id:1, email: “john@doe.com”}
      3. User_add {id:2, email: “mary@doe.com”}
      4. User_change {id:2, email: “mary@doe.gr”}
      5. User_add {id:3, email: “peter@doe.com”}
      Table User
      id  email
      1   john@doe.com
      2   mary@doe.gr
      3   peter@doe.com
  • 12. events & tables
      1. User_add {id:1, email: “joh@doe.com”}
      2. User_change {id:1, email: “john@doe.com”}
      3. User_add {id:2, email: “mary@doe.com”}
      4. User_change {id:2, email: “mary@doe.gr”}
      5. User_add {id:3, email: “peter@doe.com”}
      6. User_remove {id:1}
      Table User
      id  email
      1   john@doe.com
      2   mary@doe.gr
      3   peter@doe.com
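The "events & tables" slides can be read as a fold over the event log. Below is a minimal sketch in plain Java, assuming a hypothetical `Event` class with a type, an id and an email (none of these names come from the slides). It treats `User_remove` as deleting the row, which is only one possible interpretation of event 6:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EventReplay {
    // Hypothetical event shape: type is "User_add", "User_change" or "User_remove".
    static final class Event {
        final String type;
        final int id;
        final String email; // may be null for User_remove

        Event(String type, int id, String email) {
            this.type = type;
            this.id = id;
            this.email = email;
        }
    }

    // Rebuild the User table (id -> email) by applying the events in log order.
    static Map<Integer, String> replay(List<Event> log) {
        Map<Integer, String> table = new LinkedHashMap<>();
        for (Event e : log) {
            switch (e.type) {
                case "User_add":
                case "User_change":
                    table.put(e.id, e.email); // last write wins
                    break;
                case "User_remove":
                    table.remove(e.id);
                    break;
                default:
                    throw new IllegalArgumentException("unknown event type: " + e.type);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Event> log = List.of(
            new Event("User_add", 1, "joh@doe.com"),
            new Event("User_change", 1, "john@doe.com"),
            new Event("User_add", 2, "mary@doe.com"),
            new Event("User_change", 2, "mary@doe.gr"),
            new Event("User_add", 3, "peter@doe.com"));
        // prints {1=john@doe.com, 2=mary@doe.gr, 3=peter@doe.com}
        System.out.println(replay(log));
    }
}
```

The table is fully derived: dropping it and replaying the log gives the same rows again, while the reverse (recovering the events from the table) is not possible.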
  • 14. I have a cool idea!!! Report users with more than 2 email edits per day as suspicious. What if we also did X...
  • 15. Decoupling from the table Tables are opinionated views of your data The later you can defer your opinion, the better...
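To illustrate deferring the opinion: the "suspicious users" report cannot be answered from the User table, which only keeps the latest email, but it can be derived from the event log at any later time. A rough plain-Java sketch follows, assuming each change event carries a user id and a timestamp (the `EmailChange` class and the choice of UTC calendar days are assumptions, not something the slides specify):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SuspiciousUsers {
    // Hypothetical event: one email change by one user at a point in time.
    static final class EmailChange {
        final int userId;
        final Instant at;

        EmailChange(int userId, Instant at) {
            this.userId = userId;
            this.at = at;
        }
    }

    // Ids of users with more than maxEditsPerDay changes on any single UTC day.
    static Set<Integer> suspicious(List<EmailChange> log, int maxEditsPerDay) {
        Map<String, Integer> editsPerUserDay = new HashMap<>();
        Set<Integer> result = new TreeSet<>();
        for (EmailChange change : log) {
            LocalDate day = change.at.atZone(ZoneOffset.UTC).toLocalDate();
            String key = change.userId + "@" + day; // one counter per (user, day)
            int count = editsPerUserDay.merge(key, 1, Integer::sum);
            if (count > maxEditsPerDay) {
                result.add(change.userId);
            }
        }
        return result;
    }
}
```

Calling `suspicious(log, 2)` on a log where user 1 changed their email three times on the same day would report user 1. The same log can feed both this report and the User table without either view knowing about the other.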
  • 19. Basic terminology: Kafka maintains feeds of messages in categories called topics. Kafka Producers are processes that publish messages to a Kafka topic. Kafka Consumers are processes that subscribe to topics and process the feed of published messages. Messages can be in any form: text, JSON, binary, etc. Kafka brokers are the servers that make up the Kafka cluster.
  • 20. Topics & Logs 1 Topic ≈ 1 Partitioned Log (with configurable retention period) What is the difference with “The Log”? Ordering...
  • 21. Kafka Messages Anything can be a message => Strings, JSON, XML, binary formats Messages are preferably small (< a few KB) => Big messages affect brokers & throughput => Reference big messages as external resources Message IDs (keys) define which partition the message is stored in => Default behavior / can be manually overridden Msg ID Body 1 2 3 4 5 ~~~
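The mapping from message ID (key) to partition can be sketched as "hash the key, take it modulo the partition count". The class below is an illustration only: Kafka's default partitioner hashes the key bytes with murmur2, whereas this sketch uses `Arrays.hashCode` as a stand-in, so the partition numbers it produces will not match a real cluster.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPartitioner {
    // Map a message key to a partition in [0, numPartitions).
    // Assumption: Arrays.hashCode stands in for Kafka's murmur2 hash.
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Clear the sign bit so the modulo result is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition.
        System.out.println(partitionFor("user-1", 6) == partitionFor("user-1", 6));
    }
}
```

Because the same key always maps to the same partition, all events for one user stay in order relative to each other, even though there is no ordering across partitions.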
  • 22. Producers & Consumers Message Delivery Semantics => At least once => At most once => Exactly once Multiple (independent) Consumers can read from the same topic => Consumers manage their own offset (stored on Kafka) => Messages remain on Kafka
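On the consumer side, the difference between at-most-once and at-least-once largely comes down to whether the offset is committed before or after a message is processed. The simulation below is a toy model in plain Java (no Kafka client; all names are invented): a single consumer crashes once while handling one message and then restarts from the last committed offset.

```java
import java.util.ArrayList;
import java.util.List;

final class DeliverySemantics {
    // Simulates one consumer reading `log` that crashes exactly once, while
    // handling the message at index crashAt, and then restarts from the last
    // committed offset. Returns the messages that were actually processed.
    static List<String> consume(List<String> log, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;      // offset stored on the broker side
        int position = 0;       // next message this consumer will read
        boolean crashed = false;
        while (position < log.size()) {
            boolean crashNow = !crashed && position == crashAt;
            if (commitBeforeProcessing) {
                committed = position + 1;           // commit first
                if (crashNow) {                     // crash before processing
                    crashed = true;
                    position = committed;           // restart skips the message
                    continue;
                }
                processed.add(log.get(position));
            } else {
                processed.add(log.get(position));   // process first
                if (crashNow) {                     // crash before committing
                    crashed = true;
                    position = committed;           // restart re-reads the message
                    continue;
                }
                committed = position + 1;
            }
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = List.of("a", "b", "c");
        System.out.println(consume(log, 1, true));  // prints [a, c]
        System.out.println(consume(log, 1, false)); // prints [a, b, b, c]
    }
}
```

Committing first loses "b" (at most once); committing last processes "b" twice (at least once). Exactly-once is not modelled here; it needs extra machinery such as idempotent processing or transactions.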
  • 23. Consumer Scaling Every Consumer has a Consumer ID => Kafka keeps track of the offsets => rejoining will continue from the latest committed offset Consumers with the same ID belong to the same Consumer Group Consumers from the same group read messages from different partitions => The number of topic partitions is the hard limit
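Why the partition count is a hard limit on scaling can be seen with a simplified assignment routine. This is a plain round-robin sketch with invented names, not one of Kafka's actual assignors, but it shows the key property: each partition goes to exactly one consumer in the group, so consumers beyond the number of partitions receive nothing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupAssignment {
    // Spread partitions 0..numPartitions-1 over the consumers of one group.
    // Each partition is owned by exactly one consumer.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        if (consumers.isEmpty()) {
            return assignment; // nobody to assign to
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // prints {c1=[0], c2=[1], c3=[2], c4=[]}
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
    }
}
```

With 3 partitions and 4 consumers, the fourth consumer sits idle; adding more partitions is the only way to put it to work.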
  • 25. Print topic messages (Consumer API)
      public class KafkaConsumerExample {
          ...
          static void runConsumer() throws InterruptedException {
              final Consumer<Long, String> consumer = createConsumer();
              try {
                  while (true) {
                      // poll() needs a timeout: wait up to 1 second for new records
                      final ConsumerRecords<Long, String> consumerRecords =
                          consumer.poll(Duration.ofMillis(1000));
                      consumerRecords.forEach(record -> {
                          System.out.printf("Consumer Record:(%d, %s, %d, %d)\n",
                              record.key(), record.value(), record.partition(), record.offset());
                      });
                      // commit the offsets of the records returned by the last poll
                      consumer.commitAsync();
                  }
              } finally {
                  // reached only if the loop exits with an exception
                  consumer.close();
              }
          }
      }
      https://dzone.com/articles/writing-a-kafka-consumer-in-java
  • 26. Print topic messages (KStream API)
      static void createWordCountStream(final StreamsBuilder builder) {
          final KStream<String, String> textLines = builder.stream(inputTopic);
          // foreach receives both the key and the value of each record
          textLines.foreach((key, value) -> System.out.println(value));
      }
  • 27. KStream API Abstraction on top of classic Kafka API Provides a Java Stream-like API on Kafka => has a source, sink & stream processors => aggregations, windowing, KTables / state Source processors => consume one or more topics, forward downstream Stream processors => apply transformations, aggregations, etc Sink processors => final processors, store results on a topic or KTable
  • 28. KTables Read-only & fault-tolerant state store Can be in memory (on consumer) or global (persistent on the cluster) RocksDB is a default implementation Example: keep an up-to-date table of [user id, email]
  • 29. Aggregations & Windowing
      Example: Users with more than 2 email changes in 24 hours
      final KTable<Windowed<String>, Long> anomalousUsers = views
          .groupByKey()
          .windowedBy(TimeWindows.of(Duration.ofDays(1)))
          .count()
          // "more than 2" means strictly greater than 2
          .filter((windowedUserId, count) -> count > 2);
      final KStream<String, Long> anomalousUsersForConsole = anomalousUsers
          .toStream()
          // filtered-out rows show up as null counts in the stream
          .filter((windowedUserId, count) -> count != null)
          .map((windowedUserId, count) -> new KeyValue<>(windowedUserId.toString(), count));
      anomalousUsersForConsole.to("AnomalousUsers", Produced.with(stringSerde, longSerde));
  • 30. KSQL
      CREATE TABLE users_view AS SELECT * FROM USERS;
      CREATE STREAM suspicious_users AS
        SELECT id, count(*) FROM USERS
        WINDOW SESSION (1 DAY)
        GROUP BY id
        HAVING count(*) > 2;
      Disclaimer: most probably has typos !!!
  • 31. Connectors & Schema Registries Big library of source & sink Kafka connectors => S3, most DBs, ES, HDFS, other streaming frameworks => e.g. write this topic to MySQL table X => Kafka Connect API for writing a custom one Schemas always change, not only in RDBMS :) => Message body freedom has a cost => Schema registries come to the… hmm... help… => data validation, de/serialization, backwards compatibility
  • 32. Summary Logs are everywhere Logs help us defer data schema decisions Kafka is an event logging data store It has a great ecosystem of libraries & integrations
  • 33. Thank you for your attention Questions?