1
Look Ma, no Code!
Building Streaming Data
Pipelines with Apache Kafka
Big Data LDN, 16 Nov 2017
@rmoff robin@confluent.io
2
Let’s take a trip back in time. Each application has its
own database for storing information. But we want
that information elsewhere for analytics and
reporting.
3
We don't want to query the transactional system, so
we create a process to extract from the source to a
data warehouse / lake
4
Let’s take a trip back in time
We want to unify data from multiple systems, so
create conformed dimensions and batch processes
to federate our data. This is all batch driven, so
latency is built in by design.
5
Let’s take a trip back in time
As well as our data warehouse, we want to use our
transactional data to populate search replicas,
graph databases, NoSQL stores…all introducing
more point-to-point dependencies in our system
6
Let’s take a trip back in time
Ultimately we end up with a spaghetti architecture. It
can't scale easily, it's tightly coupled, it's generally
batch-driven and we can't get data when we want it
where we want it.
7
But…there's hope!
8
Apache Kafka, a distributed streaming platform,
enables us to decouple all our applications creating
data from those utilising it. We can create low-latency
streams of data, transformed as necessary.
9
But…to use stream processing, we need to be Java
coders…don't we?
10
Happy days! We can actually build streaming data
pipelines using just our bare hands, configuration
files, and SQL.
11
Streaming ETL, with Apache Kafka and Confluent Platform
12
$ whoami
• Partner Technology Evangelist @ Confluent
• Working in data & analytics since 2001
• Oracle ACE Director
• Blogging : http://rmoff.net &
https://www.confluent.io/blog/author/robin/
• Twitter: @rmoff
• Geek stuff
• Beer & Fried Breakfasts
13
14
15
What does a streaming platform do?
• Publish and subscribe to streams of data, similar to a message queue or enterprise messaging system.
• Store streams of data in a fault-tolerant way.
• Process streams of data in real time, as they occur.
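To make the publish/subscribe and store ideas on this slide concrete, here is a minimal in-memory sketch. It is a toy stand-in, not how Kafka is implemented; the class name TinyLog and the group names are invented for illustration.

```python
from collections import defaultdict


class TinyLog:
    """In-memory stand-in for a set of topics; not how Kafka is implemented."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only list of events
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, event):
        """Append an event to the topic and return its offset."""
        self.topics[topic].append(event)
        return len(self.topics[topic]) - 1

    def poll(self, group, topic):
        """Return events this group has not read yet; the stored log is unchanged."""
        start = self.offsets[(group, topic)]
        events = self.topics[topic][start:]
        self.offsets[(group, topic)] = start + len(events)
        return events


log = TinyLog()
log.publish("rental", {"rental_id": 1})
log.publish("rental", {"rental_id": 2})

analytics = log.poll("analytics", "rental")       # both events
search = log.poll("search", "rental")             # the same events, read independently
log.publish("rental", {"rental_id": 3})
analytics_next = log.poll("analytics", "rental")  # only the new event
```

Because events are stored rather than deleted on read, each consumer keeps its own position and new consumers can be added without touching the producer.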
16
17
Kafka Connect : Separation of Concerns
18
Kafka Connect : Stream data in and out of Kafka
Amazon S3
19
Streaming Application Data to Kafka
• Applications are a rich source of events
• Modifying applications is not always possible or
desirable
• And what if the data gets changed within the
database or by other apps?
• JDBC is one option for extracting data
• Confluent Open Source includes JDBC source &
sink connectors
20
Liberate Application Data into Kafka with CDC
• Relational databases use transaction logs to
ensure Durability of data
• Change-Data-Capture (CDC) mines the log to get
raw events from the database
• CDC tools that integrate with Kafka Connect
include:
• Debezium
• DBVisit
• GoldenGate
• Attunity
• + more
21
But I need to
join…aggregate…filter…
22
KSQL from Confluent
A Developer Preview of
KSQL
An Open Source Streaming SQL
Engine for Apache Kafka™
23
KSQL: a Streaming SQL Engine for Apache Kafka™ from Confluent
• Enables stream processing with zero coding required
• The simplest way to process streams of data in real-time
• Powered by Kafka: scalable, distributed, battle-tested
• All you need is Kafka: no complex deployments of bespoke systems for
stream processing
Ksql>
24
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
KSQL: the Simplest Way to Do Stream Processing
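For readers who want to see what the windowed aggregation in this query is doing, here is a rough batch sketch. It only mirrors the arithmetic (5-second tumbling windows, count per card, keep counts above 3); KSQL evaluates this continuously over a topic, and the sample data below is made up.

```python
from collections import Counter

WINDOW_MS = 5_000   # WINDOW TUMBLING (SIZE 5 SECONDS)
THRESHOLD = 3       # HAVING count(*) > 3


def possible_fraud(attempts):
    """attempts: iterable of (timestamp_ms, card_number) pairs.

    Returns {(window_start_ms, card_number): count} for every window in
    which a card was seen more than THRESHOLD times.
    """
    counts = Counter()
    for timestamp_ms, card_number in attempts:
        # A tumbling window is a fixed, non-overlapping bucket of time.
        window_start = timestamp_ms - (timestamp_ms % WINDOW_MS)
        counts[(window_start, card_number)] += 1
    return {key: n for key, n in counts.items() if n > THRESHOLD}


# Hypothetical sample data: card "1111" is tried four times within one window.
attempts = [
    (1_000, "1111"), (1_500, "1111"), (2_000, "1111"), (4_999, "1111"),
    (5_000, "1111"),                    # falls into the next window
    (1_200, "2222"), (3_000, "2222"),   # only two attempts, not flagged
]
flagged = possible_fraud(attempts)
```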
25
Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
26
Streaming ETL with Apache Kafka and Confluent Platform
27
Streaming ETL with Apache Kafka and Confluent Platform
28
Define a connector
29
Load the connector
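These two slides are screenshots, so as a rough sketch of the idea: a connector is defined as a JSON document and loaded by POSTing it to the Kafka Connect REST API. The connector name, JDBC URL, credentials, column name and worker address below are placeholders, not values from the talk.

```python
import json
import urllib.request

# Assumed values: the connector name, JDBC URL, credentials and column name
# are placeholders, not taken from the slides.
connector = {
    "name": "jdbc-source-sakila",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/sakila?user=USER&password=PASSWORD",
        "mode": "incrementing",
        "incrementing.column.name": "rental_id",
        "table.whitelist": "rental",
        "topic.prefix": "sakila-",
    },
}


def build_load_request(connector, connect_url="http://localhost:8083"):
    """Build (but do not send) the POST that registers a connector."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_load_request(connector)
# With a Connect worker running, urllib.request.urlopen(request) would submit it.
```

With a prefix of sakila- and a table called rental, the rows would land on a topic named sakila-rental.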
30
Tables → Topics
31
Row → Message
32
Single Message Transforms
http://kafka.apache.org/documentation.html#connect_transforms
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
33
Single Message Transforms
http://kafka.apache.org/documentation.html#connect_transforms
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
Record data
Bespoke
lineage data
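As a sketch of the "record data plus bespoke lineage data" idea: a Single Message Transform takes one record and returns a modified one, and transforms are applied as a chain. The Python below only imitates that behaviour for a static-field insert; the field names and values are invented for illustration, and in practice this is done with connector configuration rather than code.

```python
def insert_static_field(record, field, value):
    """Return a copy of the record value with one extra field added."""
    enriched = dict(record)
    enriched[field] = value
    return enriched


def apply_transforms(record, transforms):
    """Apply each transform in order, as a connector does with its chain."""
    for transform in transforms:
        record = transform(record)
    return record


# Hypothetical lineage fields; the names and values are illustrative only.
chain = [
    lambda r: insert_static_field(r, "source_system", "mysql-sakila"),
    lambda r: insert_static_field(r, "pipeline", "kafka-connect"),
]

row = {"rental_id": 1, "customer_id": 130}
message = apply_transforms(row, chain)
```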
34
Streaming ETL with Apache Kafka and Confluent Platform
35
Kafka Connect to stream Kafka Topics to Elasticsearch…MySQL…& more
{
  "name": "es-sink-avro-02",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "connection.url": "http://localhost:9200",
    "type.name": "kafka-connect",
    "topics": "sakila-avro-rental",
    "key.ignore": "true",
    "transforms": "dropPrefix",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "sakila-avro-(.*)",
    "transforms.dropPrefix.replacement": "$1"
  }
}
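The dropPrefix transform in this config renames the topic before the sink sees it, so the Elasticsearch index ends up without the sakila-avro- prefix. A small Python imitation of the RegexRouter logic, assuming (as in the Kafka implementation) that the rename only applies when the whole topic name matches:

```python
import re


def regex_router(topic, regex, replacement):
    """Rename a topic if the whole name matches; otherwise leave it alone."""
    if re.fullmatch(regex, topic):
        return re.sub(regex, replacement, topic, count=1)
    return topic


# Same pattern as the config; Java's "$1" is written r"\1" in Python.
PATTERN = r"sakila-avro-(.*)"

index_for_rental = regex_router("sakila-avro-rental", PATTERN, r"\1")
unchanged = regex_router("other-topic", PATTERN, r"\1")
```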
36
Kafka Connect to stream Kafka Topics to Elasticsearch…MySQL…& more
37
Popular Rental Titles over Time
38
Kafka Connect + Schema Registry = WIN
MySQL
Avro
Message
Elasticsearch
Schema
Registry
Avro
Schema
Kafka
Connect
Kafka
Connect
39
Kafka Connect + Schema Registry = WIN
MySQL
Avro
Message
Elasticsearch
Schema
Registry
Avro
Schema
Kafka
Connect
Kafka
Connect
40
Streaming ETL with Apache Kafka and Confluent Platform
41
Streaming ETL with Apache Kafka and Confluent Platform
42
KSQL in action
ksql> CREATE stream rental
(rental_id INT, rental_date INT, inventory_id INT,
customer_id INT, return_date INT, staff_id INT,
last_update INT )
WITH (kafka_topic = 'sakila-rental',
value_format = 'json');
Message
----------------
Stream created
* Command formatted for clarity here.
Linebreaks need to be denoted by \ in KSQL
43
KSQL in action
ksql> describe rental;
Field | Type
--------------------------------
ROWTIME | BIGINT
ROWKEY | VARCHAR(STRING)
RENTAL_ID | INTEGER
RENTAL_DATE | INTEGER
INVENTORY_ID | INTEGER
CUSTOMER_ID | INTEGER
RETURN_DATE | INTEGER
STAFF_ID | INTEGER
LAST_UPDATE | INTEGER
44
KSQL in action
ksql> select * from rental limit 3;
1505830937567 | null | 1 | 280113040 | 367 | 130 |
1505830937567 | null | 2 | 280176040 | 1525 | 459 |
1505830937569 | null | 3 | 280722040 | 1711 | 408 |
45
KSQL in action
SELECT rental_id ,
TIMESTAMPTOSTRING(rental_date, 'yyyy-MM-dd HH:mm:ss.SSS'),
TIMESTAMPTOSTRING(return_date, 'yyyy-MM-dd HH:mm:ss.SSS')
FROM rental
limit 3;
1 | 2005-05-24 22:53:30.000 | 2005-05-26 22:04:30.000
2 | 2005-05-24 22:54:33.000 | 2005-05-28 19:40:33.000
3 | 2005-05-24 23:03:39.000 | 2005-06-01 22:12:39.000
LIMIT reached for the partition.
Query terminated
ksql>
46
KSQL in action
SELECT rental_id ,
TIMESTAMPTOSTRING(rental_date, 'yyyy-MM-dd HH:mm:ss.SSS'),
TIMESTAMPTOSTRING(return_date, 'yyyy-MM-dd HH:mm:ss.SSS'),
ceil((cast(return_date AS DOUBLE) -
cast(rental_date AS DOUBLE) )
/ 60 / 60 / 24 / 1000)
FROM rental;
1 | 2005-05-24 22:53:30.000 | 2005-05-26 22:04:30.000 | 2.0
2 | 2005-05-24 22:54:33.000 | 2005-05-28 19:40:33.000 | 4.0
3 | 2005-05-24 23:03:39.000 | 2005-06-01 22:12:39.000 | 8.0
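The arithmetic in this query can be checked outside KSQL. The sketch below parses the three timestamp pairs shown on the slide (treated as UTC, which is an assumption) into epoch milliseconds and applies the same expression; Python's ceil returns an integer where KSQL shows a double.

```python
import math
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def to_epoch_ms(text):
    """Parse a timestamp string as UTC and return milliseconds since the epoch."""
    parsed = datetime.strptime(text, FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def rental_length_days(rental_ms, return_ms):
    """Mirror of ceil((return_date - rental_date) / 60 / 60 / 24 / 1000)."""
    return math.ceil((return_ms - rental_ms) / 60 / 60 / 24 / 1000)


# The three rows shown on the slide: (rental_id, rental_date, return_date).
rows = [
    (1, "2005-05-24 22:53:30.000", "2005-05-26 22:04:30.000"),
    (2, "2005-05-24 22:54:33.000", "2005-05-28 19:40:33.000"),
    (3, "2005-05-24 23:03:39.000", "2005-06-01 22:12:39.000"),
]
lengths = [
    rental_length_days(to_epoch_ms(rented), to_epoch_ms(returned))
    for _, rented, returned in rows
]
```

Dividing by 1000, 60, 60 and 24 converts milliseconds to days, and the ceiling counts any part of a day as a full day.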
47
KSQL in action
CREATE stream rental_lengths AS
SELECT rental_id ,
TIMESTAMPTOSTRING(rental_date, 'yyyy-MM-dd HH:mm:ss.SSS') AS rental_date ,
TIMESTAMPTOSTRING(return_date, 'yyyy-MM-dd HH:mm:ss.SSS') AS return_date ,
ceil(( cast(return_date AS DOUBLE) - cast( rental_date AS DOUBLE)
) / 60 / 60 / 24 / 1000) AS rental_length_days
FROM rental;
48
KSQL in action
ksql> select rental_id, rental_date, return_date,
RENTAL_LENGTH_DAYS from rental_lengths;
3 | 2005-05-24 23:03:39.000 | 2005-06-01 22:12:39.000 | 8.0
4 | 2005-05-24 23:04:41.000 | 2005-06-03 01:43:41.000 | 10.0
7 | 2005-05-24 23:11:53.000 | 2005-05-29 20:34:53.000 | 5.0
49
KSQL in action
$ kafka-topics --zookeeper localhost:2181 --list
RENTAL_LENGTHS
$ kafka-console-consumer --bootstrap-server localhost:9092 \
--from-beginning --topic RENTAL_LENGTHS | jq '.'
{ "RENTAL_DATE": "2005-05-24 22:53:30.000",
"RENTAL_LENGTH_DAYS": 2,
"RETURN_DATE": "2005-05-26 22:04:30.000",
"RENTAL_ID": 1
}
50
KSQL in action
CREATE stream long_rentals AS
SELECT * FROM rental_lengths WHERE rental_length_days > 7;
ksql> select rental_id, rental_date, return_date,
RENTAL_LENGTH_DAYS from long_rentals;
3 | 2005-05-24 23:03:39.000 | 2005-06-01 22:12:39.000 | 8.0
4 | 2005-05-24 23:04:41.000 | 2005-06-03 01:43:41.000 | 10.0
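The filter behind long_rentals is a plain predicate applied to each message as it arrives. A generator gives a rough feel for that in Python; the three input rows are the ones shown for rental_lengths earlier in the deck, and the lowercase key names are a simplification.

```python
def long_rentals(rental_lengths, min_days=7):
    """Yield only the rows whose rental_length_days exceeds min_days."""
    for row in rental_lengths:
        if row["rental_length_days"] > min_days:
            yield row


rows = [
    {"rental_id": 3, "rental_length_days": 8.0},
    {"rental_id": 4, "rental_length_days": 10.0},
    {"rental_id": 7, "rental_length_days": 5.0},
]
kept = [row["rental_id"] for row in long_rentals(rows)]
```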
51
KSQL in action
$ kafka-console-consumer --bootstrap-server localhost:9092 \
--from-beginning --topic LONG_RENTALS | jq '.'
{ "RENTAL_DATE": " 2005-05-24 23:03:39.000",
"RENTAL_LENGTH_DAYS": 8,
"RETURN_DATE": " 2005-06-01 22:12:39.000",
"RENTAL_ID": 3
}
52
Streaming ETL with Kafka Connect and KSQL
MySQL
Kafka
Connect
Kafka
Cluster
rental
rental_lengths
long_rentals
Elasticsearch
CREATE STREAM RENTAL_LENGTHS AS
SELECT END_DATE - START_DATE
[…] FROM RENTAL
Kafka
Connect
CREATE STREAM LONG_RENTALS AS
SELECT … FROM RENTAL_LENGTHS
WHERE DURATION > 14
53
Streaming ETL with Apache Kafka and Confluent Platform
54
Streaming ETL with Apache Kafka and Confluent Platform
55
Kafka Connect to stream Kafka Topics to Elasticsearch…MySQL…& more
{
  "name": "es-sink-rental-lengths-02",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "schema.ignore": "true",
    "connection.url": "http://localhost:9200",
    "type.name": "kafka-connect",
    "topics": "RENTAL_LENGTHS",
    "topic.index.map": "RENTAL_LENGTHS:rental_lengths",
    "key.ignore": "true"
  }
}
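KSQL creates its output topics in upper case, while Elasticsearch index names must be lower case, which is why this config maps RENTAL_LENGTHS to rental_lengths. A sketch of how a "topic:index" mapping string like this one can be read; the function names are invented and this is not the connector's own code.

```python
def parse_topic_index_map(value):
    """Parse "TOPIC:index,TOPIC2:index2" into a dict."""
    mapping = {}
    for pair in value.split(","):
        topic, index = pair.strip().split(":", 1)
        mapping[topic] = index
    return mapping


def index_for(topic, mapping):
    """Use the mapped index if there is one, else fall back to the topic name."""
    return mapping.get(topic, topic)


mapping = parse_topic_index_map("RENTAL_LENGTHS:rental_lengths")
target_index = index_for("RENTAL_LENGTHS", mapping)
```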
56
Plot data from KSQL-derived stream
57
Distribution of rental durations, per week
58
Streaming ETL with Apache Kafka and Confluent Platform
MySQL
Elasticsearch
Kafka
Connect
Kafka
Connect
Kafka
Cluster
KSQL
Kafka
Streams
59
Streaming ETL with Apache Kafka and Confluent Platform – no coding!
MySQL
Elasticsearch
Kafka
Connect
Kafka
Connect
Kafka
Cluster
KSQL
Kafka
Streams
60
Streaming ETL, powered by Apache Kafka and Confluent Platform
KSQL
61
62
Confluent Platform: Enterprise Streaming based on Apache Kafka®
Database Changes | Log Events | IoT Data | Web Events | …
Data Integration: CRM | Data Warehouse | Database | Hadoop | …
Real-time Applications: Monitoring | Analytics | Custom Apps | Transformations | …
Apache Open Source | Confluent Open Source | Confluent Enterprise
Confluent Platform
• Apache Kafka®: Core | Connect API | Streams API
• Data Compatibility: Schema Registry
• Monitoring & Administration: Confluent Control Center | Security
• Operations: Replicator | Auto Data Balancing
• Development and Connectivity: Clients | Connectors | REST Proxy | KSQL | CLI
63
64
https://github.com/confluentinc/ksql/
https://www.confluent.io/download/
Streaming ETL, powered by Apache Kafka and Confluent Platform
@rmoff robin@confluent.io
