Apache Kafka, Tiered Storage and TensorFlow for
Streaming Machine Learning without a Data Lake
Kai Waehner
Technology Evangelist
contact@kai-waehner.de
LinkedIn
@KaiWaehner
www.confluent.io
www.kai-waehner.de
Disclaimer – Status for Tiered Storage in August 2020
KIP-405 – Add Tiered Storage Support to Kafka (in the works)
Confluent is actively working on this with the open source community; Uber is leading this initiative.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Confluent Tiered Storage is available today in Confluent Platform and used under the hood in Confluent Cloud.
www.kai-waehner.de | @KaiWaehner | Streaming Machine Learning without a Data Lake
Event Streaming
[Diagram: events on a time axis passing through stream processing: filter, analyze in-flight, create and store materialized views]
Machine Learning to Improve Traditional
and to Build New Use Cases
Categories: Real Time Information, Digital Transformation, Strategic Goals
● Real Time Tracking
● Predictive Maintenance
● Fraud Detection
● Cross Selling
● Transportation Rerouting
● Customer Service
● Inventory Management
● Autonomous Driving
● Face Recognition
● Robotics
● Speech Translation
● Video Generation
● Supply Chain Optimization
● Simulations
● Customer Churn
Global Automotive Company
Builds Connected Car Infrastructure
Digital Transformation
● Improve Customer Experience
● Increase Revenue
● Reduce Risk
Timeline
● 3 years ago: Project begins
● Today: Connected car infrastructure in production for first use cases
● 2 years in the future: Improved processes leveraging machine learning (predictive maintenance, cross-selling)
Streaming Analytics for
Predictive Maintenance at Scale
[Architecture diagram. Components: Car Sensors, IoT Integration Layer, Streaming Platform, Big Data Integration Layer, Batch Analytics Platform, BI Dashboard, Real Time Monitoring System. Flow labels: Ingest Data, All Data, Critical Data, Human Intelligence. Legend: Streaming Platform, Other Components.]
Machine Learning (ML)
...allows computers to find hidden insights without
being programmed where to look
Machine Learning
● Decision Trees
● Naïve Bayes
● Clustering
● Neural Networks
● Etc.
Deep Learning
● CNN
● RNN
● Transformer
● Autoencoder
● Etc.
Streaming Analytics for
Predictive Maintenance at Scale
[Architecture diagram. Components: Car Sensors, IoT Integration Layer, Streaming Platform, Big Data Integration Layer, Batch Analytics Platform, BI Dashboard, Real Time Monitoring System, Analytics Platform. Flow labels: Ingest Data, All Data, Critical Data, Potential Detect, Consume Data, Preprocess Data, Data Processing, Train Analytic Model, Deploy Analytic Model. Legend: Streaming Platform, Analytics Platform, Other Components.]
The First
Analytic Models
How to deploy the models in production?
…real-time processing?
…at scale?
…24/7 with zero downtime?
Hidden Technical Debt
in Machine Learning Systems
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Scalable, Technology-Agnostic ML Infrastructures
What is this thing used everywhere?
https://www.infoq.com/presentations/netflix-ml-meson
https://eng.uber.com/michelangelo
https://www.infoq.com/presentations/paypal-data-service-fraud
A Streaming Platform -
The Underpinning of an Event-Driven Architecture
[Diagram: streams of real-time events at the center, with connectors and stream processing apps on both sides. Producers: microservices, DBs, SaaS apps, mobile. Consumers: Customer 360, real-time fraud detection, data warehouse. Event types: database change, microservices events, SaaS data, customer experiences.]
Apache Kafka as Infrastructure for ML
Apache Kafka’s Open Ecosystem as Infrastructure for ML
● Kafka Streams / ksqlDB
● Kafka Connect
● Confluent REST Proxy
● Confluent Schema Registry
● Go/.NET/Python Kafka Producer
● ksqlDB Python Client
Ingestion of
IoT Data
[Diagram. Components: Cars, Kafka Connect, Kafka, Replication (MirrorMaker / Confluent Replicator), Analytics / Machine Learning]
Data
Preprocessing
Preprocessing (Streams): filter, transform, anonymize, extract features
Result: data ready for model training
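The four preprocessing steps named on this slide can be sketched in a few lines of plain Python. This is a minimal illustration only, not the Kafka Streams or ksqlDB code used in the talk: the event fields (`car_id`, `sensor_input`), the Fahrenheit-to-Celsius conversion, and the salt value are all assumptions made for the example.

```python
import hashlib

def anonymize(car_id: str, salt: str = "demo-salt") -> str:
    """Replace the raw car ID with a shortened salted SHA-256 digest."""
    return hashlib.sha256((salt + car_id).encode("utf-8")).hexdigest()[:16]

def extract_features(readings):
    """Summarize a list of readings as simple numeric features."""
    mean = sum(readings) / len(readings)
    return {"mean": mean, "min": min(readings), "max": max(readings)}

def preprocess(events):
    """Filter, transform, anonymize and extract features, one event at a time."""
    for event in events:
        readings = event.get("sensor_input") or []
        if not readings:  # filter: drop events without readings
            continue
        # transform: hypothetical unit conversion from Fahrenheit to Celsius
        celsius = [(f - 32.0) * 5.0 / 9.0 for f in readings]
        yield {
            "car": anonymize(event["car_id"]),
            "features": extract_features(celsius),
        }

# Hypothetical input events
raw_events = [
    {"car_id": "car-1", "sensor_input": [212.0, 32.0]},
    {"car_id": "car-2", "sensor_input": []},
]
prepared = list(preprocess(raw_events))
```

Because `preprocess` is a generator, each event is handled as it arrives, which mirrors the per-event nature of stream processing.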
Preprocessing
with ksqlDB
SELECT c.car_id, c.event_id, c.car_model_id, c.sensor_input
FROM car_sensor c
LEFT JOIN car_models m ON c.car_model_id = m.car_model_id
WHERE m.car_model_type = 'Audi_A8';
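For readers less familiar with stream-table joins, the query's logic can be mimicked in plain Python. This is a sketch under assumptions: the rows in `car_models` and `car_sensor` are invented, and a dictionary stands in for the lookup table that ksqlDB would maintain.

```python
# Lookup table keyed by car_model_id (hypothetical rows)
car_models = {
    1: {"car_model_type": "Audi_A8"},
    2: {"car_model_type": "Other_Model"},
}

def join_and_filter(sensor_events, models, wanted_type="Audi_A8"):
    """Look up each event's model row, then keep only one model type."""
    for event in sensor_events:
        model = models.get(event["car_model_id"])  # LEFT JOIN: may be None
        if model is None or model["car_model_type"] != wanted_type:
            continue  # the WHERE clause drops the row
        yield {
            key: event[key]
            for key in ("car_id", "event_id", "car_model_id", "sensor_input")
        }

# Hypothetical sensor events; the third has no matching model row
car_sensor = [
    {"car_id": "c1", "event_id": 10, "car_model_id": 1, "sensor_input": 0.7},
    {"car_id": "c2", "event_id": 11, "car_model_id": 2, "sensor_input": 0.1},
    {"car_id": "c3", "event_id": 12, "car_model_id": 9, "sensor_input": 0.4},
]
matches = list(join_and_filter(car_sensor, car_models))
```

Note that filtering on a column from the right-hand side discards unmatched rows, so the LEFT JOIN here behaves like an inner join.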
Data Ingestion
into a Data Store for Model Training
(and Consumption by other Decoupled Applications)
[Diagram: preprocessed data leaves Kafka via Connect and is consumed in batch, near real time, or real time]
Model Training Using an Elastic Infrastructure in the Cloud
Extreme scale using TensorFlow and TPUs in the cloud!
[Diagram label: Analytic Model]
TensorFlow Model —
Autoencoder for Anomaly Detection
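An autoencoder flags anomalies by reconstruction error: inputs it reconstructs poorly are considered unusual. The sketch below shows only that decision rule. It is not the TensorFlow model from the slide: `reconstruct` is a deliberately trivial, untrained stand-in (it compresses the input to its mean), and the threshold is an arbitrary assumption.

```python
def reconstruct(values):
    """Stand-in for a trained autoencoder (assumption, not a real model)."""
    code = sum(values) / len(values)  # "encoder": bottleneck of size 1
    return [code] * len(values)       # "decoder": expand back to input size

def reconstruction_error(values):
    """Mean squared error between the input and its reconstruction."""
    recon = reconstruct(values)
    return sum((v - r) ** 2 for v, r in zip(values, recon)) / len(values)

def is_anomaly(values, threshold):
    """Flag the input when its reconstruction error exceeds the threshold."""
    return reconstruction_error(values) > threshold

# Hypothetical sensor windows
normal = [1.0, 1.0, 1.0, 1.0]
spiky = [1.0, 1.0, 9.0, 1.0]
flags = [is_anomaly(normal, threshold=1.0), is_anomaly(spiky, threshold=1.0)]
```

In practice the threshold would be chosen from the error distribution observed on known-good data.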
Streaming Ingestion and Model Training with TensorFlow IO
https://github.com/tensorflow/io
Direct streaming ingestion for model training with TensorFlow I/O + Kafka Plugin (no additional data storage like S3 or HDFS required!)
[Diagram: a producer appends to the distributed commit log along a time axis; Model A, Model B, and Model X (at a later time) each read from the log]
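The idea that several models can train from the same commit log, each at its own position and time, can be sketched without Kafka. This is a toy model only: `CommitLog` and `train_from_log` are hypothetical names, "training" is reduced to a running mean, and none of this is the TensorFlow I/O API.

```python
class CommitLog:
    """Append-only list of records; readers keep their own offsets."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records):
        """Return up to max_records starting at offset, without removing them."""
        return self._records[offset:offset + max_records]

def train_from_log(log, start_offset=0, batch_size=2):
    """Toy 'training': stream batches from the log and compute their mean."""
    offset, total, count = start_offset, 0.0, 0
    while True:
        batch = log.read(offset, batch_size)
        if not batch:
            break
        total += sum(batch)
        count += len(batch)
        offset += len(batch)
    return total / count if count else None

log = CommitLog()
for value in [1.0, 2.0, 3.0, 4.0]:
    log.append(value)

model_a = train_from_log(log)                  # reads from offset 0
model_b = train_from_log(log, start_offset=2)  # independent reader
log.append(5.0)
model_x = train_from_log(log)                  # later reader sees all data
```

Reading never removes records, which is why a model trained at a later time can still consume the full history.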
Store Data Long-Term in Kafka?
Today, Kafka works well for recent events, short-horizon storage, and manual data balancing.
Kafka’s present-day design offers extraordinarily low messaging latency by storing topic data on fast disks that are collocated with brokers. This is usually good.
But sometimes, you need to store a huge amount of data for a long time.
[Diagram: an app connected to Kafka, which combines processing (transactions, auth, quota enforcement, compaction, ...) and storage]
Simplified Data Lake Architecture
Tiered Storage for Kafka provides
● one platform for all data processing
● an event-based source of truth for materialized views
● no need for a pipeline between Kafka and a Data Lake like Hadoop
Benefits
● cost reduction
● long-term backup
● performance isolation (real-time and historical analysis in the same cluster)
Confluent Tiered Storage for Kafka
[Diagram: apps connect to Kafka; processing (transactions, auth, quota enforcement, compaction, ...) and local storage sit in Kafka, remote storage is an object store]
Store Forever: Older data is offloaded to inexpensive object storage, permitting it to be consumed at any time.
Save $$$: Storage limitations, like capacity and duration, are effectively uncapped.
Instantaneously scale up and down: Your Kafka clusters will be able to automatically self-balance load and hence elastically scale.
(Only available in Confluent Platform)
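The offloading behavior described on this slide can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not Confluent's implementation: `TieredLog` is a hypothetical class, two dictionaries stand in for broker disks and an object store, and records are moved one at a time, whereas a real system would move whole log segments.

```python
class TieredLog:
    """Toy log that keeps recent records locally and offloads older ones."""

    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = {}   # offset -> record, stand-in for broker disks
        self.remote = {}  # offset -> record, stand-in for an object store
        self.next_offset = 0

    def append(self, record):
        offset = self.next_offset
        self.local[offset] = record
        self.next_offset += 1
        self._offload_oldest()
        return offset

    def _offload_oldest(self):
        """Move the oldest local records to the remote tier when over capacity."""
        while len(self.local) > self.local_capacity:
            oldest = min(self.local)
            self.remote[oldest] = self.local.pop(oldest)

    def read(self, offset):
        """Readers use one interface; the tier holding the record is hidden."""
        if offset in self.local:
            return self.local[offset]
        return self.remote[offset]

tiered = TieredLog(local_capacity=2)
for i in range(5):
    tiered.append(f"e{i}")
```

The point of the `read` method is that consumers address data by offset only and do not need to know which tier currently holds it.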
Use Cases for Reprocessing Historical Events
Give me all events from time A to time B
• New consumer application
• Error-handling
• Compliance / regulatory processing
• Query and analyze existing events
• Model training
[Diagram: a real-time producer appends events along a time axis; a real-time consumer reads the latest events while a consumer of historical data reads an earlier range]
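A request such as "give me all events from time A to time B" reduces to two lookups in a time-ordered log. The sketch below assumes an in-memory list of `(timestamp, value)` pairs sorted by timestamp; the function name and the sample data are hypothetical. Kafka clients offer a comparable timestamp-to-offset lookup (for example `offsetsForTimes` in the Java consumer), which this sketch does not call.

```python
import bisect

def events_between(log, time_a, time_b):
    """Return events with time_a <= timestamp < time_b from a sorted log."""
    timestamps = [ts for ts, _ in log]
    start = bisect.bisect_left(timestamps, time_a)  # first event at or after A
    end = bisect.bisect_left(timestamps, time_b)    # first event at or after B
    return log[start:end]

# Hypothetical log of (timestamp, value) pairs, ordered by timestamp
event_log = [(100, "a"), (200, "b"), (300, "c"), (400, "d")]
history = events_between(event_log, 200, 400)
```

Binary search keeps the lookup cheap even for long histories, because only the two boundary positions have to be located before reading sequentially.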
Separation of Model Training and Model Inference
[Diagram: model training in the cloud produces an analytic model; model deployment at the edge enables local predictions]
Stream Processing with External Model and RPC
[Diagram: a Streams application receives an input event, sends a request via gRPC / HTTP to a model server (TensorFlow Serving), receives a response, and emits a prediction]
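With an external model server, each prediction costs one network round trip. The sketch below builds such a request against the REST predict endpoint of TensorFlow Serving (`/v1/models/<name>:predict` with an `instances` payload). The host, port, and the model name `anomaly_model` are assumptions for illustration, and the `predict` function is defined but not called, since it needs a running server.

```python
import json
import urllib.request

def build_predict_request(base_url, model_name, instances):
    """Build a POST request for a TensorFlow Serving REST predict call."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_predictions(response_body):
    """Extract the predictions list from the JSON response body."""
    return json.loads(response_body)["predictions"]

def predict(base_url, model_name, instances, timeout=1.0):
    """One blocking HTTP round trip per call; requires a reachable server."""
    request = build_predict_request(base_url, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_predictions(response.read())

# Hypothetical endpoint and model name
request = build_predict_request(
    "http://localhost:8501", "anomaly_model", [[0.1, 0.2]]
)
```

The explicit timeout matters in this pattern: a slow or unavailable model server otherwise stalls the stream processor.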
Stream Processing with Embedded Model
[Diagram: a Streams application receives an input event, calls doPrediction() on a model embedded in the stream processing application, and emits the returned value as the prediction]
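With an embedded model, the prediction is an ordinary in-process function call with no network hop. The following sketch shows that shape only: `EmbeddedModel` is a hypothetical stand-in that applies a fixed threshold rather than a loaded TensorFlow model, and the event fields are invented.

```python
class EmbeddedModel:
    """Stand-in for a model loaded into the stream processor's own process."""

    def __init__(self, threshold):
        self.threshold = threshold

    def do_prediction(self, sensor_values):
        # Assumption: a real model would run inference here
        return max(sensor_values) > self.threshold

def process_stream(events, model):
    """Apply the in-process model to each event as it arrives."""
    for event in events:
        yield {
            "sensor_id": event["sensor_id"],
            "anomaly": model.do_prediction(event["sensor_values"]),
        }

# Hypothetical input events
events = [
    {"sensor_id": "s1", "sensor_values": [0.2, 0.3]},
    {"sensor_id": "s2", "sensor_values": [0.2, 7.5]},
]
predictions = list(process_stream(events, EmbeddedModel(threshold=5.0)))
```

The trade-off against a separate model server is that updating the model means redeploying or reloading the streaming application itself.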
Model Deployment with Apache Kafka, ksqlDB and TensorFlow
CREATE STREAM AnomalyDetection AS
SELECT sensor_id,
detectAnomaly(sensor_values)
FROM car_engine;
detectAnomaly is a User Defined Function (UDF).
Streaming Analytics with
Kafka and TensorFlow
[Architecture diagram. Components: Car Sensors, MQTT Proxy, Kafka Cluster, Kafka Connect, ksqlDB, Kafka Streams Application, Tiered Storage, TensorFlow, MongoDB (storage, dashboards, search, analytics), Mobile App, BI Tool. Flow labels: Ingest Data, All Data, Critical Data, Potential Detect, Consume Data, Preprocess Data, Train Analytic Model, Deploy Analytic Model. Legend: Kafka Ecosystem, TensorFlow, Other Components.]
Demo: 100,000 Connected Cars
(Kafka + ksqlDB + MQTT + TensorFlow)
https://github.com/kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference
Live Demo
Machine Learning + Apache Kafka
→ Examples @ GitHub
https://github.com/kaiwaehner
One pipeline to rule them all
Real-time model scoring, batch model training, near-real-time BI analytics
Give me all events from time A to time B
[Diagram: car sensors (MQTT connector) feed one event log along a time axis, read by the production infrastructure (Java) and the data science / analytics infrastructure (Python + Jupyter)]
Questions? Feedback?
Let’s connect!
Kai Waehner, Technology Evangelist
contact@kai-waehner.de | @KaiWaehner | LinkedIn
www.confluent.io | www.kai-waehner.de

Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake (Kai Waehner, Confluent) Kafka Summit 2020
