Serverless Kafka and Spark in a Multi-Cloud Data Lakehouse Architecture
Kai Waehner
Field CTO
kai.waehner@confluent.io
linkedin.com/in/kaiwaehner
@KaiWaehner
confluent.io
kai-waehner.de
Agenda
• Data Analytics at Rest
• Data Streaming in Motion
• Lakehouse: Data Streaming + Analytics
• A Lakehouse Example: Intelligent Connected Cars
• Cloud-Native vs. Serverless Infrastructure
• Central vs. Hybrid and Global Data Mesh
kai-waehner.de | @KaiWaehner | Serverless Apache Kafka and Spark across the Globe
Storage at Rest
USER    CREDIT_SCORE
JAY     695
SUE     430
FRED    710
V1 V3 V2
Analytics at Rest
Active Query: SELECT * FROM DB_TABLE
Passive Data: DB Table
Use Cases for Data at Rest
• Reporting
• Business Intelligence
• Data Engineering
• Big Data Analytics
• Machine Learning
Apache Spark – The De Facto Standard for Big Data at Rest
Big Data In Big Data Out
Big Data Storage and Processing
From Historical Data to Insights
Delta Lake: open-source storage framework and open format for data analytics
Real-time Data beats Slow Data.
Transportation
• Real-time sensor diagnostics
• Driver-rider match
• ETA updates
Insurance
• Claim processing
• Fraud detection
• Omnichannel quote processing
Retail
• Real-time inventory
• Real-time POS reporting
• Personalization
Entertainment
• Real-time recommendations
• Personalized news feed
• In-app purchases
Data at Rest
Active Query: SELECT * FROM DB_TABLE
Passive Data: DB Table
Data in Motion
Active Data: Event Stream
Passive Query: CREATE TABLE T AS SELECT * FROM EVENT_STREAM
Tables at Rest
USER    CREDIT_SCORE
JAY     695
SUE     430
FRED    710
V1 V3 V2
Streams in Motion
PAYMENTS    USER
42          JAY
18          SUE
65          FRED
...         ...
Data Streaming = Data at Rest + Data in Motion
Payments Stream
Credit Score Stream
CREATE TABLE credit_scores AS
SELECT user, updateScore(p.amount)…
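The idea that a table is a continuously updated view over a stream can be sketched in plain Python. This is only an illustration, not ksqlDB itself: the `Payment` type, the `update_score` rule (standing in for the slide's `updateScore`, whose real logic is not shown), and the starting score of 600 are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Payment:
    user: str
    amount: int


def update_score(current: int, amount: int) -> int:
    """Hypothetical rule: each payment adds amount // 10 points, capped at 850."""
    return min(850, current + amount // 10)


def materialize_credit_scores(payments: Iterable[Payment], start: int = 600) -> Dict[str, int]:
    """Fold a payments stream into a table holding the latest score per user."""
    table: Dict[str, int] = {}
    for p in payments:
        table[p.user] = update_score(table.get(p.user, start), p.amount)
    return table


# Events arrive in order; the table always reflects everything seen so far.
payments_stream = [
    Payment("JAY", 42),
    Payment("SUE", 18),
    Payment("FRED", 65),
    Payment("JAY", 100),
]
credit_scores = materialize_credit_scores(payments_stream)
```

Each new payment updates exactly one row, which is the sense in which the stream is the active side and the query is the passive one.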
Apache Kafka – The De Facto Standard for Data in Motion
[Diagram: streams of real-time events flow between producers and consumers, with connectors and stream processing apps on both sides. Systems shown: Database, CRM, Sensors, Mobile, Customer 360, Real-time Alerting System, Data warehouse. Event types shown: Incident, Alert, Forecast, Pricing, Customer, Order.]
Data Lakehouse
Building the Data Lakehouse
Author: Bill Inmon
Lakehouse is a logical view, not physical!
Lambda Architecture
Option 1: Unified serving layer
[Diagram: a Data Source feeds a Real-Time Layer (Data Processing in Motion, ms) and a Batch Layer (Data Processing at Rest, min/hr). Both feed one Serving Layer, which serves a Real-Time App (Data Processing in Motion) and a Batch App (Data Processing at Rest).]
Lambda Architecture
Option 2: Separate serving layers
[Diagram: a Data Source feeds a Real-Time Layer (Data Processing in Motion, ms) that produces a Speed View, and a Batch Layer (Data Processing at Rest, min/hr) that produces a Batch View. Queries shown: Real-time Query, Batch Query, Mixed Query.]
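A mixed query in a Lambda architecture has to combine results from the batch view and the speed view. The sketch below illustrates that merge with plain dictionaries. The view contents, the key names, and the additive merge rule are assumptions chosen for the example; real deployments define their own merge semantics.

```python
from typing import Dict


def mixed_query(batch_view: Dict[str, int], speed_view: Dict[str, int], key: str) -> int:
    """Combine the precomputed batch total with the not-yet-batched real-time delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Batch view: event counts per car up to the last batch run (minutes to hours old).
batch_view = {"car-1": 1200, "car-2": 830}
# Speed view: counts for events that arrived after that batch run.
speed_view = {"car-1": 7, "car-3": 2}

all_keys = sorted(batch_view.keys() | speed_view.keys())
totals = {key: mixed_query(batch_view, speed_view, key) for key in all_keys}
```

The cost of this design is visible even in the sketch: two code paths produce the two views, and the query layer must know how to reconcile them.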
Kappa Architecture
One pipeline for real-time and batch consumers
[Diagram: a Data Source feeds a single Real-Time Layer (Data Processing in Motion) with its own Storage. A Real-Time App (Data Processing in Motion, ms) and a Batch App (Data Processing at Rest, min/hr) both consume from that layer, each with its own Storage.]
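The single-pipeline idea can be sketched with an append-only log that every consumer reads at its own pace. The `EventLog` class below is a hypothetical in-memory stand-in for a Kafka topic partition, and the speed threshold of 120 is an arbitrary value picked for the example.

```python
from typing import List, Tuple


class EventLog:
    """Append-only log, standing in for a Kafka topic partition."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, offset: int) -> Tuple[List[dict], int]:
        """Return all events from `offset` onward plus the next offset to read."""
        return self._events[offset:], len(self._events)


log = EventLog()
for speed in (80, 95, 130):
    log.append({"car": "car-1", "speed": speed})

# Real-time consumer: tracks its own offset and only sees events it has not read.
_, rt_offset = log.read(0)  # caught up with the log so far
log.append({"car": "car-1", "speed": 140})
new_events, rt_offset = log.read(rt_offset)
alerts = [e for e in new_events if e["speed"] > 120]

# Batch consumer: replays the same log from offset 0 whenever it runs.
history, _ = log.read(0)
average_speed = sum(e["speed"] for e in history) / len(history)
```

Both consumers read the same log, so there is no second ingestion path to keep consistent.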
Kappa @ Uber
kai-waehner.de | @KaiWaehner | Kappa vs. Lambda Architecture
Confluent + Databricks Reference Architecture (on premises or any cloud)
• Sources: Legacy Data Stores (Netezza, Teradata, Oracle, Mainframes), Databases, IoT
• Kafka Connect
• Data Streaming: Data Streaming Platform built on Kafka, with Kafka Streams & ksqlDB for real-time stream processing and transformations
• Databricks Delta Lake Sink Connector for Confluent Cloud (AWS)
• Analytics: Databricks Data Science Workspace, Databricks BI Workspace
Connected Car Infrastructure at Audi
• Real Time Data Analysis
• Swarm Intelligence
• Collaboration with Partners
• Predictive AI
• …
https://www.youtube.com/watch?v=yGLKi3TMJv8
Kappa Architecture for a Lakehouse with Kafka and Spark
[Diagram. Legend: Kafka Ecosystem, Spark Ecosystem, Other Components. Components: Car Sensors, MQTT Proxy, Kafka Cluster, Kafka Connect, Kafka Streams, ksqlDB, Spark Core, Spark SQL, Spark MLlib, Storage, Reporting, BI Tool, Mobile App. Step labels: Ingest Data, Preprocess Data, Consume Data, Model Training, Deploy Analytic Model, Model Deployment. Data labels: All Data, Critical Data, Potential Detect.]
Machine Learning Model Training
with Spark MLlib
https://dev.to/siddhantpatro/spark-mllib-for-big-data-and-machine-learning-330j
Model Deployment with Apache Kafka, ksqlDB and Spark MLlib
CREATE STREAM AnomalyDetection AS
SELECT sensor_id, detectAnomaly(sensor_values)
FROM car_engine;
User Defined Function (UDF): detectAnomaly
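The slides do not show how `detectAnomaly` is implemented, so the following is only a sketch of the pattern: a scoring function applied to every event of the `car_engine` stream. A simple z-score check stands in for the model trained with Spark MLlib. The function names, the threshold of 2.0, and the sample readings are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, Iterator, List


def detect_anomaly(sensor_values: List[float], threshold: float = 2.0) -> bool:
    """Hypothetical stand-in for a trained model: z-score of the latest reading."""
    if len(sensor_values) < 3:
        return False  # not enough history to judge
    *history, latest = sensor_values
    spread = pstdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


def anomaly_detection(car_engine: Iterable[Dict]) -> Iterator[Dict]:
    """Apply the scoring function to each event, like a UDF in a streaming query."""
    for event in car_engine:
        yield {
            "sensor_id": event["sensor_id"],
            "anomaly": detect_anomaly(event["sensor_values"]),
        }


car_engine = [
    {"sensor_id": "s1", "sensor_values": [90.0, 91.0, 89.0, 90.0, 90.5]},
    {"sensor_id": "s2", "sensor_values": [90.0, 91.0, 89.0, 90.0, 120.0]},
]
results = list(anomaly_detection(car_engine))
```

Because the scoring function is stateless per event, it can run inside the stream processor without a call to an external model server.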
Stream Processing with Kafka or Spark?
Kafka Streams / ksqlDB
• Component of the data streaming infrastructure
• Low latency
• Focus on 24/7 operations
• Lightweight, decoupled microservices
Spark Streaming
• Component of the data analytics infrastructure
• Strong integration with the rest of the Spark ecosystem
• Stream and batch
• Machine Learning “embedded”
Cloud-Native Deployment
→ Elastic Infrastructure and Faster Time-to-Market
What is a (truly) fully-managed SaaS?
Deployment options: Self-managed IaaS, Hosted Cloud Service, Fully Managed SaaS
For each option, the slide marks every layer below as either "You Manage" or "Provider Managed":
• Scaling
• Load balancing
• Partition placement
• Logical Storage
• Broker settings
• ZooKeeper
• Kafka patching
• JVM
• O/S
• VMs
• Servers
Chart: provider-managed features vs. product ease of use, ranging from Self-Managed through Partially Managed to Fully Managed
AWS Cloud Outage hit Disney World Visitors…
https://www.cnet.com/tech/services-and-software/disney-parks-were-already-facing-heat-from-fans-then-an-aws-outage-came-along/
Disaster Recovery – RPO and RTO
RPO = Recovery Point Objective
RTO = Recovery Time Objective
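RPO bounds how much recent data may be lost, measured as a time window, and RTO bounds how long the service may stay down. The short sketch below shows the arithmetic for a single outage. The timestamps and the target values are made up for illustration, and `recovery_metrics` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(
    last_replicated: datetime,
    outage_start: datetime,
    service_restored: datetime,
) -> Tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for one outage."""
    data_loss = outage_start - last_replicated  # compare against the RPO
    downtime = service_restored - outage_start  # compare against the RTO
    return data_loss, downtime


outage = datetime(2022, 1, 1, 12, 0, 0)
data_loss, downtime = recovery_metrics(
    last_replicated=outage - timedelta(seconds=30),
    outage_start=outage,
    service_restored=outage + timedelta(minutes=5),
)

# Example targets (assumed): lose at most 1 minute of data, recover within 15 minutes.
rpo_target = timedelta(minutes=1)
rto_target = timedelta(minutes=15)
meets_rpo = data_loss <= rpo_target
meets_rto = downtime <= rto_target
```

An RPO of zero therefore requires that no acknowledged write exists only in the failed site, which is why it needs synchronous replication rather than periodic copies.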
Use Cases for Hybrid and Multi-Cloud Data Lakehouses
• Disaster Recovery and High Availability: Create a disaster recovery cluster, and fail over to it during an outage.
• Global and Multi-Cloud Replication: Move and aggregate data across regions and clouds.
• Data Sharing: Share data with other teams, lines-of-business, or organizations.
• Data Migration: Migrate data and workloads from one cluster to another (for example, from a legacy on-premises data warehouse to a cloud-native data lakehouse).
Data Replication at Rest or in Motion?
Copyright 2021, Confluent, Inc. All rights reserved. This document may not be reproduced in any manner without the express written permission of Confluent, Inc.
Global Data Lakehouse across Edge and Hybrid Cloud
• Streaming Replication between Kafka Clusters
• Bridge to Databases, Data Lakes, Apps, APIs, SaaS
• Aggregation of Edge Deployments with Replication (Aggregation)
• Disaster Recovery Operations with Multi-Region Clusters for RPO=0 and RTO~0
• Global Data Streaming with Replication and Cluster Linking
A data mesh for decentralized data products
Data Product (for instance: a KSQL microservice)
Independent Data Products for Reporting, Analytics, Data Streaming
Kai Waehner
Field CTO
kai.waehner@confluent.io
@KaiWaehner
confluent.io
kai-waehner.de
linkedin.com/in/kaiwaehner
Questions? Feedback?
Let’s connect!
