Copyright © SAS Institute Inc. All rights reserved.
Zero Down Time Move from Apache Kafka to Confluent using the Confluent Operator and Cluster Linking
Justin Dempsey
Senior Software Development Manager, SAS Cloud
justin.dempsey@sas.com
ABOUT US
Customer Intelligence CI360
360 View: CI360 merges digital & traditional data sources for a "360" view, promoting a complete picture of your customer
"What-if" Creative Content Analysis: A/B testing and other features let users evaluate the effectiveness of their marketing campaigns and creatives
SaaS Microservices Architecture: Cloud-native software-as-a-service solution utilizing modern architecture principles to ensure a secure and performant approach to managing your marketing campaigns
Zero Down Time Move From Apache Kafka to Confluent With Justin Dempsey | Current 2022
CI360 KAFKA USAGE (2020)
STRENGTHS
o Processing Trillion+ Events (2020) w/o disruption
o Globally Distributed Presence
WEAKNESSES, OPPORTUNITIES, THREATS
o Monolithic VM backed Kafka - Hardware and OS
o Scaling options & Adaptability to Seasonal Demands
o Cross Cloud Future State
o Doubling YoY Kafka event Volume - Additional Regions
o EOS/EOL OS Concerns
o Inability to add capacity quickly
What Problem are we trying to Solve?
"Scalability issues with self-supporting open source Kafka in-house were making it difficult to adjust to changing demand, and operational overhead and complexity were driving significant costs..."
- https://www.confluent.io/customers/sas/
We had to evolve and adapt...
Objectives
Objective 1: Address aging OS, hardware, and supportive software EOS/EOL
Objective 2: Lower Total Cost of Ownership (TCO)
Objective 3: Reduce complexity for adding capacity
Objective 4: Minimize SLA support risk(s)
Criteria For Success
1. All migrations complete prior to Calendar Q4/2021
2. No data loss
3. No SLA/SLO violations during migration
4. ISO certification requirements for Kafka complete
5. Reduction in TCO
3 Options
Confluent
Confluent provides services and technologies that are well suited for the CI360 SaaS solution and that will help the product grow and evolve as CI360 expands into additional data centers and cloud providers.
vs CLOUD SELF-MANAGED
Kafka on K8S
Proposed Architecture
Migration Problem Statement
To move from the original Kafka cluster on VMs to Kafka on Kubernetes, we needed to address three primary Kafka usage patterns:
1. Raw Data Ingest
2. Transformational Topics
3. Agent / External
Option 1: MirrorMaker 1 w/ no Kafka Upgrade
Flow Details
o Set up a new cluster with Kafka 2.7 in Kubernetes
o Make a list of all custom configs for all topics present
o Start MM1
o Validate all topics and update the custom configs
o Migrate consumers
o Migrate producers
o Stop MM1
PROS:
o No upgrade to 2.7 for existing clusters
CONS:
o Topic configs are not copied
o Offsets are not preserved from source to destination
o Topic partitions are not preserved ~ No guarantee of order across partitions.
o No UI/Portal integration - status monitoring brittle
[Diagram: Producer, Broker, Consumer on Kafka 2.3 (Original) and on Kafka 2.7 (Future) in SAS Cloud, connected by MirrorMaker1]
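Because Option 1 does not copy topic configs, the "make a list, then validate and update" steps amount to a diff between the two clusters. A minimal sketch of that diff follows, assuming the per-topic overrides have already been exported into plain dictionaries; the function name, topic names, and values are hypothetical and do not come from the talk.

```python
from typing import Dict

# topic name -> {config key: value}; how these are exported is left out on purpose
TopicConfigs = Dict[str, Dict[str, str]]


def diff_topic_configs(source: TopicConfigs, destination: TopicConfigs) -> TopicConfigs:
    """Return, per topic, the source overrides that the destination lacks or sets differently."""
    to_apply: TopicConfigs = {}
    for topic, overrides in source.items():
        dest_overrides = destination.get(topic, {})
        delta = {key: value for key, value in overrides.items()
                 if dest_overrides.get(key) != value}
        if delta:
            to_apply[topic] = delta
    return to_apply


# Hypothetical topics and values, for illustration only.
source = {
    "ingest.raw": {"retention.ms": "604800000", "cleanup.policy": "delete"},
    "agent.events": {"max.message.bytes": "2097152"},
}
destination = {
    "ingest.raw": {"retention.ms": "604800000"},
}
to_apply = diff_topic_configs(source, destination)
```

Here `to_apply` holds only the overrides that still need to be set on the new cluster, which keeps the validation step reviewable before consumers are moved.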
Option 2: MirrorMaker 2 w/ Upgrade (Minimum 2.4)
Flow Details
o Upgrade Kafka in existing clusters to 2.5.x
o Set up new cluster with minimum Kafka 2.7 in Kubernetes
o Start MM2
o Migrate consumers
o Migrate producers
o Stop MM2
PROS:
o Offset Translation
o Consumer Group Checkpoints
o Cross Cluster Replication metrics
o High level driver and configuration file for defining global topology
o Partition synchronization
o Configuration synchronized
CONS:
o Prerequisite step of upgrading existing clusters to a minimum of 2.4.x
o Uses Connect Converters
o Not byte-level
o No UI/Portal integration - status monitoring brittle
[Diagram: Producer, Broker, Consumer on the Original cluster (upgraded to Kafka 2.5.x) and on Kafka 2.7 (Future) in SAS Cloud, connected by MirrorMaker2]
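Offset translation matters because a record copied through Connect generally lands at a different offset on the destination. The sketch below illustrates the general idea only: it assumes a list of known (source offset, destination offset) sync points for one partition and maps a committed source offset forward from the nearest earlier sync point. It is a simplified illustration, not MirrorMaker 2's actual implementation, and all names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

OffsetSync = Tuple[int, int]  # (source offset, matching destination offset)


def translate_offset(syncs: List[OffsetSync], committed: int) -> Optional[int]:
    """Map a committed source offset to a destination offset via the nearest earlier sync point."""
    ordered = sorted(syncs)
    source_offsets = [src for src, _ in ordered]
    idx = bisect_right(source_offsets, committed) - 1
    if idx < 0:
        return None  # no sync point at or before the committed offset
    src, dst = ordered[idx]
    # Assumes records after the sync point were copied one-for-one.
    return dst + (committed - src)


# Illustrative sync points for a single partition.
syncs = [(100, 40), (200, 140)]
resume_at = translate_offset(syncs, 250)
```

A consumer group that committed offset 250 on the source would resume from `resume_at` on the destination; a `None` result means the group has no safe starting point yet.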
Option 3: Confluent Replicator w/Upgrade (Minimum 2.4)
Flow Details
o Set up new cluster with minimum Kafka 2.7 in Kubernetes
o Make a list of all configs for all topics present
o Start Confluent Replicator
o Migrate consumers
o Migrate producers
o Stop Confluent Replicator
PROS:
o Same as Replicator
o Multi-region replication and hybrid cloud replication
o Handles topic configuration and data
o Integrates with Kafka Connect and Confluent Control Center
CONS:
o Source must be (minimum) Confluent Platform 5.4 or Kafka 2.4 and destination Confluent 6.0 (minimum)
o Must upgrade Original to minimum of 2.4
o Uses Connect Converters
o Not byte-level
[Diagram: Producer, Broker, Consumer on the Original cluster (upgraded to Kafka 2.5.x) and on Kafka 2.7 (Future) in SAS Cloud, connected by Confluent Replicator]
Option 4: Confluent Cluster Linking w/Upgrade (Minimum 2.7)
Flow Details
o Set up new cluster with minimum Kafka 2.7 in Kubernetes
o Create Cluster Link(s)
o Create source (if relevant) and mirror topic(s)
o Consume from the mirror topic(s) ("topic-to-link")
o Let replication lag and consumer offset lag drain to zero
o Migrate Consumers
o Migrate Producers
o Break the link(s)
PROS:
o Offset Translation
o Consumer Group Checkpoints
o Cross Cluster Replication metrics
o High level driver and configuration file for defining global topology
o Partition synchronization
o Configuration synchronized
o Monitoring the replication process via Control Center
o Byte Level Replication
o Connect/VM/Custom Tooling not needed
CONS:
o Prerequisite step of upgrading existing clusters to a minimum of 2.7
o Cluster Linking (as of April 2021 - Pre-General Availability)
[Diagram: Producer, Broker, Consumer on the Original cluster (upgraded to Kafka 2.7.x) and on Kafka 2.7 (Future) in SAS Cloud, connected by Cluster Linking]
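The flow above waits for replication to catch up before the link is broken. A rough sketch of that gate follows, assuming per-partition end offsets for the source topics and their mirrors have been collected by whatever tooling is available. Since the replication is byte-level, the sketch treats the two sets of end offsets as directly comparable; the function names and sample data are hypothetical.

```python
from typing import Dict, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def mirror_lag(source_end: Dict[Partition, int],
               mirror_end: Dict[Partition, int]) -> Dict[Partition, int]:
    """Per-partition count of records not yet replicated to the mirror topic."""
    return {tp: max(end - mirror_end.get(tp, 0), 0)
            for tp, end in source_end.items()}


def ready_to_break_link(source_end: Dict[Partition, int],
                        mirror_end: Dict[Partition, int]) -> bool:
    """True only when every source partition is fully replicated."""
    return all(lag == 0 for lag in mirror_lag(source_end, mirror_end).values())


# Illustrative snapshot: partition 1 is still 50 records behind.
source_end = {("ingest.raw", 0): 1200, ("ingest.raw", 1): 900}
mirror_end = {("ingest.raw", 0): 1200, ("ingest.raw", 1): 850}
lag = mirror_lag(source_end, mirror_end)
ok = ready_to_break_link(source_end, mirror_end)
```

Treating a missing mirror partition as offset 0 is a deliberately pessimistic choice, so a topic that was never mirrored blocks the cutover instead of passing silently.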
Migration Key Technical Objectives
o Utilize Cluster Linking for Kafka Migration
o Utilize Confluent Operator for Migration and Ongoing Systems Management
Challenges
1. Cluster Linking usability maturity
a) Much orchestration needed to efficiently do parallel migrations/cutovers
b) No UI/UX facet(s)
c) Breaking the mirror is a little "klunky"
d) No cluster-level topic mirror configuration. Creating the links was a little tedious; an abstraction to create multiple links at one time would have been nice (it wasn’t overly mature in early 2021, so we had to write a simple loop abstraction and loop through a dictionary of topic names)
e) Not integrated with Control Center
f) Active/Active configuration not practical (i.e., bidirectional or “two-way” with the same topic names)
2. Consumer offset mirroring is disconnected from the main mirroring mechanism for topics (a little brittle)
3. Fail-back options were very limited
4. Operator 2.0 (GA as of 5/11): we were working on pre-GA
a) Missing components
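The "simple loop abstraction" over a dictionary of topic names could look roughly like the sketch below. The talk does not show the team's actual code, so this is only an assumed shape: `create_mirror` is a hypothetical hook standing in for whichever CLI or admin call creates a mirror topic in your Confluent Platform version, and the topic and link names are made up.

```python
from typing import Callable, Dict, List, Tuple


def create_mirror_topics(
    link_by_topic: Dict[str, str],
    create_mirror: Callable[[str, str], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Create one mirror topic per entry; collect failures instead of stopping at the first one."""
    created: List[str] = []
    failed: Dict[str, str] = {}
    for topic, link in sorted(link_by_topic.items()):
        try:
            create_mirror(topic, link)  # hypothetical hook: wrap the real call here
            created.append(topic)
        except Exception as exc:  # keep going so one bad topic does not block the batch
            failed[topic] = str(exc)
    return created, failed


# Dry run with a stand-in hook that rejects one topic.
calls = []


def fake_create(topic: str, link: str) -> None:
    if topic == "agent.events":
        raise RuntimeError("topic already exists")
    calls.append((topic, link))


created, failed = create_mirror_topics(
    {"ingest.raw": "vm-to-k8s", "agent.events": "vm-to-k8s", "transform.sessions": "vm-to-k8s"},
    fake_create,
)
```

Passing the creation call in as a function keeps the loop testable without a cluster and lets the failed topics be retried on their own.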
Migration Steps
Timeline: Q4 2021, 3 days prior to Migration, Day of Migration, Migration Complete
o Upgrade Original Cluster to 2.7
o Allow all Consumer Groups to Drain
o Divert Ingest Topics to "Hold"
o 3 days prior to cutover
o Migrate Schemas
o No New Schema Updates
o Shutdown Consumer Facing APIs
o Setup Cluster Linking
o 1 Day for Replication
o Monitor Lag
o Break Cluster Linking & Migrate Ingest Topics - <30 minutes
o Shutdown Consumer and non-critical Producer Services
o Service Configuration override for handful of Ingest Consumers
o Update Global Settings for Apps to use New Cluster
o Rolling Restart of all Services
o Bounce Config Service
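Allowing all consumer groups to drain can be checked mechanically before the cutover proceeds. The sketch below assumes committed offsets per group and partition end offsets have already been gathered; it reports which groups still have unread records. The group names, topic names, and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def groups_still_draining(
    committed: Dict[str, Dict[Partition, int]],
    end_offsets: Dict[Partition, int],
) -> List[str]:
    """Return consumer groups that still have unread records on any partition they track."""
    pending = []
    for group, offsets in committed.items():
        if any(end_offsets.get(tp, 0) > offset for tp, offset in offsets.items()):
            pending.append(group)
    return sorted(pending)


# Illustrative snapshot: "archiver" is 20 records behind on partition 1.
end_offsets = {("ingest.raw", 0): 500, ("ingest.raw", 1): 420}
committed = {
    "enrichment": {("ingest.raw", 0): 500, ("ingest.raw", 1): 420},
    "archiver": {("ingest.raw", 0): 500, ("ingest.raw", 1): 400},
}
pending = groups_still_draining(committed, end_offsets)
```

An empty result is the signal that producers are quiet and every group has caught up, which is the precondition for moving consumers to the new cluster.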
Review Objectives
o Address aging OS, hardware, and supportive software EOS/EOL
o Reduce complexity for adding capacity
o Lower Total Cost of Ownership (TCO)
o Minimize SLA support risk(s)
Calendar: All migrations were completed by October 2021, addressing hardware & licensing concerns going into Q4
Certification: ISO certification requirements were met going into October 2021
TCO: Projected cost reduction of 69%
SLA/SLO met: No data loss during the migration, and customers were able to produce messages at all times, supporting "minimal disruption"
Application Workload Requirements
Total Cost Of Ownership
Direct Costs, Indirect Costs, Hidden Costs
o Support Staff
o Downtime
o Addl Network Mgmt
o Infrastructure Costs
o Software Purchase Costs
o Administration Costs
o Upgrades
o Maintenance
o Tech Support
o Training
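The arithmetic behind a TCO comparison is simple: sum the direct, indirect, and hidden costs for each scenario, then compare the totals. The sketch below shows that calculation under loud assumptions: every amount is a placeholder in arbitrary units chosen so the result lands on the 69% the talk reports, none of them are SAS figures, and which line item belongs in which bucket is also an assumption.

```python
from typing import Dict


def total_cost_of_ownership(direct: Dict[str, int],
                            indirect: Dict[str, int],
                            hidden: Dict[str, int]) -> int:
    """Sum all three cost buckets into one TCO figure."""
    return sum(direct.values()) + sum(indirect.values()) + sum(hidden.values())


def reduction_percent(before: int, after: int) -> float:
    """Percentage by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100 * (before - after) / before


# Placeholder amounts in arbitrary units; not figures from the talk.
before = total_cost_of_ownership(
    direct={"infrastructure": 400, "software": 100},
    indirect={"administration": 300},
    hidden={"downtime": 200},
)
after = total_cost_of_ownership(
    direct={"infrastructure": 150, "software": 90},
    indirect={"administration": 50},
    hidden={"downtime": 20},
)
pct = reduction_percent(before, after)
```

Keeping the buckets separate until the final sum makes it easy to see which category drives the reduction when real numbers are plugged in.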
TCO Consideration Factors
Consideration: Details
o Infrastructure Costs: Hardware, network, egress, & other factors
o Timeline: Hard dates for completion - October 1, 2021 - Confluent Operator GA, EOL/EOS software dates, ISO certification
o Technologies: Confluent Cloud, Confluent Platform, Apache Kafka, Strimzi Operator
o Support Services: Confluent support tiers, existing SAS Cloud support teams ~ SMEs for Kubernetes and Kafka
o Licensing and Software Costs: Confluent Cloud & Confluent Platform Pricing Structures
o SAS Procurement & Compliance Process(es) & Controls: Onboarding a new cloud PaaS solution could take months to implement. Relationship with Confluent for on-premises licensing had already begun going into 2021.
1 YEAR LATER
