Flink Forward 2024 ©
Exploring Scenarios of Flink CDC
in Streaming Data Integration
Leonard Xu @ Alibaba Cloud
Flink PMC Member & Committer, Flink CDC Lead

01 Flink CDC Overview
02 Why Flink CDC YAML
03 CDC YAML Internals
04 Community and Future Plan

Flink CDC Overview

What is Flink CDC
Flink CDC is a streaming data integration tool that implements unified snapshot reading and
incremental reading based on the CDC (Change Data Capture) technology of database logs.
Combined with Flink's excellent pipeline capabilities and rich upstream and downstream
ecosystem, Flink CDC can efficiently achieve real-time integration of massive data.
[Figure: Flink CDC, with labels Data Snapshot, Change Log, and Real-time Snapshot]

Usages of Flink CDC
● Data Synchronization
● Real-time Materialized View
● Data Distribution
● Data Integration

Traditional CDC Pipeline
● Snapshot Sync: DB → DataX / Sqoop → Snapshot Table
● Changelog Sync: DB → Debezium / Canal → Incremental Table
● Merge: Snapshot Table + Incremental Table → Merged Table
😔 Data INFRAs  😔 Data Consistency  😔 Data Freshness  😔 Data Stack
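The Merge step of the traditional pipeline can be sketched as follows. This is only an illustration under stated assumptions, not the behaviour of DataX, Sqoop, Debezium, or Canal: the row layout, the op codes (`+I`, `U`, `-D`), and the function name `merge_snapshot_with_changelog` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
# A change is (op, row). The op codes "+I", "U", "-D" are hypothetical.
Change = Tuple[str, Row]


def merge_snapshot_with_changelog(
    snapshot: List[Row], changelog: List[Change], key: str = "id"
) -> List[Row]:
    """Apply an ordered changelog on top of a snapshot, keyed by primary key."""
    merged: Dict[object, Row] = {row[key]: dict(row) for row in snapshot}
    for op, row in changelog:
        if op in ("+I", "U"):
            merged[row[key]] = dict(row)  # upsert the latest image of the row
        elif op == "-D":
            merged.pop(row[key], None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {op}")
    return sorted(merged.values(), key=lambda r: r[key])


snapshot_table = [{"id": 1, "user": "Leonard"}, {"id": 2, "user": "Aki"}]
incremental_table = [
    ("U", {"id": 1, "user": "Leo"}),
    ("-D", {"id": 2, "user": "Aki"}),
    ("+I", {"id": 3, "user": "Bob"}),
]
merged_table = merge_snapshot_with_changelog(snapshot_table, incremental_table)
print(merged_table)
```

In this sketch the merged table is only as fresh as the last merge run, which is one way to read the Data Freshness pain point on the slide.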

With Flink CDC
● Replaced: Canal / Debezium (Changelog Sync), DataX / Sqoop (Snapshot Sync), and the Merge of Snapshot Table and Incremental Table into a Merged Table
● One Flink job: CDC Source → Custom Logics … → Sink (Real-time Snapshot)
😄 Unified Sync  😄 Exactly-Once  😄 Low Latency  😄 One Flink Job

Our Milestones
● 2020 / 07, Kick Off: First version of MySQL CDC and Postgres CDC Connector
● 2021 / 08, Release 2.0: MySQL CDC implements Incremental Snapshot Algorithm
● 2022 / 11, Release 2.3: Incremental Snapshot Framework, covers key connectors
● 2023 / 10, Release 3.0: YAML API, end-to-end streaming data integration tool
● 2024 / 01, Donated to ASF
● 2024 / 09, Release 3.2: Supports transform (Projection, Filter, UDF) in YAML API

Flink CDC 1.x: CDC Source Connector
● Extraction (E): MySQL CDC Source (Debezium)
● Transform (T): Flink
● Load (L): Flink
[Figure: system logos TiDB, ClickHouse, Iceberg, Hudi, Paimon]

Flink CDC 1.x: CDC Source Connector
1) Lock table for data consistency
2) Scan Snapshot Data of table (JDBC connection)
3) Release table lock after scan
4) Append Changelog of table (Binlog connection)

Flink CDC 2.0: Incremental Snapshot Algorithm
● No-Lock Algorithm
● Parallel Snapshot Scan: the table is read as chunks (chunk1, chunk2, chunk3) over several JDBC connections
● Changelog dump: read over the binlog connection
● Auto Switch between the two phases
● Tasks shown: Task1, Task2, Task3, Task4
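The chunked, parallel snapshot scan can be sketched as below. This is a simplified illustration, not the Flink CDC implementation: it assumes an integer primary key, a fixed chunk size, and an in-memory list standing in for the JDBC reads, and it leaves out how each chunk is reconciled with the changelog. The names `split_into_chunks` and `read_chunk` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Dict[str, int]


def split_into_chunks(min_key: int, max_key: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split the closed key range [min_key, max_key] into half-open chunks [start, end)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = min_key
    while start <= max_key:
        end = min(start + chunk_size, max_key + 1)
        chunks.append((start, end))
        start = end
    return chunks


def read_chunk(table: List[Row], chunk: Tuple[int, int]) -> List[Row]:
    """Stand-in for a range query such as: WHERE id >= start AND id < end."""
    start, end = chunk
    return [row for row in table if start <= row["id"] < end]


table = [{"id": i, "price": i * 10} for i in range(1, 11)]  # ids 1..10
chunks = split_into_chunks(min_key=1, max_key=10, chunk_size=4)

# Each worker plays the role of one snapshot task reading one chunk.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda c: read_chunk(table, c), chunks))

snapshot = [row for part in results for row in part]
print(chunks)
print(len(snapshot))
```

Because the chunks are disjoint key ranges, no worker needs a table lock to avoid reading the same row twice in this sketch.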

Flink CDC 2.x: Incremental Snapshot Framework
● Same labels as the 2.0 algorithm: No-Lock Algorithm, Parallel Snapshot Scan over chunks, Auto Switch, Changelog dump
● Sources shown include ApsaraDB MySQL; more sources are on the way

Flink CDC 2.x Recap
Integrity
● Only a source, not an entire data pipeline
● A SQL job’s schema is fixed; each table needs an operator
Maintainability
● What if the upstream schema changes?
● What if upstream tables are added?
Scalability
● Sync thousands of tables at once
● Sync an entire DB in one pipeline
Copyright
● Apache V2 License
● Project belongs to Ververica

Flink CDC 3.0 Motivation
End to End
● End-to-end data pipeline
● YAML API, simple but powerful
Automation
● Automated schema evolution
● Automated capture of newly added tables
Flexible
● Full DB sync via regular expression
● Flush to multiple tables in one sink
Donation
● Donate Flink CDC to Apache Flink
● Project belongs to ASF

Flink CDC 3.0: End-to-end streaming data integration tool
[Figure: Flink CDC with Database, Data Lake, Data Warehouse, Analytics / BI, and AI / ML]

Flink CDC 3.0: End-to-end streaming data integration tool
$> flink-cdc.sh mysql-to-doris.yaml
[Figure: contents of mysql-to-doris.yaml]
Simple but Powerful!

Flink CDC 3.0: Donate to Apache Flink

Why Flink CDC YAML

YAML API: Designed for Data Integration Users
● Users don’t care how it works
● Users care what they need

Write Pipeline via YAML API
YAML Language
● Easy for users to write
● Friendly for machine transmission
Focus on Data Integration Scenarios
● Just specify sync source and destination
● Routing and transformations
● No PhD in Flink needed

Flink CDC APIs
● Flink SQL API: SELECT, WHERE, JOIN, Top-N, GROUP BY, INSERT
● Flink DataStream API: map, filter, join, keyBy, flatMap, aggregate
● CDC YAML API: Schema Evolution, Schema Sync, Full DB Sync, SELECT, Filter, UDF
[Figure: system logos TiDB, Hologres, ClickHouse, Iceberg, Hudi, ApsaraDB MySQL, …]

Flink CDC APIs
● APIs: SQL API, DataStream API, YAML API
● Connectors: CDC Sources, YAML Sinks
● Capture libraries: Debezium (MySQL, PG, Oracle, …), CDC client (TiDB, MongoDB, …)
● Runtimes: Flink CDC Runtime, Flink Runtime

SQL Pipeline vs YAML Pipeline
● Flink SQL Pipeline: RowData (Insert, Update Before, Update After, Delete)
● CDC YAML Pipeline:
  DataChangeEvent (Insert, Update with Before and After, Delete)
  SchemaChangeEvent (CreateTableEvent, AddColumnEvent, TruncateTableEvent, …)
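To make the YAML pipeline's event model concrete, here is a small sketch in which schema events and data events travel in one stream, and data records are decoded against the latest schema. The Python dataclasses only borrow the event names from the slide; their fields, the `replay` helper, and the sample table are assumptions and do not mirror the Flink CDC Java classes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class CreateTableEvent:
    table_id: str
    columns: Tuple[Tuple[str, str], ...]  # (name, type) pairs


@dataclass(frozen=True)
class AddColumnEvent:
    table_id: str
    column: Tuple[str, str]


@dataclass(frozen=True)
class DataChangeEvent:
    table_id: str
    op: str  # "insert", "update" or "delete"
    before: Optional[Tuple[object, ...]]  # values only, no column names
    after: Optional[Tuple[object, ...]]


Event = Union[CreateTableEvent, AddColumnEvent, DataChangeEvent]


def replay(events: List[Event]) -> List[Tuple[str, Optional[dict], Optional[dict]]]:
    """Decode schemaless data events using the schema known at that point in the stream."""
    schemas: Dict[str, List[str]] = {}
    decoded = []
    for event in events:
        if isinstance(event, CreateTableEvent):
            schemas[event.table_id] = [name for name, _type in event.columns]
        elif isinstance(event, AddColumnEvent):
            schemas[event.table_id].append(event.column[0])
        else:
            names = schemas[event.table_id]
            before = dict(zip(names, event.before)) if event.before is not None else None
            after = dict(zip(names, event.after)) if event.after is not None else None
            decoded.append((event.op, before, after))
    return decoded


events: List[Event] = [
    CreateTableEvent("app_db.orders", (("id", "INT"), ("price", "INT"))),
    DataChangeEvent("app_db.orders", "insert", None, (1, 4)),
    AddColumnEvent("app_db.orders", ("amount", "INT")),
    DataChangeEvent("app_db.orders", "insert", None, (3, 23, 234)),
    DataChangeEvent("app_db.orders", "update", (3, 23, 234), (3, 25, 234)),
]
decoded = replay(events)
for record in decoded:
    print(record)
```

Note that the update is a single event carrying both images, in contrast to the separate Update Before and Update After rows listed for the SQL pipeline.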

SQL Pipeline vs YAML Pipeline
Flink SQL Pipeline
😔 CREATE TABLE and INSERT INTO statements are written manually
😔 Cannot process schema changes
😔 Breaks the original changelog Update
😔 Reads/writes a single table per TableSource/TableSink operator
CDC YAML Pipeline
😊 Schema discovery, Full DB synchronization
😊 Schema Evolution with multiple strategies
😊 Synchronizes the original changelog
😊 Reads/writes multiple tables in one Source/Sink

DataStream Pipeline vs YAML Pipeline
● Flink DataStream Pipeline: StreamRecord, with the column header shown on every record
  op   (user, STRING)   (id, INT)
  +I   Leonard          1
  U    Leonard          1 3
  -D   Leonard          3
● CDC YAML Pipeline: CreateTableEvent and SchemaChangeEvent carry the schema (user, STRING) (id, INT); DataChangeEvent is BinaryData + Schemaless
  +I   Leonard   1
  U    Leonard   1 3
  -D   Leonard   3

DataStream Pipeline vs YAML Pipeline
Flink DataStream Pipeline
😔 Java expert required, distributed system programming
😔 Flink expert required: DataStream API, State, Checkpoint, Runtime
😔 Skills for Maven and dependency management
😔 Hard to reuse even if you have implemented one
CDC YAML Pipeline
😊 Designed for all users instead of experts
😊 Build powerful pipelines via YAML, underlying details are hidden
😊 YAML is easy to understand and learn
😊 Easy to create a new pipeline with Ctrl + C/V

CDC YAML Internals

YAML Overall Design
● Flink CDC API: YAML
● Flink CDC CLI
● Flink CDC Composer
● Features: Streaming Pipeline, Batch Pipeline, Change Data Capture, Schema Inference, Schema Evolution, Full DB Sync, Sharded Table Sync
● Flink CDC Runtime: DataSource Operator, DataSink Operator, Schema Registry, Router, Transformer
● Flink CDC Connect: MySQL Source, Doris Sink, StarRocks Sink, …
● Flink Runtime: YARN, Kubernetes, Standalone

YAML Connector API
● DataSource
  MetadataAccessor
  EventSourceProvider: FlinkSourceProvider, FlinkSourceFunctionProvider
● DataSink
  EventSinkProvider: FlinkSinkProvider, FlinkSinkFunctionProvider
  MetadataApplier

YAML Key Feature: Schema Evolution
Components: DataSource, Schema Operator, Post Partitioner, DataSink, SchemaRegistry, MetadataApplier
Events: SchemaChangeEvent, DataChangeEvent, FlushEvent
SchemaRegistry states shown: IDLE, WAITING, APPLYING, FINISHED
1) Schema operator receives a SchemaChangeEvent
2) Schema operator registers the schema change, then waits for the response (holding upstream). It blocks if SchemaRegistry is busy; other schema operators must wait until other schema changes are completed
3) Schema registry accepts the schema change and rejects following requests
4) Schema operator broadcasts a FlushEvent, then requests the registry again to wait for the flush to complete
5) Sink notifies that the flush is complete
6) Schema registry applies the schema change
7) Schema registry confirms that schema evolution is complete and is ready for the next request
8) Schema operator releases upstream
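The coordination between the schema operator, the sinks, and the schema registry can be sketched as a small state machine. This is a single-threaded illustration, not the Flink CDC code: the state names come from the slide, but the mapping of states to steps, the class layout, and the method names (`request_schema_change`, `notify_flush_complete`, `collect_result`) are assumptions.

```python
from enum import Enum
from typing import Callable, List, Optional, Set


class RegistryState(Enum):
    IDLE = "IDLE"          # ready for a request
    WAITING = "WAITING"    # accepted a change, waiting for sinks to flush
    APPLYING = "APPLYING"  # applying the change downstream
    FINISHED = "FINISHED"  # applied, waiting for the operator to collect the result


class SchemaRegistry:
    def __init__(self, sink_ids: Set[int], apply_change: Callable[[str], None]) -> None:
        self.state = RegistryState.IDLE
        self._sink_ids = set(sink_ids)
        self._flushed: Set[int] = set()
        self._pending: Optional[str] = None
        self._apply_change = apply_change

    def request_schema_change(self, change: str) -> bool:
        """Steps 2 and 3: accept one change at a time, reject requests while busy."""
        if self.state is not RegistryState.IDLE:
            return False
        self._pending = change
        self._flushed.clear()
        self.state = RegistryState.WAITING
        return True

    def notify_flush_complete(self, sink_id: int) -> None:
        """Steps 5 and 6: once every sink has flushed, apply the pending change."""
        if self.state is not RegistryState.WAITING:
            raise RuntimeError("no schema change is waiting for a flush")
        self._flushed.add(sink_id)
        if self._flushed == self._sink_ids:
            self.state = RegistryState.APPLYING
            self._apply_change(self._pending)
            self.state = RegistryState.FINISHED

    def collect_result(self) -> bool:
        """Steps 7 and 8: confirm completion and become ready for the next request."""
        if self.state is not RegistryState.FINISHED:
            return False
        self._pending = None
        self.state = RegistryState.IDLE
        return True


applied: List[str] = []
registry = SchemaRegistry(sink_ids={0, 1}, apply_change=applied.append)

first_accepted = registry.request_schema_change("ADD COLUMN amount INT")
second_accepted = registry.request_schema_change("DROP COLUMN price")  # registry is busy
registry.notify_flush_complete(0)
state_after_first_flush = registry.state
registry.notify_flush_complete(1)
finished = registry.collect_result()
print(first_accepted, second_accepted, state_after_first_flush.value, applied, finished)
```

Waiting for every sink to flush before applying the change means that, in this sketch, no record written with the old schema is still in flight when the downstream table is altered.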

YAML Key Feature: Fine-grained Schema Evolution
● Lenient Mode (Default): keeps data integrity and provides recoverability
● Ignore: ignores any schema changes
● Try Evolve: tries to apply schema changes, casts data records if that fails
● Evolve: applies schema changes, terminates the job if that fails
● Exception: rejects any schema changes, terminates the job once one occurs
SchemaRegistry
● Table Upstream Schema: db.table1 {id INT, …}
● Table Evolved Schema: db.table1 {id INT, …}
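The difference between the modes shows up when applying a schema change fails. The sketch below models four of the five modes as described on the slide; Lenient mode is left out because the slide does not describe its mechanics. The enum, the function names, and the returned labels are hypothetical and are not Flink CDC configuration values.

```python
from enum import Enum
from typing import Callable


class SchemaChangeBehavior(Enum):
    IGNORE = "ignore"
    TRY_EVOLVE = "try_evolve"
    EVOLVE = "evolve"
    EXCEPTION = "exception"


def handle_schema_change(behavior: SchemaChangeBehavior, apply_change: Callable[[], None]) -> str:
    """Return what happened to the schema change; an exception models job termination."""
    if behavior is SchemaChangeBehavior.IGNORE:
        return "ignored"
    if behavior is SchemaChangeBehavior.EXCEPTION:
        raise RuntimeError("schema changes are not allowed")
    try:
        apply_change()
    except Exception:
        if behavior is SchemaChangeBehavior.EVOLVE:
            raise  # Evolve: a failed change terminates the job
        return "cast-records"  # Try Evolve: keep running and cast records instead
    return "applied"


def failing_change() -> None:
    raise IOError("sink rejected the schema change")


def outcome(behavior: SchemaChangeBehavior, change: Callable[[], None]) -> str:
    try:
        return handle_schema_change(behavior, change)
    except Exception:
        return "job-terminated"


results = {b.value: outcome(b, failing_change) for b in SchemaChangeBehavior}
print(results)
```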

YAML Key Feature: Schema Evolution
$> flink-cdc.sh mysql-to-doris.yaml
[Figure: contents of mysql-to-doris.yaml]
Both app_db databases shown on the slide hold the same three tables:

products
id  product
1   Beer
2   Cap
3   Peanut

orders
id  price  amount
1   4      null
2   100    null
3   23     234

shipments
id  product
1   Beer
2   Cap
3   Peanut

Simple but Powerful!

YAML Key Feature: Table Routing
Components: DataSource, Schema Operator, Router, DataSink, SchemaRegistry, MetadataApplier
Tables mydb.orders_1, mydb.orders_2, and mydb.orders_3 are routed to ods.orders
Routing Usages:
● Change destination database / table name
● Merge tables with a multi-to-one rule
● Route with a custom rule written as a regular expression
Example: mydb.orders_.* → ods.orders
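A multi-to-one routing rule like the example above can be sketched with an ordinary regular expression match. This is an illustration rather than the Flink CDC router: the rule format (a regex paired with a sink table), the first-match-wins policy, and the fallback of keeping the original name when no rule matches are all assumptions.

```python
import re
from typing import List, Tuple

RouteRule = Tuple[str, str]  # (source-table regex, sink table)


def route(table_id: str, rules: List[RouteRule]) -> str:
    """Return the sink table of the first rule whose regex matches the whole table id."""
    for pattern, sink_table in rules:
        if re.fullmatch(pattern, table_id):
            return sink_table
    return table_id  # no rule matched: keep the original name (assumption)


# The dot between database and table is escaped so it matches literally.
rules = [(r"mydb\.orders_.*", "ods.orders")]
sources = ["mydb.orders_1", "mydb.orders_2", "mydb.orders_3", "mydb.users"]
routed = {table: route(table, rules) for table in sources}
print(routed)
```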

YAML Key Feature: Table Routing
$> flink-cdc.sh mysql-to-doris-route.yaml

my_db.user01
name   age
Aki    34
Alice  30

my_db.user02
name   age
Bob    13
Ben    32

my_db.user03
name   age
Paul   45
Tony   24

ods_db.users
name   age
Paul   45
Tony   24
Aki    34
Alice  30
Bob    13
Ben    32

YAML Key Feature: Transform (Projection, Filter, UDF)
Operators shown: Data Source, Schema Operator, Router, Data Sink, plus PreTransform and PostTransform
● PreTransform: trims unused columns (Original Schema → Pre-Transformed Schema)
● PostTransform: evaluates calculated columns and filters out rows (Pre-Transformed Schema → Transformed Schema)

YAML Key Feature: Transform (Projection, Filter, UDF)
$> flink-cdc.sh mysql-to-doris-filter-orders.yaml
[Figure: contents of mysql-to-doris-filter-orders.yaml]

app_db.orders
id  price  amount
1   4      1
2   100    1
3   23     5

ods_db.filtered_orders
id  price  amount
2   100    1
3   23     5
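The transform stages can be sketched on the orders rows from this example. The slide does not show the filter expression, so the predicate `price > 10` is an assumption chosen because it reproduces the filtered_orders rows; the `transform` function, the calculated column `total`, and the choice to filter after calculated columns are evaluated are likewise hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def transform(
    rows: List[Row],
    used_columns: List[str],
    calculated: Dict[str, Callable[[Row], int]],
    row_filter: Callable[[Row], bool],
) -> List[Row]:
    """Trim unused columns, evaluate calculated columns, then filter out rows."""
    result = []
    for row in rows:
        out = {name: row[name] for name in used_columns}  # PreTransform: trim
        for name, fn in calculated.items():  # PostTransform: calculated columns
            out[name] = fn(out)
        if row_filter(out):  # PostTransform: filter
            result.append(out)
    return result


orders = [
    {"id": 1, "price": 4, "amount": 1},
    {"id": 2, "price": 100, "amount": 1},
    {"id": 3, "price": 23, "amount": 5},
]

# Filter only, with the assumed predicate price > 10.
filtered_orders = transform(orders, ["id", "price", "amount"], {}, lambda r: r["price"] > 10)

# Hypothetical calculated column, filtered on its value.
large_orders = transform(
    orders,
    ["id", "price", "amount"],
    {"total": lambda r: r["price"] * r["amount"]},
    lambda r: r["total"] > 100,
)
print(filtered_orders)
print(large_orders)
```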

Community and Future Plan

Our Community
● Commits: 1000+ (1099, up 54% from 714)
● Stars: 5000+ (5504, up 28% from 4292)
● Contributors: 100+ (143, up 41% from 101)

Future Plan
Expanding More Scenarios
● Support batch pipeline
● Support AI Model in Transform
● Support more external systems, e.g. Iceberg and ClickHouse
● Support more schema change types and data types
Stability Improvement
● Configurable exception handling, rate limiting
● Compatible with more Flink versions
● Bump SDK versions like Debezium, OBLogproxy, TiCDC

Join Us In Community
As the most active sub-project of Apache Flink, we:
➢ Share the same contribution steps as Apache Flink
➢ Have an independent documentation website: https://nightlies.apache.org/flink/flink-cdc-docs-stable
➢ Discuss on the Flink mailing lists: dev@flink.apache.org / user@flink.apache.org
➢ Create issues on Apache JIRA: https://issues.apache.org/jira
➢ Submit PRs on GitHub: https://github.com/apache/flink-cdc

THANK YOU
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Chromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docxChromatography_Detailed_Information.docx
Chromatography_Detailed_Information.docx
NohaSalah45
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
i_o updated.pptx 6=₹cnjxifj,lsbd ধ and vjcjcdbgjfu n smn u cut the lb, it ও o...
ggg032019
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docxMASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
MASAkkjjkttuyrdquesjhjhjfc44dddtions.docx
santosh162
 
Introcomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptxIntrocomputerscienceand datascience.pptx
Introcomputerscienceand datascience.pptx
abdulrehmanbscsf22
 
KNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptxKNN_Logistic_Regression_Presentation_Styled.pptx
KNN_Logistic_Regression_Presentation_Styled.pptx
sonujha1980712
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptxPRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
PRE-NATAL GRnnnmnnnnmmOWTH seminar[1].pptx
JayeshTaneja4
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
Ad

Exploring Scenarios of Flink CDC in Streaming Data Integration

  • 1. Flink Forward 2024 © Exploring Scenarios of Flink CDC in Streaming Data Integration Leonard Xu @ Alibaba Cloud Flink PMC Member & Committer, Flink CDC Lead
  • 2. 01 Flink CDC Overview; 02 Why Flink CDC YAML; 03 CDC YAML Internals; 04 Community and Future Plan
  • 3. Flink CDC Overview
  • 4. What is Flink CDC: Flink CDC is a streaming data integration tool that unifies snapshot reading and incremental reading, based on CDC (Change Data Capture) over database logs. Combined with Flink's pipeline capabilities and its rich upstream and downstream ecosystem, Flink CDC can efficiently integrate massive data in real time. Diagram labels: Data Snapshot, Change Log, Flink CDC, Real-time Snapshot
  • 5. Usages of Flink CDC: Data Synchronization, Real-time Materialized View, Data Distribution, Data Integration
  • 6. Traditional CDC Pipeline: DB; DataX / Sqoop (Snapshot Sync) into a Snapshot Table; Debezium / Canal (Changelog Sync) into an Incremental Table; Merge into a Merged Table. 😔 Data INFRAs, 😔 Data Consistency, 😔 Data Freshness, 😔 Data Stack
  • 7. With Flink CDC: CDC Source, Custom Logics, …, Sink, producing a Real-time Snapshot. Compared against: Canal / Debezium (Changelog Sync), DataX / Sqoop (Snapshot Sync), Snapshot Table, Incremental Table, Merge, Merged Table. 😄 Unified Sync, 😄 Exactly-Once, 😄 Low Latency, 😄 One Flink Job
  • 8. Our Milestones 2020 / 07 Kick Off 2021 / 08 Release 2.0 First version of MySQL CDC and Postgres CDC Connector 2022 / 11 Release 2.3 MySQL CDC implements Incremental Snapshot Algorithm 2023 / 10 Release 3.0 YAML API, end-to-end streaming data integration tool 2024 / 01 Donated to ASF Supports transform (Projection, Filter, UDF) in YAML API 2024 / 09 Release 3.2 Incremental Snapshot Framework, covers key connectors
  • 9. Flink CDC 1.x: CDC Source Connector. Extraction (E): Debezium (MySQL CDC Source); Transform (T): Flink; Load (L): Flink. Systems shown: TiDB, ClickHouse, Iceberg, Hudi, Paimon
  • 10. Flink CDC 1.x: CDC Source Connector. 1) Lock table for data consistency; 2) Scan snapshot data of table (JDBC connection); 3) Release table lock after scan; 4) Append changelog of table (Binlog connection)
  • 11. Flink CDC 2.0: Incremental Snapshot Algorithm. No-Lock Algorithm; Parallel Snapshot Scan (Task1, Task2, Task3; chunk1, chunk2, chunk3; JDBC connections to the DB); Auto Switch; Changelog dump (Task4; binlog connection)
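    The parallel snapshot scan on the slide above relies on splitting a table into chunks that separate tasks can read independently. The following is only a minimal sketch of that idea, assuming an integer primary key and evenly sized ranges; `Chunk`, `split_into_chunks` and `assign_round_robin` are hypothetical names for illustration and are not Flink CDC APIs. The mechanism that keeps the scan consistent without locks, and the automatic switch to the changelog, are not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Chunk:
    """Half-open primary-key range [start, end); None means unbounded."""
    start: Optional[int]
    end: Optional[int]


def split_into_chunks(min_key: int, max_key: int, chunk_size: int) -> List[Chunk]:
    """Split the key range into consecutive chunks of roughly chunk_size keys.

    Assumption: integer keys; the first and last chunks are left unbounded so
    that rows outside [min_key, max_key] are still covered.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    boundaries = range(min_key + chunk_size, max_key + 1, chunk_size)
    chunks: List[Chunk] = []
    start: Optional[int] = None
    for boundary in boundaries:
        chunks.append(Chunk(start, boundary))
        start = boundary
    chunks.append(Chunk(start, None))
    return chunks


def assign_round_robin(chunks: List[Chunk], num_tasks: int) -> Dict[int, List[Chunk]]:
    """Distribute chunks over snapshot tasks in round-robin order."""
    assignment: Dict[int, List[Chunk]] = {task: [] for task in range(num_tasks)}
    for index, chunk in enumerate(chunks):
        assignment[index % num_tasks].append(chunk)
    return assignment


# Three chunks for three snapshot tasks, mirroring chunk1..chunk3 on the slide.
chunks = split_into_chunks(min_key=1, max_key=30, chunk_size=10)
plan = assign_round_robin(chunks, num_tasks=3)
```

    Leaving the outer chunks unbounded is a design choice of this sketch: rows inserted below the minimum or above the maximum key after the boundaries were computed still fall into some chunk.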
  • 12. Flink CDC 2.x: Incremental Snapshot Framework. No-Lock Algorithm; Parallel Snapshot Scan (Task1, Task2, Task3; chunk1, chunk2, chunk3); Auto Switch; Changelog dump (Task4); ApsaraDB MySQL. More sources are on the way
  • 13. Flink CDC 2.x Recap. Integrity: ● Only a source, not an entire data pipeline ● SQL job's schema is fixed, each table needs an operator. Maintainability: ● What if the upstream schema changes ● What if upstream tables are added. Scalability: ● Sync thousands of tables at once ● Sync an entire DB in one pipeline. Copyright: ● Apache V2 License ● Project belongs to Ververica
  • 14. Flink CDC 3.0 Motivation. End to End: ● End-to-end data pipeline ● YAML API, simple but powerful. Automation: ● Automated schema evolution ● Automated capture of newly added tables. Flexible: ● Full DB sync via regular expression ● Flush to multiple tables in one sink. Donation: ● Donate Flink CDC to Apache Flink ● Project belongs to ASF
  • 15. Flink CDC 3.0: End-to-end streaming data integration tool. Flink CDC; Database, Data Lake, Data Warehouse, Analytics / BI, AI / ML
  • 16. Flink CDC 3.0: End-to-end streaming data integration tool. $> flink-cdc.sh mysql-to-doris.yaml. Simple but Powerful!
  • 17. Flink CDC 3.0: Donate to Apache Flink
  • 18. Why Flink CDC YAML
  • 19. YAML API: Design for Data Integration Users. Users don't care how it works; users care what they need. Diagram label: Paimon
  • 20. Write Pipeline via YAML API. YAML Language: easy for users to write; friendly for machine transmission. Focus on Data Integration Scenarios: just specify the sync source and destination; routing and transformations; no PhD in Flink needed
  • 21. Flink CDC APIs. Flink SQL API: SELECT, WHERE, JOIN, Top-N, GROUP BY, INSERT. Flink DataStream API: map, filter, join, keyBy, flatMap, aggregate. CDC YAML API: Schema Evolution, Schema Sync, SELECT, Filter, Full DB Sync, UDF. Systems shown: TiDB, ApsaraDB MySQL, Hologres, ClickHouse, Iceberg, Hudi, …
  • 22. Flink CDC APIs. SQL API, DataStream API, YAML API; CDC Sources: Debezium (MySQL, PG, Oracle..), CDC client (TiDB, MongoDB…); YAML Sinks; Flink CDC Runtime; Flink Runtime
  • 23. SQL Pipeline vs YAML Pipeline. Flink SQL Pipeline: RowData (Insert, Delete, Update Before, Update After). CDC YAML Pipeline: DataChangeEvent (Insert, Delete, Update with Before / After); SchemaChangeEvent (CreateTableEvent, AddColumnEvent, TruncateTableEvent, …)
  • 24. SQL Pipeline vs YAML Pipeline. Flink SQL Pipeline: 😔 Manually write CREATE TABLE and INSERT INTO; 😔 Cannot process schema changes; 😔 Breaks the original changelog Update; 😔 Reads/writes a single table per TableSource/TableSink operator. CDC YAML Pipeline: 😊 Schema discovery, full DB synchronization; 😊 Schema evolution with multiple strategies; 😊 Original changelog synchronization; 😊 Reads/writes multiple tables in Source/Sink
  • 25. DataStream Pipeline vs YAML Pipeline. Flink DataStream Pipeline: StreamRecord, shown with columns op, (user, STRING), (id, INT) and records +I Leonard 1; U Leonard 1 3; -D Leonard 3. CDC YAML Pipeline: CreateTableEvent and SchemaChangeEvent with columns (user, STRING), (id, INT); DataChangeEvent as BinaryData + Schemaless, with records +I Leonard 1; U Leonard 1 3; -D Leonard 3
  • 26. DataStream Pipeline vs YAML Pipeline. Flink DataStream Pipeline: 😔 Java expert required, distributed system programming; 😔 Flink expert required: DataStream API, State, Checkpoint, Runtime; 😔 Skills for Maven and dependency management; 😔 Hard to reuse even if you've implemented one. CDC YAML Pipeline: 😊 Designed for all users instead of experts; 😊 Build powerful pipelines via YAML, underlying details are hidden; 😊 YAML is easy to understand and learn; 😊 Easy to create a new pipeline with Ctrl + C/V
  • 27. CDC YAML Internals
  • 28. YAML Overall Design. Flink CDC API: YAML. Flink CDC CLI. Flink CDC Composer: Streaming Pipeline, Schema Evolution, Full DB Sync, Sharded Table Sync, Change Data Capture, Batch Pipeline, Schema Inference. Flink CDC Connect: MySQL Source, Doris Sink, StarRocks Sink, …. Flink CDC Runtime: DataSource Operator, DataSink Operator, Schema Registry, Router, Transformer. Flink Runtime: YARN, Kubernetes, Standalone
  • 29. YAML Connector API. DataSource: MetadataAccessor, EventSourceProvider (FlinkSourceProvider, FlinkSourceFunctionProvider). DataSink: EventSinkProvider (FlinkSinkProvider, FlinkSinkFunctionProvider), MetadataApplier
  • 30. YAML Key Feature: Schema Evolution. Components: DataSource, Schema Operator, Post Partitioner, DataSink, SchemaRegistry, MetadataApplier, Paimon. Events: SchemaChangeEvent, DataChangeEvent, FlushEvent. States: IDLE, WAITING, APPLYING, FINISHED. Steps: 1⃣ Schema operator receives SchemaChangeEvent; 2⃣ Schema operator registers the schema change, then waits for the response (holding upstream); it blocks if SchemaRegistry is busy, and other schema operators must wait until other schema changes complete; 3⃣ Schema registry accepts the schema change and rejects following requests; 4⃣ Schema operator broadcasts FlushEvent, then requests the registry again to wait for flush completion; 5⃣ Sink notifies flush complete; 6⃣ Schema registry applies the schema change; 7⃣ Schema registry confirms schema evolution is complete, ready for the next request; 8⃣ Schema operator releases upstream
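    The eight steps on the slide above describe a small coordination protocol between schema operators, sinks and the registry. The sketch below is one possible reading of it, not the actual Flink CDC implementation: it assumes the four state names on the slide belong to the registry and cycle IDLE → WAITING → APPLYING → FINISHED → IDLE, and `SchemaRegistrySketch` and its method names are hypothetical.

```python
from enum import Enum, auto
from typing import List, Optional, Set


class RegistryState(Enum):
    IDLE = auto()
    WAITING = auto()    # waiting for all sinks to report flush complete
    APPLYING = auto()
    FINISHED = auto()


class SchemaRegistrySketch:
    """Toy coordinator for one schema change at a time (assumed semantics)."""

    def __init__(self, num_sinks: int) -> None:
        self.state = RegistryState.IDLE
        self.num_sinks = num_sinks
        self.flushed: Set[int] = set()
        self.pending: Optional[str] = None
        self.applied: List[str] = []  # stands in for the MetadataApplier

    def request_schema_change(self, change: str) -> bool:
        """Steps 2-3: accept when IDLE, reject any request while busy."""
        if self.state is not RegistryState.IDLE:
            return False
        self.pending = change
        self.flushed.clear()
        self.state = RegistryState.WAITING
        return True

    def notify_flush_complete(self, sink_id: int) -> None:
        """Steps 5-6: once every sink has flushed, apply the change."""
        if self.state is not RegistryState.WAITING:
            raise RuntimeError("no schema change is waiting for a flush")
        self.flushed.add(sink_id)
        if len(self.flushed) == self.num_sinks:
            self.state = RegistryState.APPLYING
            self.applied.append(self.pending)
            self.state = RegistryState.FINISHED

    def confirm_finished(self) -> bool:
        """Step 7: report completion and become ready for the next request."""
        if self.state is not RegistryState.FINISHED:
            return False
        self.pending = None
        self.state = RegistryState.IDLE
        return True
```

    In this reading, the schema operator that got `True` from `request_schema_change` broadcasts the FlushEvent and polls `confirm_finished` before releasing upstream (step 8); any operator that got `False` keeps retrying.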
  • 31. YAML Key Feature: Fine-grained Schema Evolution. Lenient Mode (default): keeps data integrity and provides recoverability. Ignore: ignores any schema changes. Try Evolve: tries to apply schema changes, casts data records if that fails. Evolve: applies schema changes, terminates the job if that fails. Exception: rejects any schema changes, terminates once one occurs. SchemaRegistry keeps, per table, an Upstream Schema (db.table1: {id INT, …}) and an Evolved Schema (db.table1: {id INT, …}). Diagram label: Paimon
  • 32. YAML Key Feature: Schema Evolution. $> flink-cdc.sh mysql-to-doris.yaml. Simple but Powerful! Database app_db, shown on both sides of the pipeline with the same tables: products (id, product): (1, Beer), (2, Cap), (3, Peanut); orders (id, price, amount): (1, 4, null), (2, 100, null), (3, 23, 234); shipments (id, product): (1, Beer), (2, Cap), (3, Peanut)
  • 33. YAML Key Feature: Table Routing. Components: DataSource, Schema Operator, Router, DataSink, SchemaRegistry, MetadataApplier. mydb.orders_1, mydb.orders_2, mydb.orders_3 → ods.orders. Routing Usages: change the destination database / table name; merge tables with a multi-to-one rule; route with a custom rule written as a regular expression. Example: mydb.orders_.* → ods.orders
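    The routing rule on the slide above (mydb.orders_.* to ods.orders) can be illustrated with plain regular-expression matching. This is a minimal sketch of the idea only, not the Flink CDC router: `route_table` and the rule tuple format are hypothetical, and the dot between database and table is escaped here because Python treats an unescaped dot as "any character".

```python
import re
from typing import List, Tuple

# (source table pattern, sink table) pairs; first matching rule wins.
RouteRule = Tuple[str, str]


def route_table(table_id: str, rules: List[RouteRule]) -> str:
    """Return the sink table for table_id, or table_id itself if no rule matches."""
    for pattern, sink_table in rules:
        if re.fullmatch(pattern, table_id):
            return sink_table
    return table_id


# Multi-to-one merge of sharded tables, as in the slide's example.
rules: List[RouteRule] = [(r"mydb\.orders_.*", "ods.orders")]
routed = [route_table(t, rules) for t in ("mydb.orders_1", "mydb.orders_2", "mydb.orders_3")]
```

    Tables that match no rule keep their original name in this sketch, which corresponds to syncing them under the same database and table name.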
  • 34. YAML Key Feature: Table Routing. $> flink-cdc.sh mysql-to-doris-route.yaml. Source my_db: user01 (name, age): (Aki, 34), (Alice, 30); user02 (name, age): (Bob, 13), (Ben, 32); user03 (name, age): (Paul, 45), (Tony, 24). Sink ods_db: users (name, age): (Paul, 45), (Tony, 24), (Aki, 34), (Alice, 30), (Bob, 13), (Ben, 32)
  • 35. YAML Key Feature: Transform (Projection, Filter, UDF). Components: Data Source, PreTransform, PostTransform, Schema Operator, Router, Data Sink. PreTransform: trim unused columns. PostTransform: evaluate calculated columns, filter out rows. Schemas: Original Schema, Pre-Transformed Schema, Transformed Schema
  • 36. YAML Key Feature: Transform (Projection, Filter, UDF). $> flink-cdc.sh mysql-to-doris-filter-orders.yaml. Source app_db: orders (id, price, amount): (1, 4, 1), (2, 100, 1), (3, 23, 5). Sink ods_db: filtered_orders (id, price, amount): (2, 100, 1), (3, 23, 5)
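    The slide above shows rows before and after a transform but not the rule itself. The sketch below reproduces the shown result with a projection plus a row filter; the predicate `price > 10` is an assumption picked only because it yields the slide's output, and `transform` is a hypothetical helper, not the Flink CDC transform operator.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]


def transform(rows: Iterable[Row], projection: List[str],
              predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep rows that satisfy predicate, then keep only the projected columns."""
    return [{col: row[col] for col in projection} for row in rows if predicate(row)]


# app_db.orders as shown on the slide.
orders: List[Row] = [
    {"id": 1, "price": 4, "amount": 1},
    {"id": 2, "price": 100, "amount": 1},
    {"id": 3, "price": 23, "amount": 5},
]

# Assumed filter rule; the real rule lives in mysql-to-doris-filter-orders.yaml.
filtered_orders = transform(orders, ["id", "price", "amount"], lambda r: r["price"] > 10)
```

    The filter is evaluated on the full row before projection in this sketch, so a rule may reference a column that the projection later drops.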
  • 37. Community and Future Plan
  • 38. Our Community. Commits 1000+: 714 → 1099 (+54%). Stars 5000+: 4292 → 5504 (+28%). Contributors 100+: 101 → 143 (+41%)
  • 39. Future Plan. Expanding more Scenarios: support batch pipelines; support AI models in Transform; support more external systems, e.g. Iceberg and ClickHouse; support more schema change types and data types. Stability Improvement: configurable exception handling and rate limiting; compatibility with more Flink versions; bump SDK versions such as Debezium, OBLogproxy, TiCDC
  • 40. Join Us In Community. As the most active sub-project of Apache Flink, we: ➢ Share the same contribution steps as Apache Flink ➢ Have an independent documentation website: https://nightlies.apache.org/flink/flink-cdc-docs-stable ➢ Discuss on the Flink mailing lists: [email protected] / [email protected] ➢ Create issues on Apache JIRA: https://issues.apache.org/jira ➢ Submit PRs on GitHub: https://github.com/apache/flink-cdc
  • 41. THANK YOU