Exploring Scenarios of Flink CDC in Streaming Data Integration

Flink Forward 2024 ©
Exploring Scenarios of Flink CDC
in Streaming Data Integration
Leonard Xu @ Alibaba Cloud
Flink PMC Member & Committer, Flink CDC Lead

01 Flink CDC Overview
02 Why Flink CDC YAML
03 CDC YAML Internals
04 Community and Future Plan

Flink CDC Overview

What is Flink CDC
Flink CDC is a streaming data integration tool that implements unified snapshot reading and
incremental reading based on the CDC (Change Data Capture) technology of database logs.
Combined with Flink's excellent pipeline capabilities and rich upstream and downstream
ecosystem, Flink CDC can efficiently achieve real-time integration of massive data.
Flink CDC
Real-time Snapshot
Data Snapshot
Change Log

Usages of Flink CDC
Data Synchronization
Real-time Materialized View
Data Distribution
Data Integration
Flink CDC

Traditional CDC Pipeline
DataX / Sqoop
Snapshot Sync
Debezium / Canal
Changelog Sync
Merge
Merged Table
Incremental Table
Snapshot Table
Data INFRAs 😔
Data Consistency 😔
Data Freshness 😔
Data Stack 😔
DB
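
The Merge step in this pipeline combines the snapshot table with the incremental table. A minimal sketch of that idea, assuming rows keyed by an `id` primary key and changelog entries tagged `upsert` or `delete`; the function name and record layout are illustrative and are not taken from DataX, Sqoop, Debezium or Canal.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def merge_tables(snapshot: List[Row], changes: List[Tuple[str, Row]]) -> List[Row]:
    """Replay changelog entries, in order, on top of a snapshot keyed by 'id'."""
    merged = {row["id"]: row for row in snapshot}
    for op, row in changes:
        if op == "delete":
            merged.pop(row["id"], None)
        elif op == "upsert":
            merged[row["id"]] = row
        else:
            raise ValueError(f"unknown op: {op}")
    return sorted(merged.values(), key=lambda r: r["id"])


snapshot_table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
incremental_table = [
    ("upsert", {"id": 2, "v": "B"}),
    ("delete", {"id": 1}),
    ("upsert", {"id": 3, "v": "c"}),
]
merged_table = merge_tables(snapshot_table, incremental_table)
```

Because the two inputs come from separate tools and are merged later, the merged table is only as fresh as the last merge run.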

With Flink CDC
Canal / Debezium
Changelog Sync
DataX / Sqoop
Snapshot Sync
Real-time
Snapshot
CDC Source Sink
Custom Logics
…
Merge
Merged Table
Incremental Table
Snapshot Table
Unified Sync 😄
Exactly-Once 😄
Low Latency 😄
One Flink Job 😄

Our Milestones
2020 / 07: Kick Off. First version of MySQL CDC and Postgres CDC Connector
2021 / 08: Release 2.0. MySQL CDC implements Incremental Snapshot Algorithm
2022 / 11: Release 2.3. Incremental Snapshot Framework, covers key connectors
2023 / 10: Release 3.0. YAML API, end-to-end streaming data integration tool
2024 / 01: Donated to ASF
2024 / 09: Release 3.2. Supports transform (Projection, Filter, UDF) in YAML API

Extraction (E) Transform (T) Load (L)
Flink
Flink
Debezium
TiDB
ClickHouse
Iceberg
Hudi
Paimon
TiDB
ClickHouse
Iceberg
Hudi
Paimon
MySQL CDC Source
Flink CDC 1.x: CDC Source Connector

1) Lock table for data consistency
2) Scan snapshot data of table (JDBC connection)
3) Release table lock after scan
4) Append changelog of table (Binlog connection)
DB
Flink CDC 1.x: CDC Source Connector

Flink CDC 2.0: Incremental Snapshot Algorithm
DB
JDBC con
JDBC con
JDBC con
binlog con
No-Lock Algorithm
Task1
Task2
Task3
Parallel Snapshot Scan Changelog dump
Auto Switch
chunk2
chunk3
chunk1
Task4
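
The diagram above shows the snapshot split into chunks that are scanned by parallel tasks. A minimal sketch of that splitting, assuming an integer primary key with known minimum and maximum values; `split_into_chunks` and `assign_round_robin` are hypothetical helpers, not Flink CDC APIs, and the sketch leaves out how each chunk is reconciled with concurrent changelog events.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int]  # half-open primary-key range [start, end)


def split_into_chunks(min_key: int, max_key: int, chunk_size: int) -> List[Chunk]:
    """Split the inclusive key range [min_key, max_key] into half-open chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = min_key
    while start <= max_key:
        end = min(start + chunk_size, max_key + 1)
        chunks.append((start, end))
        start = end
    return chunks


def assign_round_robin(chunks: List[Chunk], num_tasks: int) -> Dict[int, List[Chunk]]:
    """Distribute chunks over task indexes 0..num_tasks-1."""
    assignment: Dict[int, List[Chunk]] = {task: [] for task in range(num_tasks)}
    for i, chunk in enumerate(chunks):
        assignment[i % num_tasks].append(chunk)
    return assignment


chunks = split_into_chunks(min_key=1, max_key=10, chunk_size=4)
plan = assign_round_robin(chunks, num_tasks=3)
```

Each task only needs its own JDBC connection and key range, which is why the scan can run in parallel without a table lock in this simplified picture.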

Flink CDC 2.x: Incremental Snapshot Framework
No-Lock Algorithm
Task1
Task2
Task3
Parallel Snapshot Scan Changelog dump
Auto Switch
chunk2
chunk3
chunk1
Task4
ApsaraDB MySQL
More sources are on the way

Flink CDC 2.x Recap
Integrity
● Only a source, not an entire data pipeline
● SQL job's schema is fixed; each table needs an operator
Maintainability
● What if the upstream schema changes?
● What if upstream tables are added?
Scalability
● Sync thousands of tables at once
● Sync an entire DB in one pipeline
Copyright
● Apache V2 License
● Project belongs to Ververica

Flink CDC 3.0 Motivation
End to End
● End-to-end data pipeline
● YAML API, simple but powerful
Automation
● Automated schema evolution
● Automated capture of newly added tables
Flexible
● Full DB sync via regular expression
● Flush to multiple tables in one sink
Donation
● Donate Flink CDC to Apache Flink
● Project belongs to ASF

Flink CDC 3.0: End to end streaming data integration tool
AI / ML
Analytics / BI
Database
Data Lake
Data Warehouse
Flink CDC

$> flink-cdc.sh mysql-to-doris.yaml
mysql-to-doris.yaml
Simple but Powerful!
Flink CDC 3.0: End to end streaming data integration tool

Flink CDC 3.0: Donated to Apache Flink

Why Flink CDC YAML

YAML API: Designed for Data Integration Users
Users Don’t Care How it works
Users Care What they Need
Paimon

Write Pipeline via YAML API
YAML Language
Easy for users to write
Friendly for machine transmission
Focus on Data Integration Scenarios
Just specify sync source and destination
Routing and transformations
No PhD in Flink needed

TiDB
Hologres
ClickHouse
Iceberg
Hudi
…
TiDB
ApsaraDB MySQL
Flink CDC APIs
SELECT WHERE
JOIN Top-N
Flink SQL API
GROUP BY
INSERT
map filter
join
Flink DataStream API
keyBy
flatMap
Schema
Evolution
Schema
Sync
SELECT Filter
CDC YAML API
Full DB
Sync
UDF
aggregate

Flink CDC APIs
Flink Runtime
Flink CDC Runtime
CDC Sources
SQL API YAML API
YAML Sinks
CDC client
(TiDB,MongoDB…)
Debezium
(MySQL,PG,Oracle..)
DataStream API

RowData
Delete
Insert
Update
Before
Update
After
Flink SQL Pipeline
DataChangeEvent
Delete
Insert
Update
Before
After
CreateTableEvent
SchemaChangeEvent
AddColumnEvent
TruncateTableEvent
…
CDC YAML Pipeline
SQL Pipeline vs YAML Pipeline
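
The difference between the two event models can be sketched as follows. This is an illustrative model only: the `DataChangeEvent` class below is a hypothetical stand-in that borrows the slide's name, not the Flink CDC class, and `to_row_kinds` shows how a single update event flattens into the separate Update Before and Update After rows of a SQL-style changelog.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Row = Tuple[str, int]  # (user, id), mirroring the example columns in this deck


@dataclass(frozen=True)
class DataChangeEvent:
    """Hypothetical stand-in: one event carries both images of the row."""
    op: str                 # "insert", "update" or "delete"
    before: Optional[Row]
    after: Optional[Row]


def to_row_kinds(event: DataChangeEvent) -> List[Tuple[str, Optional[Row]]]:
    """Flatten an event into SQL-style changelog rows (+I, -U, +U, -D)."""
    if event.op == "insert":
        return [("+I", event.after)]
    if event.op == "delete":
        return [("-D", event.before)]
    if event.op == "update":
        return [("-U", event.before), ("+U", event.after)]
    raise ValueError(f"unknown op: {event.op}")


update = DataChangeEvent(op="update", before=("Leonard", 1), after=("Leonard", 3))
rows = to_row_kinds(update)
```

Keeping both images in one event lets a sink see the original update as a unit instead of reassembling it from two rows.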

Flink SQL Pipeline
😔 Must manually write CREATE TABLE and INSERT INTO statements
😔 Cannot process schema changes
😔 Breaks the original changelog: an update is split into Update Before and Update After
😔 Reads/writes a single table per TableSource/TableSink operator
CDC YAML Pipeline
😊 Schema discovery, full DB synchronization
😊 Schema evolution with multiple strategies
😊 Synchronizes the original changelog
😊 Reads/writes multiple tables in one Source/Sink
SQL Pipeline vs YAML Pipeline

StreamRecord
Flink DataStream Pipeline
DataChangeEvent
CreateTableEvent
SchemaChangeEvent
CDC YAML Pipeline
op (user, STRING) (id, INT)
(user, STRING) (id, INT)
+I Leonard 1
U Leonard 1 3
-D Leonard 3
+I Leonard 1
-D Leonard 3
BinaryData + Schemaless
DataStream Pipeline vs YAML Pipeline
U Leonard 1 3

DataStream Pipeline vs YAML Pipeline
Flink DataStream Pipeline
😔 Java expert required, distributed system programming
😔 Flink expert required: DataStream API, State, Checkpoint, Runtime
😔 Skills for Maven and dependency management
😔 Hard to reuse even once you've implemented one
CDC YAML Pipeline
😊 Designed for all users instead of experts
😊 Build a powerful pipeline via YAML; underlying details are hidden
😊 YAML is easy to understand and learn
😊 Easy to create a new pipeline with Ctrl + C/V

CDC YAML Internals

YAML Overall Design
YAML
Flink CDC CLI
Flink CDC Composer
Streaming Pipeline, Schema Evolution, Full DB Sync, Sharded Table Sync
Change Data Capture, Batch Pipeline, Schema Inference
Flink CDC Runtime: DataSource Operator, DataSink Operator, Schema Registry, Router, Transformer
Flink Runtime
YARN, Kubernetes, Standalone
Flink CDC Connect: MySQL Source, Doris Sink, StarRocks Sink, …
Flink CDC Composer
Flink CDC API

YAML Connector API
DataSource
MetadataAccessor
EventSourceProvider
FlinkSourceProvider FlinkSourceFunctionProvider
DataSink
EventSinkProvider
FlinkSinkProvider FlinkSinkFunctionProvider
MetadataApplier

YAML Key Feature: Schema Evolution
SchemaChangeEvent
DataChangeEvent
FlushEvent
SchemaRegistry
1) Schema operator receives SchemaChangeEvent
2) Schema operator registers the schema change, then waits for the response (holding upstream). Blocks if SchemaRegistry is busy; any other schema operator must wait until the in-flight schema change completes
3) Schema registry accepts the schema change and rejects subsequent requests
4) Schema operator broadcasts FlushEvent, then requests the registry again to wait for flush completion
5) Sink notifies flush complete
6) Schema registry applies the schema change
7) Schema registry confirms schema evolution is complete and is ready for the next request
8) Schema operator releases upstream
Schema Operator DataSink
Post Partitioner
DataSource
MetadataApplier
APPLYING
IDLE
FINISHED
WAITING
Paimon
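
The coordination on this slide can be sketched as a small state machine. This is a simplified, single-threaded model, not the Flink CDC implementation: the class and method names are hypothetical, and the mapping of the four state names (IDLE, WAITING, APPLYING, FINISHED) onto the numbered steps is an assumption, since the slide only lists the names.

```python
from enum import Enum
from typing import List, Optional, Set


class State(Enum):
    IDLE = "IDLE"            # ready for a new schema change request
    WAITING = "WAITING"      # change accepted, waiting for every sink to flush
    APPLYING = "APPLYING"    # applying the change to the downstream system
    FINISHED = "FINISHED"    # applied, waiting for the requester to collect the result


class SchemaRegistry:
    def __init__(self, num_sinks: int) -> None:
        self.state = State.IDLE
        self.num_sinks = num_sinks
        self.flushed: Set[int] = set()
        self.pending: Optional[str] = None
        self.applied: List[str] = []

    def request_change(self, change: str) -> bool:
        """Accept one change at a time; reject requests while busy."""
        if self.state is not State.IDLE:
            return False
        self.pending = change
        self.flushed.clear()
        self.state = State.WAITING
        return True

    def notify_flushed(self, sink_id: int) -> None:
        """Called by a sink after it has flushed data written with the old schema."""
        if self.state is not State.WAITING:
            raise RuntimeError("no schema change is waiting for a flush")
        self.flushed.add(sink_id)
        if len(self.flushed) == self.num_sinks:
            self.state = State.APPLYING
            self.applied.append(self.pending)  # stands in for the MetadataApplier call
            self.state = State.FINISHED

    def collect_result(self) -> str:
        """Called by the schema operator before it releases upstream."""
        if self.state is not State.FINISHED:
            raise RuntimeError("schema change has not finished")
        change, self.pending = self.pending, None
        self.state = State.IDLE
        return change


registry = SchemaRegistry(num_sinks=2)
accepted = registry.request_change("ADD COLUMN amount")
rejected = registry.request_change("DROP COLUMN price")
registry.notify_flushed(0)
state_after_one_flush = registry.state
registry.notify_flushed(1)
state_after_all_flushes = registry.state
finished_change = registry.collect_result()
```

Waiting for every sink to flush before applying the change keeps records written with the old schema from being mixed with the new one.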

YAML Key Feature: Fine-grained Schema Evolution
Paimon
Lenient (Default): Keeps data integrity and provides recoverability
Ignore: Ignores any schema changes
Try Evolve: Tries to apply schema changes; casts data records if that fails
Evolve: Applies schema changes; terminates the job if that fails
Exception: Rejects any schema changes; terminates the job as soon as one occurs
SchemaRegistry
Table Upstream
Schema
db.table1 {id INT, …}
Table Evolved
Schema
db.table1 {id INT, …}
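
The behaviors listed on this slide can be read as a dispatch on one setting. A minimal sketch, assuming the mode is passed as a lowercase string named after the slide's labels and that applying a change is a callable that raises on failure; `handle_schema_change` is a hypothetical helper, and Lenient is left out because the slide does not spell out its rules.

```python
from typing import Callable


def handle_schema_change(mode: str, apply_change: Callable[[], None]) -> str:
    """Return what happens next: 'applied', 'ignored' or 'cast_records'."""
    if mode == "ignore":
        return "ignored"
    if mode == "exception":
        # Any schema change terminates the job in this mode.
        raise RuntimeError("schema change rejected; terminating job")
    if mode in ("evolve", "try_evolve"):
        try:
            apply_change()
            return "applied"
        except Exception:
            if mode == "evolve":
                raise  # evolve: a failed change terminates the job
            return "cast_records"  # try_evolve: fall back to casting records
    raise ValueError(f"unsupported mode: {mode}")


def failing_change() -> None:
    raise OSError("sink rejected the schema change")


outcome_ignore = handle_schema_change("ignore", failing_change)
outcome_try = handle_schema_change("try_evolve", failing_change)
outcome_evolve = handle_schema_change("evolve", lambda: None)
```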

$> flink-cdc.sh mysql-to-doris.yaml
mysql-to-doris.yaml
app_db
products
id product
1 Beer
2 Cap
3 Peanut
orders
id price amount
1 4 null
2 100 null
3 23 234
shipments
id product
1 Beer
2 Cap
3 Peanut
products
id product
1 Beer
2 Cap
3 Peanut
orders
id price amount
1 4 null
2 100 null
3 23 234
shipments
id product
1 Beer
2 Cap
3 Peanut
app_db
Simple but Powerful!
YAML Key Feature: Schema Evolution

YAML Key Feature: Table Routing
DataSource
mydb.orders_1
mydb.orders_2
mydb.orders_3
mydb.orders_1
mydb.orders_2
mydb.orders_3
Schema
Operator
DataSink
Router
SchemaRegistry
MetadataApplier
ods.orders
Routing Usages:
Change destination database / table name
Table merge with multi-to-one rule
Route with custom rule in regular expression
Example: mydb.orders_.* → ods.orders
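
The multi-to-one routing rule can be sketched in a few lines. This is an illustration using Python's `re` semantics, with the database/table separator escaped; Flink CDC's own pattern syntax may differ. The `route` helper is hypothetical, and passing unmatched tables through unchanged is an assumption.

```python
import re
from typing import List, Tuple

# (source-table regex, sink table) pairs; the rule mirrors the slide's example.
ROUTES: List[Tuple[str, str]] = [
    (r"mydb\.orders_.*", "ods.orders"),
]


def route(table_id: str, routes: List[Tuple[str, str]] = ROUTES) -> str:
    """Return the sink table for table_id; unmatched tables keep their name."""
    for pattern, sink_table in routes:
        if re.fullmatch(pattern, table_id):
            return sink_table
    return table_id


routed = [route(t) for t in ("mydb.orders_1", "mydb.orders_2", "mydb.orders_3")]
unrouted = route("mydb.customers")
```

All three sharded source tables resolve to the same sink table, which is the table-merge use case listed above.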

YAML Key Feature: Table Routing
my_db
$> flink-cdc.sh mysql-to-doris-route.yaml
user01
name age
Aki 34
Alice 30
ods_db
user02
name age
Bob 13
Ben 32
user03
name age
Paul 45
Tony 24
users
name age
Paul 45
Tony 24
Aki 34
Alice 30
Bob 13
Ben 32

YAML Key Feature: Transform (Projection,Filter,UDF)
Data Source Schema Operator Router Data Sink
PreTransform: trim unused columns (Original Schema → Pre-Transformed Schema)
PostTransform: evaluate calculated columns, filter out rows (Pre-Transformed Schema → Transformed Schema)
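
The two transform stages can be sketched on plain dictionaries. This is a simplified model, not Flink CDC code: the function names, the `note` and `total` columns, and the `total > 50` predicate are all hypothetical, and applying the filter after the calculated columns are added is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]


def pre_transform(record: Record, used_columns: Iterable[str]) -> Record:
    """Trim columns that no later step refers to."""
    return {name: record[name] for name in used_columns}


def post_transform(
    record: Record,
    computed: Dict[str, Callable[[Record], object]],
    row_filter: Callable[[Record], bool],
) -> Optional[Record]:
    """Add calculated columns, then drop the row if the filter rejects it."""
    out = dict(record)
    for name, fn in computed.items():
        out[name] = fn(record)
    return out if row_filter(out) else None


orders: List[Record] = [
    {"id": 1, "price": 4, "amount": 1, "note": "x"},
    {"id": 2, "price": 100, "amount": 1, "note": "y"},
    {"id": 3, "price": 23, "amount": 5, "note": "z"},
]

transformed: List[Record] = []
for row in orders:
    trimmed = pre_transform(row, used_columns=["id", "price", "amount"])
    result = post_transform(
        trimmed,
        computed={"total": lambda r: r["price"] * r["amount"]},
        row_filter=lambda r: r["total"] > 50,
    )
    if result is not None:
        transformed.append(result)
```

Trimming first means the later stages, and the schema they expose downstream, only carry the columns that are actually used.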

YAML Key Feature: Transform (Projection,Filter,UDF)
app_db
$> flink-cdc.sh mysql-to-doris-filter-orders.yaml
orders
id price amount
1 4 1
2 100 1
3 23 5
filtered_orders
id price amount
2 100 1
3 23 5
ods_db
mysql-to-doris-filter-orders.yaml

Community and Future Plan

Our Community
Commits: 1000+ (714 → 1099, +54%)
Stars: 5000+ (4292 → 5504, +28%)
Contributors: 100+ (101 → 143, +41%)

Future Plan
Expanding more Scenarios
Support batch pipeline
Support AI Model in Transform
Support more external systems, e.g. Iceberg and ClickHouse
Support more schema change types and data types
Stability Improvement
Configurable exception handling, rate limiting
Compatible with more Flink versions
Bump SDK versions like Debezium, OBLogproxy, TiCDC

Join Us In Community
As the most active sub-project of Apache Flink, we:
➢ Share the same contribution steps as Apache Flink
➢ Have an independent documentation website:
https://nightlies.apache.org/flink/flink-cdc-docs-stable
➢ Discuss on the Flink mailing lists: dev@flink.apache.org / user@flink.apache.org
➢ Create issues on Apache JIRA: https://issues.apache.org/jira
➢ Submit PRs on GitHub: https://github.com/apache/flink-cdc

THANK YOU
