3. Today’s Hosts and Speakers
HyunSoo Kim, Senior Solutions Engineer, Confluent Korea
Junhee Shin, Solutions Engineer, Confluent Korea
Jupil Hwang, Senior Solutions Engineer, Confluent Korea
13. …But without a data streaming platform, bad data spreads across the entire organization
It is just like leaving muddy footprints on the lake!
Data Warehouse: Scalable and high performance for queries and historical analyses
Data Lake: Scalable and flexible for storing unstructured data
“Lakehouse”: Combines the advantages of DWH and DL
14. Today’s approach to data pipelines is the root cause of data problems
OPERATIONAL SYSTEMS: Domain 1 Database, Domain 2 Database, Domain 3 Database, Domain 4 Database
ETL/ELT PIPELINES
ANALYTICAL SYSTEMS: Data Lake, Lake House, Data Mart, Data Warehouse, ML/AI, Reports & Dashboards
15. DATA WAREHOUSE / DATA LAKE
ML/AI
Dashboards
OPERATIONAL DATA
Poor decision making
with stale data
5 / 30 / 60 min batch ingestion
Poor lineage and governance
and increasing pipeline sprawl
Cascading data pollution and failures
Time
Batch 1
Process
Batch 2
Process
Batch 3
Process
Batch 4
Process
Complex remodelling and reprocessing = $$$
“JUST-ENOUGH” CLEANSED DATA
READY-TO-USE BUSINESS DATA
RAW DATA DUMPS
Reports
ELT pipelines are fragile, slow, and inefficient
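To put a number on "stale data", here is a minimal Python sketch of the worst-case data age when batch stages are chained. Treating the 5, 30 and 60 minute intervals from the slide as three consecutive stages is an assumption made only for illustration, and the function name is hypothetical.

```python
from typing import Sequence


def worst_case_staleness_minutes(batch_intervals: Sequence[int]) -> int:
    """Upper bound on data age across chained batch stages.

    Assumes each stage runs on a fixed schedule and a record can just
    miss a run, so it waits up to one full interval per stage.
    Processing time inside each stage is ignored.
    """
    return sum(batch_intervals)


# Hypothetical pipeline: raw dump every 5 min, cleansing every 30 min,
# business-ready tables every 60 min.
stages = [5, 30, 60]
print(worst_case_staleness_minutes(stages))  # 5 + 30 + 60 = 95
```

Under these assumptions a dashboard can be reading data that is more than an hour and a half old even though every stage "runs on time".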
16. In modern applications, there are also cases where data has to flow “upstream”
OPERATIONAL SYSTEMS: Domain 1 Database, Domain 2 Database, Domain 3 Database, Domain 4 Database
ETL/ELT PIPELINES
ANALYTICAL SYSTEMS: Data Lake, Lake House, Data Mart, Data Warehouse, ML/AI, Reports & Dashboards
REVERSE ETL
More batch tools are bolted on to reverse the flow of data, from data warehouses and data lakes back to operational systems and apps, for “real-time” use cases
19. Advantages of the Confluent data streaming platform
Streaming
Continuously capture and share
real-time data everywhere - to
your data warehouse, data lake and
operational systems and apps
Schema Management
Reduce faulty data downstream
by enforcing quality checks
and controls in the pipeline
with data contracts
Flink
Continuously process real-time data,
the moment it’s created, for well-curated
reusable data products
Data Portal
Enable anyone with the right
access controls to effortlessly
explore and use real-time
data products for greater
data autonomy
Tableflow
Simplify representing
your operational data as a
ready-to-use Iceberg table
in just one click
Stream Lineage
Understand the complex
data relationships and the
data journey to ensure
trustworthiness
Focus of today’s session
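The Schema Management item above talks about enforcing quality checks with data contracts before data moves downstream. The following is a minimal Python sketch of that idea only. The contract format, field names and function names are hypothetical and are not Schema Registry's actual API.

```python
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical contract: field name -> required Python type.
ORDER_CONTRACT: Dict[str, type] = {
    "order_id": str,
    "amount": float,
    "currency": str,
}


def violations(record: Record, contract: Dict[str, type]) -> List[str]:
    """List every way a record breaks the contract (empty list = valid)."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    return problems


def split_by_contract(
    records: Iterable[Record], contract: Dict[str, type]
) -> Tuple[List[Record], List[Record]]:
    """Route valid records downstream and keep invalid ones aside."""
    good, bad = [], []
    for record in records:
        (bad if violations(record, contract) else good).append(record)
    return good, bad
```

The point of the sketch is where the check sits: faulty records are stopped at the pipeline, before any warehouse, lake or application reads them.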
20. How Shift Left Works
Write data once, read it as a stream or a table
Stream processing
(Focus of today’s session)
Diagram labels: Data Stream; Data Product; Schema Registry; Tableflow (Iceberg); Third Party Compute Engines; Databases; Log data & messaging systems; Custom Apps & Microservices; Enterprise Resource Planning systems; Operational Apps & Data Systems; Data Warehouses / Data Lakes; Stream (Kafka); Event-Driven Design; Decoupled Architecture; Immutable Logs; Connect; READ AS; COMING SOON; Stream Lineage; Stream Catalog; Data Portal
21. Stream processing supports a broad range of use cases tied to business value

Category: Data Pipelines (“Shift Left”)
Description: Reduce DWH / DL costs by ingesting data from operational systems and apps, attaching schema, and processing it with Flink, in order to share high-quality streams to analytics systems (e.g., Snowflake, Databricks) in real time
Sample Use Cases (Technical and Business):
● Real-time search index building
● ML pipelines
● Data warehouse modernization
● Database modernization
● Data lake ingestion
● Reporting and analytics

Category: Real-time Analytics
Description: Continuously analyze and update results as data streams are produced for real-time dashboarding via a RT analytics DB (e.g., Druid, Rockset, Pinot)
Sample Use Cases (Technical and Business):
● Ad/campaign performance
● Content performance
● Quality monitoring of Telco networks
● Large-scale graph analysis

Category: Event Driven Applications
Description: Analyze data streams over time windows to detect patterns and react to incoming events by triggering computations, state updates, or external actions (i.e., microservices)
Sample Use Cases (Technical and Business):
● Fraud detection
● Anomaly detection
● Alerting/notifications
● Routing
● Business process monitoring
● Bad experience detection
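The Event Driven Applications description (analyze streams over time windows, detect a pattern, react) can be sketched in plain Python. This is a rough illustration, assuming fixed-size (tumbling) windows and a simple count threshold as the "pattern". It processes a finite list, whereas a real stream processor would emit results as each window closes. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tumbling_window_alerts(
    events: Iterable[Tuple[int, str]],
    window_seconds: int,
    threshold: int,
) -> List[Tuple[int, str, int]]:
    """Count events per key in fixed windows and flag busy keys.

    events: (epoch_seconds, key) pairs.
    Returns (window_start, key, count) for every key whose count
    within one window reaches the threshold.
    """
    counts: Counter = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[(window_start, key)] += 1
    return sorted((w, k, c) for (w, k), c in counts.items() if c >= threshold)


# Hypothetical card transactions: three within the first minute for card-1.
events = [(0, "card-1"), (10, "card-1"), (50, "card-1"), (61, "card-1"), (5, "card-2")]
print(tumbling_window_alerts(events, window_seconds=60, threshold=3))
# [(0, 'card-1', 3)]
```

Each returned tuple is where an application would trigger its reaction, for example a fraud alert or a call to another microservice.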
22. Stream processing for Kafka

Kafka ecosystem
Kafka Streams: Client library deployed to a Java runtime. Self-hosted. Input and output data are stored in a single Kafka cluster. Language: Java. License: Open Source.
ksqlDB: Standalone SQL engine built on top of Kafka Streams. Input and output data are stored in a single Kafka cluster. Language: SQL. License: Community Source.

Flink
Framework and distributed engine for stateful computations over unbounded and bounded data streams. Languages: Java, Python, SQL. License: Open Source.
23. Real-time services depend on stream processing
Real-time Data: A Sale, A Shipment, A Trade, A Customer Experience
Real-Time Backend Operations
Real-time Stream Processing
45. Use Actions to simplify deploying stream processing jobs for common use cases
Deduplicate topic
Generate a topic containing only
unique records from an input topic
Mask fields
Generate a topic containing masked
fields from an input topic
Filter topic
Filter a topic based on a given set of
conditions
Apply a transformation
Transform a topic based on a set of
provided expressions
COMING SOON
Actions provide pre-packaged, turnkey stream
processing workloads that run on Flink
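To make the Deduplicate, Mask and Filter actions concrete, here is a minimal Python sketch of what each one does to a sequence of records. It illustrates the semantics only and is not how Actions are implemented on Flink. The function names, the record shape and the mask string are assumptions.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Set

Record = Dict[str, Any]


def deduplicate(records: Iterable[Record], key_field: str) -> Iterator[Record]:
    """Keep only the first record seen for each key."""
    seen: Set[Any] = set()  # unbounded here; a real job would bound this state
    for record in records:
        key = record[key_field]
        if key not in seen:
            seen.add(key)
            yield record


def mask_fields(
    records: Iterable[Record], fields: Set[str], mask: str = "****"
) -> Iterator[Record]:
    """Replace the values of the given fields with a fixed mask."""
    for record in records:
        yield {k: (mask if k in fields else v) for k, v in record.items()}


def filter_records(
    records: Iterable[Record], predicate: Callable[[Record], bool]
) -> Iterator[Record]:
    """Pass through only the records that satisfy the condition."""
    return (record for record in records if predicate(record))
```

Because each step takes and returns an iterator, the three can be chained, much as each action reads one input topic and produces one output topic:

```python
orders = [
    {"id": 1, "email": "a@x.io", "amount": 5},
    {"id": 1, "email": "a@x.io", "amount": 5},   # duplicate
    {"id": 2, "email": "b@x.io", "amount": 50},
]
unique = deduplicate(orders, "id")
masked = mask_fields(unique, {"email"})
large = list(filter_records(masked, lambda r: r["amount"] >= 10))
```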
46. Extend Flink SQL with user-defined functions
User-defined functions (UDFs) let you write custom functions that implement complex logic not natively supported in Flink SQL.
● Tailor processing to specific use cases
● Reuse across multiple applications
● Work in your preferred programming language
Java UDF
SQL query
EARLY ACCESS
NOTE: Early Access is open to a limited number of candidates. Only Java and scalar functions are supported initially. Python support planned for 2H ’24.
UDF
arguments
UDF
result
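The diagram's flow (UDF arguments in, UDF result out, called from a SQL query) describes a scalar function: one value per input row. The sketch below shows only that shape in plain Python. It is not the Flink UDF API, which per the note above initially supports Java only. The function name and the sample rows are hypothetical.

```python
from typing import Any, Dict, List


def mask_email(email: str) -> str:
    """Scalar function: one argument in, one result out, no state."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return email  # not an email address; leave unchanged
    return local[0] + "***@" + domain


rows: List[Dict[str, Any]] = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "not-an-email"},
]

# Rough analogue of: SELECT id, mask_email(email) AS email FROM rows
projected = [{"id": r["id"], "email": mask_email(r["email"])} for r in rows]
print(projected)
```

The function is applied once per row and knows nothing about the query around it, which is what makes the same function reusable across applications.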
47. Broader access to serverless Flink through Table API support
The Table API (Open Preview) offers programmatic control from Java or Python, with rich operations and transformations that integrate smoothly into existing codebases.
● Build streaming applications using familiar constructs.
● Use an imperative programming style.
● Simplify development, testing, and maintenance through a structured, strongly typed design.
Track status of Table API statements
Use full capabilities of modern IDEs
48. Advanced SQL streaming operators

Time Windows
● Time-based windows
● Event-density windows
● Event-based windows: every single event can trigger a new window

Pattern Matching
● Complex Event Processing
● See sample

Streaming Joins
● Stream-to-stream joins
● Temporal joins
● Lookup joins
● Versioned joins
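Of the join types listed, the temporal (versioned) join is the least intuitive: each event is matched with the version of the other side that was valid at the event's own time, not the latest version. Here is a minimal Python sketch of that lookup. The data shapes and names are assumptions, and the inner-join behaviour (dropping events with no valid version) is a choice made for the example.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def rate_as_of(versions: List[Tuple[int, float]], ts: int) -> Optional[float]:
    """Return the rate valid at time ts.

    versions: (valid_from, rate) pairs sorted by valid_from.
    """
    starts = [valid_from for valid_from, _ in versions]
    i = bisect_right(starts, ts)  # number of versions that started at or before ts
    return versions[i - 1][1] if i > 0 else None


def temporal_join(
    orders: List[Tuple[int, float]], versions: List[Tuple[int, float]]
) -> List[Tuple[int, float, float]]:
    """Enrich each (ts, amount) order with the rate valid at its event time."""
    joined = []
    for ts, amount in orders:
        rate = rate_as_of(versions, ts)
        if rate is not None:  # inner join: drop orders with no valid version
            joined.append((ts, amount, amount * rate))
    return joined


rates = [(100, 2.0), (200, 3.0)]          # rate 2.0 from t=100, 3.0 from t=200
orders = [(50, 10.0), (150, 10.0), (200, 10.0)]
print(temporal_join(orders, rates))
# [(150, 10.0, 20.0), (200, 10.0, 30.0)]
```

The order at t=150 is converted with the rate 2.0 even though 3.0 is the latest rate, which is the behaviour that distinguishes a temporal join from a plain lookup of the current value.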
49. Flink is fully integrated into Confluent Cloud
Fully integrated out of the box
● Connected via Confluent Connector
● Environments are Catalogs
● Kafka Clusters are Databases
● Topics are Tables
● RBAC for managing Flink resources
● Keep in mind: A statement’s access level is determined entirely by the permissions that you attach to the statement
● Schema Registry, Data Portal, Lineage, Consumer/Producer Monitoring, Metrics API…
● Cluster and Pool need to be in the same region and same CSP
● All over the Confluent Organisation including all environments and clusters
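The mapping above (Environment = Catalog, Kafka cluster = Database, Topic = Table) means a topic can be addressed by a three-part SQL name. The helper below is a small Python sketch that builds such a name. The function name and the sample identifiers are hypothetical, and escaping a backtick by doubling it is assumed to follow Flink SQL's identifier quoting.

```python
def qualified_table_name(environment: str, cluster: str, topic: str) -> str:
    """Build a backtick-quoted catalog.database.table name.

    Follows the mapping: environment -> catalog, Kafka cluster -> database,
    topic -> table.
    """

    def quote(identifier: str) -> str:
        # Assumption: a literal backtick inside an identifier is doubled.
        return "`" + identifier.replace("`", "``") + "`"

    return ".".join(quote(part) for part in (environment, cluster, topic))


# Hypothetical names; quoting keeps the hyphen in the cluster name valid.
print(qualified_table_name("prod", "orders-cluster", "orders"))
# `prod`.`orders-cluster`.`orders`
```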
50. Flink private networking for Dedicated clusters on AWS
With private networking support for Flink, Confluent users can:
● Strengthen data security and privacy
● Simplify secure network configuration
● Run secure and flexible stream processing across clusters and environments
Diagram labels: Env A; Env B; PUBLIC (Internet); PLATT (Private Link (AWS)); Private Cluster (Dedicated, Enterprise); Public Cluster (Dedicated, Standard, Basic)
No access to private clusters
No cross-env access
No egress to public clusters
51. Key differences between Flink, Kafka Streams, and ksqlDB
Columns: CP Flink, CC Flink, Kafka Streams, ksqlDB

Attribute: Description
CP Flink and CC Flink: Stream processing framework developed independently of Apache Kafka
Kafka Streams: Embeddable client library for Java applications that is part of the Apache Kafka project
ksqlDB: Stream processing framework that exposes Kafka Streams functionality through SQL

Attribute: Processing modes
CP Flink and CC Flink:
● Unified stream and batch processing
● Supports reads from multiple Kafka clusters
Kafka Streams:
● Stream processing only
● Supports reads from a single Kafka cluster
ksqlDB:
● Stream processing only
● Supports reads from a single Kafka cluster

Attribute: State recovery
CP Flink and CC Flink:
● Restore state after failure from the most recent incremental snapshot
Kafka Streams:
● Restore state after failure by replaying all messages
ksqlDB:
● Restore state after failure by replaying all messages

Attribute: Confluent deployment model
CP Flink:
● Self-managed offering with Confluent Platform
CC Flink:
● Fully managed
● No cluster deployment, scales to zero
Kafka Streams:
● Self-managed
● Embeddable client library with no cluster
ksqlDB:
● Fully managed and self-managed
● Separate cluster deployment

Attribute: Language flexibility
CP Flink:
● Full support of all Flink APIs (SQL, Table API, DataStream, ProcessFunction)
CC Flink:
● ANSI-compliant SQL
● Java UDFs (Early Access)
● Table API (Open Preview)
Kafka Streams:
● Java (more flexible than SQL, but more complex)
ksqlDB:
● SQL syntax inspired by ANSI SQL

We recommend Confluent Cloud for Apache Flink for all new cloud workloads
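The comparison contrasts two ways of restoring state after a failure: from the most recent snapshot, or by replaying all messages. Here is a minimal Python sketch of why that matters, using a per-key counter as the state. It is a toy model only. Real snapshots are incremental and distributed, and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def apply_messages(state: Counter, messages: List[str]) -> Counter:
    """Fold messages into the state: here, a count per key."""
    for key in messages:
        state[key] += 1
    return state


def restore_by_replay(log: List[str]) -> Tuple[Counter, int]:
    """Rebuild state from scratch; returns (state, messages processed)."""
    return apply_messages(Counter(), log), len(log)


def restore_from_snapshot(
    snapshot: Dict[str, int], snapshot_offset: int, log: List[str]
) -> Tuple[Counter, int]:
    """Start from saved state and replay only messages after the snapshot."""
    tail = log[snapshot_offset:]
    return apply_messages(Counter(snapshot), tail), len(tail)


log = ["a", "b", "a", "c", "a"]
snapshot = {"a": 2, "b": 1}  # state after the first 3 messages

full_state, full_cost = restore_by_replay(log)
snap_state, snap_cost = restore_from_snapshot(snapshot, 3, log)
print(full_state == snap_state, full_cost, snap_cost)  # True 5 2
```

Both paths arrive at the same state. The difference is the work needed to get there, which grows with the full log for replay but only with the messages since the last snapshot otherwise.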