Imre Nagi
Traveloka Data
@imrenagi
Jakarta
Google Cloud Dataflow
Unified Model for Stream and Batch
processing
Imre Nagi
Ping me @imrenagi
Previously:
Software Engineer @CERN @eBay Inc
Currently:
Software Engineer @Traveloka Data
Docker Community Leader, Indonesia
Agenda
What is Dataflow?
Dataflow Abstraction
Dataflow Common Pipeline
Stream Analytics
What is
Dataflow?
Apache Beam
A set of SDKs that define the programming model you use to build stream and batch processing pipelines
Cloud Dataflow
A fully managed distributed service that runs and optimizes your Beam pipeline
Dataflow for ...
ETL
● Move
● Filter
● Enrich
I/O Operations
● Connect to Cloud Pub/Sub
● Read from and write to BigQuery, Bigtable, etc.
Analytics
● Stream computing
● Batch computing
● Machine learning
Unified Programming Model
Unified: Stream & Batch Pipeline
Open Source:
Java SDK
Python SDK
Go SDK (New)
Unified Model (Streaming)
Source: Cloud Pub/Sub → Processing: Cloud Dataflow (Streaming) → Data Store: Cloud BigQuery
Unified Model (Streaming & Backup)
[Diagram: Cloud Pub/Sub, Cloud Storage, Cloud Dataflow (Streaming), and Cloud BigQuery arranged as Source → Processing → Data Store]
Unified Model (Batch)
Source: Cloud Storage → Processing: Cloud Dataflow (Batch) → Data Store: Cloud BigQuery
Dataflow
Abstraction
Beam Pipeline
Represents a graph of data processing transformations
PCollections flow through the pipeline
Can have multiple I/O sources and sinks at the beginning and end of the pipeline
// Define the pipeline option
PipelineOptions options = PipelineOptionsFactory.create();
// Create the pipeline
Pipeline p = Pipeline.create(options);
Data Model
PCollection<T> is a collection of elements of type T
May be bounded or unbounded in size
Each element may have an implicit or explicit timestamp
// Create the PCollection 'lines' by applying a 'Read' transform to a local file.
PCollection<String> lines = p.apply(TextIO.read().from("/path/to/some/inputData.txt"));
// Read every object that matches a Cloud Storage glob.
PCollection<String> linesGCS = p.apply(TextIO.read().from("gs://deeptech/*"));
// In-memory data, declared as a class-level constant.
static final List<String> LINES = Arrays.asList(
    "This is the first line",
    "You will say this one is the second",
    "But it's not. ");
// Generate a PCollection from in-memory data.
PCollection<String> linesInMemory = p.apply(Create.of(LINES).withCoder(StringUtf8Coder.of()));
// Generate a bounded PCollection holding the numbers 0 to 999.
PCollection<Long> bounded = p.apply(GenerateSequence.from(0).to(1000));
// Generate an unbounded PCollection that counts up from 0.
PCollection<Long> unbounded = p.apply(GenerateSequence.from(0));
PTransform: Transforming the Data
// A DoFn that greets each input name.
public class HelloDoFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    String name = context.element();
    context.output("Hello, " + name + " ! ");
  }
}
// A DoFn that maps each string to its length.
public class StringToLongDoFn extends DoFn<String, Long> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    String name = context.element();
    // Cast explicitly: an int does not autobox to Long.
    context.output((long) name.length());
  }
}
// Sum the integer values for each key; 'input' is a PCollection<KV<String, Integer>>.
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
I/O Transform
Dataflow Common
Pipeline
Linear Pipeline
Combining Multiple PCollections
Producing Multiple PCollections
Multiple Transformations for a PCollection
Joining PCollections
Stream
Analytics
Data...
...can be big...
...bigger and bigger (Tuesday, Wednesday, Thursday)...
...maybe infinitely big...
[Figure: timeline from 8:00 to 14:00]
Oops... unknown delays
[Figure: timeline from 8:00 to 14:00 with several elements labeled 8:00]
The Lambda Architecture holds that stream processing alone can't produce accurate analytics results, so batch processing is needed to correct the inaccuracy of the streaming results.
Grouping via processing-time windows
[Figure: processing-time axis from 8:00 to 14:00 with a sum (∑) for each window; two elements labeled 8:00]
Grouping via event-time windows
[Figure: Input and Output panels plotted against processing time and event time, each axis from 10:00 to 15:00, with a sum (∑) for each window]
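The difference between the two groupings can be sketched in plain Java. This is only an illustration, not Beam code: the `Event` class, the hour-sized windows, and the sample values are all assumptions made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration; these names are hypothetical and not part of Beam.
final class TimeDomainDemo {

    // An event with the hour it occurred, the hour it arrived, and a value.
    static final class Event {
        final int eventHour;
        final int arrivalHour;
        final int value;

        Event(int eventHour, int arrivalHour, int value) {
            this.eventHour = eventHour;
            this.arrivalHour = arrivalHour;
            this.value = value;
        }
    }

    // Sums values per one-hour window, keyed by when each event occurred.
    static Map<Integer, Integer> sumByEventTime(List<Event> events) {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(e.eventHour, e.value, Integer::sum);
        }
        return sums;
    }

    // Sums values per one-hour window, keyed by when each event arrived.
    static Map<Integer, Integer> sumByProcessingTime(List<Event> events) {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(e.arrivalHour, e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(8, 8, 5),
            new Event(8, 9, 3),    // arrived one hour late
            new Event(9, 9, 7),
            new Event(8, 12, 4));  // arrived four hours late
        System.out.println("By event time:      " + sumByEventTime(events));
        System.out.println("By processing time: " + sumByProcessingTime(events));
    }
}
```

With these sample events, the 8:00 event-time window sums all three 8:00 values, while processing-time grouping spreads them over the 8:00, 9:00, and 12:00 windows.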
What is windowing?
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
[Figure: fixed and sliding windows for Key 1, Key 2, and Key 3 along a time axis]
A windowing function computes which window(s) an element belongs to. Temporal functions can be parameterized with duration and frequency.
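As a rough sketch of what such a windowing function computes, the plain-Java helpers below assign a timestamp to its fixed window and to every sliding window that contains it. The class and method names are hypothetical, timestamps are plain `long` values in arbitrary units, and `size` and `period` stand in for the duration and frequency mentioned above. Beam's own implementation may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration; these names are hypothetical and not part of Beam.
final class WindowAssignDemo {

    // Start of the fixed window [start, start + size) that contains the timestamp.
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Starts of every sliding window [start, start + size) that contains the
    // timestamp, where a new window begins every `period` units.
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long latestStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = latestStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // One element at t = 125 with 60-unit fixed windows.
        System.out.println(fixedWindowStart(125, 60));
        // The same element with 120-unit windows that start every 60 units.
        System.out.println(slidingWindowStarts(125, 120, 60));
    }
}
```

An element lands in exactly one fixed window, but in `size / period` sliding windows when the period divides the size evenly.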
What about data-dependent windowing?
Sessions
[Figure: session windows 1 to 4 along a time axis]
Unique per key - you can't know a priori when a session ends, so the windowing function is now also parameterized by state.
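A minimal sketch of gap-based session windowing for a single key follows, in plain Java. It assumes each event opens a proto-window `[t, t + gap)` and that overlapping proto-windows merge into one session. The names are hypothetical, and the "state" here is simply the events already seen for the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; these names are hypothetical and not part of Beam.
final class SessionDemo {

    // Groups one key's event timestamps into sessions [start, end). A new
    // session starts whenever the gap to the previous event is at least `gap`.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        if (sorted.length == 0) {
            return result;
        }
        long start = sorted[0];
        long last = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] - last >= gap) {
                // The previous session ends one gap after its last event.
                result.add(new long[] {start, last + gap});
                start = sorted[i];
            }
            last = sorted[i];
        }
        result.add(new long[] {start, last + gap});
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {22, 1, 50, 3, 20}, 10)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

Because a later event can extend or merge sessions, the end of a session is only known once no further events for that key can arrive within the gap.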
PCollection<KV<String, Integer>> scores = input.apply(
Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply(Sum.integersPerKey());
Windowing specifies where events are aggregated in event time,
but when are events emitted in processing time?
Triggers
A trigger is a mechanism for declaring when the output for a window should be materialized relative to some external signal.
Triggers provide flexibility in choosing when outputs should be emitted.
They also make it possible to observe the output for a window multiple times as it evolves.
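To make "multiple times as it evolves" concrete, the plain-Java sketch below labels each firing of one window as early, on time, or late by comparing the watermark at firing time with the end of the window. The names are hypothetical and the watermark values are assumed to be non-decreasing. It is an illustration of the idea, not of how a runner implements triggers.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration; these names are hypothetical and not part of Beam.
final class PaneTimingDemo {

    enum Timing { EARLY, ON_TIME, LATE }

    // Labels each firing of a window ending at windowEnd. The first firing at
    // or after the watermark reaches windowEnd is on time; later ones are late.
    static List<Timing> classify(long windowEnd, long[] watermarkAtFiring) {
        List<Timing> timings = new ArrayList<>();
        boolean onTimeFired = false;
        for (long watermark : watermarkAtFiring) {
            if (watermark < windowEnd) {
                timings.add(Timing.EARLY);
            } else if (!onTimeFired) {
                timings.add(Timing.ON_TIME);
                onTimeFired = true;
            } else {
                timings.add(Timing.LATE);
            }
        }
        return timings;
    }

    public static void main(String[] args) {
        // A window ending at t = 120 that fires four times.
        System.out.println(classify(120, new long[] {60, 90, 120, 150}));
    }
}
```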
Windowed summation on a streaming engine with perfect (left) and heuristic
(right) watermarks.
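// Beam's Java SDK additionally requires an allowed lateness and an accumulation mode whenever a trigger is set.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // Fire when the watermark passes the end of the window,
        .triggering(AfterWatermark.pastEndOfWindow()
            // early, one minute after the first element of each pane,
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            // and again for every late element.
            .withLateFirings(AfterPane.elementCountAtLeast(1))))
    .apply(Sum.integersPerKey());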
Windowed summation on a streaming engine with early and late
firings.
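PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Keep window state for one more minute after the watermark passes, then drop later data.
        .withAllowedLateness(Duration.standardMinutes(1))
        // Each firing emits everything seen so far for the window.
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());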
Windowed summation on a streaming engine with early and late firings
and allowed lateness
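The effect of allowed lateness on a single element can be sketched in plain Java. The sketch assumes a window stays open until the watermark reaches the end of the window plus the allowed lateness. The names are hypothetical, times are in seconds, and the exact boundary handling is simplified.

```java
// Plain-Java illustration; names and the exact boundary handling are assumptions.
final class LatenessDemo {

    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    // Decides what happens to an element of a window ending at windowEnd,
    // given the watermark at the moment the element arrives.
    static Fate fateOf(long windowEnd, long allowedLateness, long watermarkOnArrival) {
        if (watermarkOnArrival < windowEnd) {
            return Fate.ON_TIME;
        }
        if (watermarkOnArrival < windowEnd + allowedLateness) {
            return Fate.LATE_BUT_ACCEPTED;
        }
        return Fate.DROPPED;
    }

    public static void main(String[] args) {
        long windowEnd = 120;       // two-minute window [0, 120)
        long allowedLateness = 60;  // one minute
        System.out.println(fateOf(windowEnd, allowedLateness, 100));
        System.out.println(fateOf(windowEnd, allowedLateness, 150));
        System.out.println(fateOf(windowEnd, allowedLateness, 200));
    }
}
```

Bounding lateness this way lets the runner discard window state eventually instead of keeping it forever.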
Accumulating Mode
First trigger firing: [5, 8, 3]
Second trigger firing: [5, 8, 3, 15, 19, 23]
Third trigger firing: [5, 8, 3, 15, 19, 23, 9, 13, 10]
Discarding Mode
First trigger firing: [5, 8, 3]
Second trigger firing: [15, 19, 23]
Third trigger firing: [9, 13, 10]
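The two modes can be reproduced with a few lines of plain Java. This is a sketch rather than Beam code, with hypothetical names: each inner list holds the elements that arrived between two firings, and the only difference between the modes is whether the buffer is cleared after a firing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; these names are hypothetical and not part of Beam.
final class AccumulationDemo {

    // Returns what each trigger firing emits. Each inner list of `arrivals`
    // holds the elements that arrived since the previous firing.
    static List<List<Integer>> panes(List<List<Integer>> arrivals, boolean accumulating) {
        List<List<Integer>> fired = new ArrayList<>();
        List<Integer> buffer = new ArrayList<>();
        for (List<Integer> batch : arrivals) {
            buffer.addAll(batch);
            fired.add(new ArrayList<>(buffer));
            if (!accumulating) {
                buffer.clear();  // discarding mode forgets what was already emitted
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<List<Integer>> arrivals = Arrays.asList(
            Arrays.asList(5, 8, 3),
            Arrays.asList(15, 19, 23),
            Arrays.asList(9, 13, 10));
        System.out.println("Accumulating: " + panes(arrivals, true));
        System.out.println("Discarding:   " + panes(arrivals, false));
    }
}
```

Accumulating panes suit sinks that overwrite the previous result, while discarding panes suit consumers that add the partial results together themselves.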
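// As with any triggered window in Beam's Java SDK, an allowed lateness must also be set before this runs.
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Each firing emits only the elements that arrived since the previous firing.
        .discardingFiredPanes())
    .apply(Sum.integersPerKey());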
Discarding mode version of early/late firings on a streaming engine
Credits to:
1. http://streamingsystems.org/Presentations/Jelena%20Pjesivac-grbovic.pdf
2. Stream Analytics with Google Cloud Dataflow: Use Cases & Patterns, Gaurav Anand
3. Streaming 101 & 102, Tyler Akidau
4. https://streamingbook.net
5. Apache Beam Documentation
The Google Slides version of this deck is available at:
https://docs.google.com/presentation/d/1Ws73JxlVH39HiKiYuF3vW903j8wFzxPQihXz4CQ_HZM/edit?usp=sharing
GDG Jakarta Meetup - Streaming Analytics With Apache Beam