Unify Stream and Batch Processing using Dataflow, a Portable Programmable Model from Google

Google Cloud Dataflow
Eric Schmidt, Product Manager
cloude@google.com

Goals
You leave here understanding the fundamentals of Cloud Dataflow, and possibly having drawn some comparisons to existing data processing models.
We have some fun.

Time to answer some questions
1M Devices
16.6K Events/sec
43B Events/month
518B Events/year
• What was the average listening time over the past 7 days, compared to the same period last year?
• What songs are most likely to follow a song?
• How many active listeners did I have in the last minute?
• How many sales were made in the last hour due to advertising conversion, and what was the geographic source of the users?
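The volume figures on this slide are consistent with a simple back-of-the-envelope model. The sketch below assumes each device emits one event per minute, a month is 30 days, and a year is 12 such months; none of these assumptions is stated in the deck, and the class and method names are made up for illustration.

```java
// Back-of-the-envelope check of the quoted event volumes.
// ASSUMPTION (not stated in the slides): one event per device per minute,
// a 30-day month, and a year counted as 12 such months.
final class EventVolume {
    static final long DEVICES = 1_000_000L;

    // Events per second across all devices.
    static double eventsPerSecond() {
        return DEVICES / 60.0;
    }

    // Events per 30-day month: devices x 60 min x 24 h x 30 days.
    static long eventsPerMonth() {
        return DEVICES * 60L * 24L * 30L;
    }

    // Events per year, counted as 12 thirty-day months.
    static long eventsPerYear() {
        return eventsPerMonth() * 12L;
    }

    public static void main(String[] args) {
        System.out.printf("%.1fK events/sec%n", eventsPerSecond() / 1e3);
        System.out.printf("%.1fB events/month%n", eventsPerMonth() / 1e9);
        System.out.printf("%.1fB events/year%n", eventsPerYear() / 1e9);
    }
}
```

Under these assumptions the monthly and yearly totals come to 43.2 billion and 518.4 billion, in line with the 43B and 518B on the slide, and the per-second rate is about 16,667.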

The reality of Big Data elasticity & business
❯ Time & life never stop
❯ Data rates & schema are not static
❯ Scaling models are not static
❯ Non-elastic compute is wasteful and can create lag

Let’s build something

The tense quadrachotomy of Big Data
Accuracy
Speed
Cost Control
Complexity
Time to Answer

Management
Mobile
Developer Tools
Compute
Networking
Big Data
Storage

Before Google Cloud Dataflow
Batch OR Stream
Accuracy OR Speed
Simplicity OR Sophistication
Savings OR Scalability

Cloud Dataflow
Cloud Dataflow is a collection of SDKs for building batch or streaming parallelized data processing pipelines.
Cloud Dataflow is a fully managed service for executing optimized parallelized data processing pipelines.

After Google Cloud Dataflow
Batch AND Stream
Accuracy AND Speed
Simplicity AND Sophistication
Savings AND Scalability

Where might you use Cloud Dataflow?
ETL
• Movement
• Filtering
• Enrichment
• Shaping
Analysis
• Reduction
• Batch computation
• Continuous computation
Orchestration
• Composition
• External orchestration
• Simulation

Benefits of Cloud Dataflow
❯ No Ops - truly elastic data processing for the cloud
• On demand resource allocation w/intelligent auto-scaling
• Automated worker lifetime management
• Automated work optimization
❯ Unified model - for batch & stream based processing
• Functional programming model
• Fine grained correctness primitives
❯ Open sourced - SDK @ GitHub
• Java 7 today @ /GoogleCloudPlatform/DataflowJavaSDK
• Python 2 in progress
• Scala @ /darkjh/scalaflow & /jhlch/scala-dataflow-dsl
• Spark runner @ /cloudera/spark-dataflow
• Flink runner @ /dataArtisans/flink-dataflow

Release Timeline
• June 24, 2014: Early Access Preview at Google I/O
• Dec. 17, 2014: Alpha
• Apr. 15, 2015: Beta
• Next up GA!

[Diagram, built up over four slides: 1M Devices (16.6K Events/sec, 43B Events/month, 518B Events/year) with Cloud Pub/Sub, Cloud Dataflow (Cloud Dataflow SDK), and BigQuery, answering "How many active listeners did I have in the last minute?"]

Big Data on Google Cloud
Fully Managed, No-Ops Services
BigQuery: Ingest data at 100,000 rows per second
Dataflow: Stream & batch processing, unified and simplified
Pub/Sub: Scalable, flexible, and globally available messaging

Let’s build something - Demo!
Create a globally available queue
Create a dataset for massive scale ingest and query execution
Create a pipeline and run a Dataflow job

Google BigQuery
[Diagram labels: Your Data, Fast ETL, Regex, JSON, Spreadsheets, BI Tools, Coworkers]
• Scales into Petabytes
• I/O of TBs in seconds
• 100,000 rows/sec per table Streaming API
• Simple data ingest from GCS or Hadoop
• Connect to R, Pandas, Hadoop, Dataflow, etc.
• Row level security and data expiration

Cloud Pub/Sub
• Globally redundant
• Low latency ~100ms
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
[Diagram: Publishers A, B, and C publish Messages 1, 2, and 3 to Topics A, B, and C in Cloud Pub/Sub; Subscriptions XA, XB, YC, and ZC deliver them to Subscribers X, Y, and Z.]
Cloud Dataflow SDK Release Process
[Diagram labels: weekly, monthly]

Cloud Dataflow SDK - Logical Model
Pipeline{                        <- At once guarantee (modulo correctness thresholds)
Who => Inputs                    <- GCS, Pub/Sub, BigQuery, Bigtable, custom
What => Transforms               <- Aggregations, Filters, Joins, ...
Where => Windows                 <- Time space: Fixed, Sliding, Sessions, ...
When => Watermarks + Triggers    <- Correctness
To => Outputs                    <- GCS, Pub/Sub, BigQuery, Bigtable, custom
}

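Some Code
Java 7 Implementation
// Create the pipeline. OptionsBuilder appears to be the demo's own helper, not an SDK class.
Pipeline p = Pipeline.create(
    OptionsBuilder.RunOnService(true, false));
// Read the raw event dump as lines of text.
PCollection<String> rawData = p.begin().apply(TextIO.Read
    .from(OptionsBuilder.GCS_RAWDUMP_URI));
// Parse each line into a PlaybackEvent.
PCollection<PlaybackEvent> events = rawData.apply(
    new ParseTransform());
// Feed the same parsed events into three independent composite transforms.
events.apply(new ArchiveTransform());
events.apply(new SessionAnalysisTransform());
events.apply(new AssetTransform());
// Run the pipeline.
p.run();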

PCollections
Cloud Dataflow SDK
❯ A collection of data of type T in a pipeline - a “hippie cousin” of an RDD
❯ May be either bounded or unbounded in size
❯ Created by using a PTransform to:
• Build from a java.util.Collection
• Read from a backing data store
• Transform an existing PCollection
❯ Often contains key-value pairs using KV
{Seahawks, NFC, Champions, Seattle, ...}
{..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ..., “#GoHawks”, ...}

ParDo (“Parallel Do”)
Cloud Dataflow SDK
❯ Processes each element of a PCollection independently using a user-provided DoFn
❯ Elements are processed in arbitrary ‘bundles’, e.g. “shards”
• startBundle(), processElement() - N times, finishBundle()
❯ Corresponds to both the Map and Reduce phases in Hadoop, i.e. ParDo->GBK->ParDo
KeyBySessionId
{Seahawks, NFC, Champions, Seattle, ...}
{KV<S, Seahawks>, KV<C, Champions>, KV<S, Seattle>, KV<N, NFC>, ...}
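The bundle lifecycle named on this slide (startBundle(), then processElement() once per element, then finishBundle()) can be pictured with a plain-Java stand-in. This is only a sketch: the class below is hypothetical and does not use the SDK's DoFn type, and keying by first letter simply mirrors the Seahawks example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the DoFn lifecycle; these are NOT the SDK classes.
final class BundleLifecycleSketch {

    // Hypothetical mini-DoFn: keys each word by its first letter, e.g. "S=Seahawks".
    static final class KeyByFirstLetter {
        final List<String> log = new ArrayList<>();
        final List<String> output = new ArrayList<>();

        void startBundle() { log.add("start"); }

        void processElement(String word) {
            log.add("process");
            output.add(word.charAt(0) + "=" + word);
        }

        void finishBundle() { log.add("finish"); }
    }

    // Runs one bundle: startBundle once, processElement per element, finishBundle once.
    static KeyByFirstLetter runBundle(List<String> bundle) {
        KeyByFirstLetter fn = new KeyByFirstLetter();
        fn.startBundle();
        for (String element : bundle) {
            fn.processElement(element);
        }
        fn.finishBundle();
        return fn;
    }

    public static void main(String[] args) {
        KeyByFirstLetter fn = runBundle(
            Arrays.asList("Seahawks", "NFC", "Champions", "Seattle"));
        System.out.println(fn.output);
        System.out.println(fn.log);
    }
}
```

Because each element is handled independently, a runner is free to split the input into bundles however it likes and process them in parallel.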

GroupByKey
Cloud Dataflow SDK
• Takes a PCollection of key-value pairs and gathers up all values with the same key
• Corresponds to the shuffle phase in Hadoop
{KV<S, {Seahawks, Seattle, …}>, KV<N, {NFC, …}>, KV<C, {Champions, …}>}
Wait a minute…
How do you do a GroupByKey on an unbounded PCollection?
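For a bounded input, the grouping described on this slide can be sketched in plain Java. This is an illustration of the semantics only, not the SDK transform; the class name is made up, and keying by first letter follows the Seahawks example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of GroupByKey semantics; not the SDK transform.
final class GroupByKeySketch {

    // Groups words by their first letter, gathering all values that share a key.
    static Map<Character, List<String>> groupByFirstLetter(List<String> words) {
        Map<Character, List<String>> grouped = new TreeMap<>();
        for (String word : words) {
            char key = word.charAt(0);
            List<String> values = grouped.get(key);
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(key, values);
            }
            values.add(word);
        }
        return grouped;
    }

    public static void main(String[] args) {
        System.out.println(groupByFirstLetter(
            Arrays.asList("Seahawks", "NFC", "Champions", "Seattle")));
    }
}
```

The loop only finishes because the input list ends. With an unbounded input there is no point at which "all values with the same key" have been seen, which is why the grouping needs a finite scope.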

Windows
Cloud Dataflow SDK
❯ Logically divides up or groups the elements of a PCollection into finite windows
• Fixed Windows: hourly, daily, …
• Sliding Windows
• Sessions
❯ Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections
❯ Window.into() can be called at any point in the pipeline and will be applied when needed
❯ Can be tied to arrival/processing time or custom event time
❯ Watermarks + Triggers enable robust correctness
[Timeline figure: Nighttime, Mid-Day, Nighttime]
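One way to picture fixed and sliding windows is as a function from an element's timestamp to the window, or windows, it belongs to. The sketch below is plain Java under that assumption; it is not the SDK's windowing code, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of window assignment by timestamp; not the SDK's windowing classes.
final class WindowSketch {

    // Start of the fixed window containing timestampMs (non-negative timestamps assumed).
    static long fixedWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Starts of all sliding windows of length sizeMs, spaced periodMs apart,
    // that contain timestampMs. One element can land in several windows.
    static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - (timestampMs % periodMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // An event 90 seconds in falls into the fixed one-minute window starting at 60s.
        System.out.println(fixedWindowStart(90_000L, minute));
        // With two-minute windows sliding every minute, it falls into two windows.
        System.out.println(slidingWindowStarts(90_000L, 2 * minute, minute));
    }
}
```

Session windows do not fit this shape, since their boundaries depend on gaps between neighbouring elements rather than on a single timestamp.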

Cloud Dataflow Service
Managing with correctness
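.apply(Window.<KV<String, PlaybackEvent>>into(
        // One-minute fixed windows.
        FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(
        AfterEach.inOrder(
            // First, fire once when the watermark passes the end of the window.
            AfterWatermark.pastEndOfWindow(),
            // Then, for late data, fire 10 minutes (processing time) after
            // the first element arrives in each new pane...
            Repeatedly.forever(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))
                // ...until the watermark is 2 days past the end of the window.
                .orFinally(AfterWatermark
                    .pastEndOfWindow()
                    .plusDelayOf(Duration.standardDays(2)))))
    // Each pane carries only the elements that arrived since the last firing.
    .discardingFiredPanes());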

Composite PTransforms
Cloud Dataflow SDK
❯ Define new PTransforms by building up subgraphs of existing transforms
❯ Some utilities are included in the SDK
• Count, RemoveDuplicates, Join, Min, Max, Sum, ...
❯ You can define your own:
• DoSomething, DoSomethingElse, etc.
❯ Why bother?
• Code reuse
• Better monitoring experience
[Diagram: Count is composed of Pair With Ones, GroupByKey, and Sum Values]
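The Count example on this slide (Pair With Ones, GroupByKey, Sum Values) can be sketched as three small functions wrapped in one. This is plain Java for illustration, assuming the steps run in that order; it does not use the SDK's PTransform classes, and all names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a composite "Count" built from three smaller steps.
final class CountSketch {

    // Step 1: pair every element with the number 1.
    static List<Map.Entry<String, Integer>> pairWithOnes(List<String> words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : words) {
            pairs.add(new SimpleEntry<String, Integer>(word, 1));
        }
        return pairs;
    }

    // Step 2: group the ones by key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            if (!grouped.containsKey(pair.getKey())) {
                grouped.put(pair.getKey(), new ArrayList<Integer>());
            }
            grouped.get(pair.getKey()).add(pair.getValue());
        }
        return grouped;
    }

    // Step 3: sum the grouped values for each key.
    static Map<String, Integer> sumValues(Map<String, List<Integer>> grouped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int total = 0;
            for (int value : entry.getValue()) {
                total += value;
            }
            sums.put(entry.getKey(), total);
        }
        return sums;
    }

    // The composite: callers see a single "count" step built from the three above.
    static Map<String, Integer> count(List<String> words) {
        return sumValues(groupByKey(pairWithOnes(words)));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("NFC", "Seahawks", "NFC")));
    }
}
```

Callers only ever touch count(), which is the code-reuse benefit the slide mentions; the three inner steps stay available for reuse in other composites.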

GCP
Managed Service
User Code & SDK
Work Manager
Deploy & Schedule
Monitoring UI
Job Manager
Progress & Logs

Graph Optimization
● ParDo fusion
○ Producer Consumer
○ Sibling
○ Intelligent fusion boundaries
● Combiner lifting, e.g. partial aggregations before reduction
● Reshard placement
...
Legend: A, B, C, D = ParallelDo, GBK = GroupByKey, + = CombineValues
consumer-producer: C D => C+D
sibling: C D => C+D
combiner lifting: A GBK + B => A+ GBK + B
Worker Lifecycle Management
Deploy, Schedule & Monitor, Tear Down

Worker Scaling
Decreased Clock Time

Dynamic Work Rebalancing
100 mins. vs. 65 mins.

Optimizing Your Time To Answer
Typical Data Processing
• Programming
• Resource provisioning
• Performance tuning
• Monitoring
• Reliability
• Deployment & configuration
• Handling Growing Scale
• Utilization improvements
Data Processing with Cloud Dataflow
• Programming
• More time to dig into your data

Thank You!
cloud.google.com/dataflow
cloude@google.com
StackOverflow @ google+cloud+dataflow
