Abstract
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow.
Apache Beam handles both batch and streaming use cases, offering a powerful, unified model. It neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (including Apache Apex, Apache Flink, Apache Gearpump, and Apache Spark) and proprietary. Finally, Beam's model enables newer optimizations, such as dynamic work rebalancing and autoscaling, that result in efficient execution.
This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts of its powerful programming model. We'll show how Beam unifies batch and streaming use cases, and demonstrate efficient execution in real-world scenarios. Finally, we'll demonstrate pipeline portability across Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow in a live setting.
This session is a Technical (Intermediate) talk in our IoT and Streaming track. It focuses on Apache Flink, Apache Kafka, Apache Spark, and Cloud, and is geared towards Architect, Data Scientist, and Developer / Engineer audiences.
Unified, Efficient and
Portable Data Processing
with Apache Beam
Davor Bonaci
PMC Chair, Apache Beam
Software Engineer, Google Inc.
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming
algorithms using one unified API
● Cleanly separates data processing logic
from runtime requirements
● Supports execution on multiple distributed
processing runtime environments
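A minimal sketch of what such a pipeline looks like in the Beam Java SDK is shown below; the input/output paths and the "user,score" record format are hypothetical placeholders, not part of the original slides.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalScores {
  public static void main(String[] args) {
    // Options come from the command line, so the same code can later target any runner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.read().from("gs://my-bucket/scores/*.csv"))       // hypothetical input
     .apply(ParDo.of(new DoFn<String, KV<String, Integer>>() {       // parse "user,score" lines
       @ProcessElement
       public void processElement(ProcessContext c) {
         String[] parts = c.element().split(",");
         c.output(KV.of(parts[0], Integer.parseInt(parts[1])));
       }
     }))
     .apply(Sum.integersPerKey())                                     // data-parallel per-key sum
     .apply(MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Integer> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply(TextIO.write().to("gs://my-bucket/output/scores"));       // hypothetical output

    p.run().waitUntilFinish();
  }
}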
The evolution of Apache Beam
[Diagram: Apache Beam and Cloud Dataflow evolved from a decade of Google data-processing systems — MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and MillWheel.]
Agenda
1. Expressing data-parallel pipelines with the Beam model
2. The Beam vision for portability
3. Parallel and portable pipelines in practice
Apache Beam is
a unified programming model
designed to provide
portable data processing pipelines
(efficient too)
Expressing
data-parallel pipelines
with the Beam model
A unified model for batch and
streaming
Processing time vs. event time
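In Beam terms (a hedged sketch, not from the slides): an element's event time is the timestamp carried with the element, while processing time is simply the wall-clock time at which a worker happens to see it.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// Sketch: a DoFn that can observe both notions of time for each element it processes.
class ObserveTimes extends DoFn<KV<String, Integer>, KV<String, Integer>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Instant eventTime = c.timestamp();        // when the event occurred (attached to the element)
    Instant processingTime = Instant.now();   // when this worker is processing the element
    c.output(c.element());                    // pass the element through unchanged
  }
}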
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: What is being computed?
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply(Sum.integersPerKey());
The Beam Model: Where in event time?
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
The Beam Model: When in processing time?
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AtWatermark()
.withEarlyFirings(
AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
The Beam Model: How do refinements relate?
The Beam Model: How do refinements relate?
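The snippets above use the shorthand trigger names from the Beam model talks (AtWatermark, AtPeriod, AtCount). A hedged sketch of the same What / Where / When / How choices written against the actual Beam Java SDK follows; it assumes an existing input PCollection, and the allowed-lateness value is an illustrative assumption.

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<KV<String, Integer>> scores = input
    // Where in event time: fixed two-minute windows.
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // When in processing time: at the watermark, with early firings every
        // minute and a late firing for each late element.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        // Illustrative assumption: keep windows around for one day of lateness.
        .withAllowedLateness(Duration.standardDays(1))
        // How refinements relate: later panes accumulate earlier ones.
        .accumulatingFiredPanes())
    // What is being computed: a per-key integer sum.
    .apply(Sum.integersPerKey());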
Customizing What / Where / When / How
[Figure: the same input rendered four ways — 1. Classic batch, 2. Windowed batch, 3. Streaming, 4. Streaming + accumulation.]
The Beam vision for
portability
“Write once, run anywhere”
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: Users write their pipelines in a language that’s familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: language-specific SDKs (Language A / B / C) sit on top of the Beam Model, which targets multiple runners (Runner 1 / 2 / 3).]
Beam Vision: as of April 2017
● Beam’s Java SDK runs on multiple runtime environments, including:
○ Apache Apex
○ Apache Spark
○ Apache Flink
○ Google Cloud Dataflow
○ [in development] Apache Gearpump
● Cross-language infrastructure is in progress.
○ Beam’s Python SDK currently runs on Google Cloud Dataflow
[Diagram: Java and Python pipelines are built against the Beam Model (pipeline construction) and executed through Fn Runners on Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Cloud Dataflow.]
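Because the pipeline code does not change per runner, switching runtime environments is mostly a matter of adding the chosen runner module to the classpath and passing different pipeline options. The sketch below illustrates this with standard runner flags; the project and bucket names are hypothetical.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunAnywhere {
  public static void main(String[] args) {
    // The same main() runs everywhere; only the command-line arguments change, e.g.:
    //   local testing:  --runner=DirectRunner
    //   Apache Flink:   --runner=FlinkRunner --flinkMaster=<host:port>
    //   Apache Spark:   --runner=SparkRunner
    //   Apache Apex:    --runner=ApexRunner
    //   Cloud Dataflow: --runner=DataflowRunner --project=<my-gcp-project>
    //                   --tempLocation=gs://<my-bucket>/tmp
    // A runner name resolves only if the corresponding runner module is on the classpath.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    // ... construct the pipeline exactly as in the earlier snippets ...
    p.run().waitUntilFinish();
  }
}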
Example Beam Runners
Apache Spark
● Open-source
cluster-computing
framework
● Large ecosystem of
APIs and tools
● Runs on-premises or in
the cloud
Apache Flink
● Open-source
distributed data
processing engine
● High-throughput and
low-latency stream
processing
● Runs on-premises or in
the cloud
Google Cloud Dataflow
● Fully-managed service
for batch and stream
data processing
● Provides dynamic
auto-scaling,
monitoring tools, and
tight integration with
Google Cloud
Platform
How do you build an abstraction layer?
[Diagram: Apache Spark, Apache Flink, and Cloud Dataflow side by side, with question marks where a common abstraction layer would sit.]
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Categorizing Runner Capabilities
https://beam.apache.org/documentation/runners/capability-matrix/
Parallel and portable
pipelines in practice
Demo and Use Case
Demo!
Getting Started with Apache Beam
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Extensive documentation
Related sessions
Hadoop Summit San Jose 2016
● Apache Beam: A Unified Model for Batch and Streaming Data Processing
○ Speaker: Davor Bonaci
Hadoop Summit Melbourne 2016
● Stream/Batch processing portable across on-premise and Cloud with Apache Beam
○ Speaker: Eric Anderson
DataWorks Summit San Jose 2017
● Realizing the promise of portable data processing with Apache Beam
○ Speaker: Davor Bonaci
● Stateful processing of massive out-of-order streams with Apache Beam
○ Speaker: Kenneth Knowles
Apache Beam is
a unified programming model
designed to provide
portable data processing pipelines
(efficient too)
Learn more!
Apache Beam
https://beam.apache.org
Join the Beam mailing lists!
user-subscribe@beam.apache.org
dev-subscribe@beam.apache.org
Follow @ApacheBeam on Twitter
Demo screenshots
because if I make them, I won’t
need to use them