The Next Generation of Data
Processing & Open Source
James Malone, Google Product Manager, Apache Beam PPMC
Eric Schmidt, Google Developer Relations
Agenda
1. The Last Generation - Common historical challenges in large-scale data processing
2. The Next Generation - How large-scale data processing should work
3. Apache Beam - A solution for next generation data processing
4. Why Beam matters - A gaming example to show the power of the Beam model
5. Demo - Let's run a Beam pipeline on 3 engines in 2 separate clouds
6. Things to Remember - Recap and how you can get involved
01 The Last Generation
Common historical challenges in large-scale data processing
Setting up infrastructure
(Figure: steps shown are Decide on tool, Read docs, Get infrastructure, Set up tools, Tune tools, Productionize, Get specialists. Mood runs from Optimistic to Frustrated.)
Programming models
(Figure: a Batch model with a Batch use case, Batch engine, and Batch output sits beside a Streaming model with a Streaming use case, Streaming engine, and Streaming output, followed by a Join output step. Mood runs from Optimistic to Frustrated.)
Data pipeline portability
(Figure: three copies of the stack Data model / Data pipeline / Execution engine 1. Mood labels: Frustrated, Happy.)
Infrastructure is a pain
Models are disconnected
Pipelines are not portable
02 The Next Generation
How data processing should work
Infrastructure: was a pain, now an afterthought
Models: were disconnected, now unified
Pipelines: were not portable, now portable
Setting up infrastructure
(Figure: steps shown are Skim docs, Decide on product, Start service. Mood runs from Optimistic to Happy.)
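A flexible (unified) model
(Figure: a Batch use case and a Streaming use case share one Unified model, which executes on one or more Runner(s) and produces a single Output. Mood runs from Optimistic to Happy.)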
Portable data pipelines
(Figure: one Data model and one Data pipeline run on three Execution engines. Mood runs from Happy to Happier.)
Why does this matter?
More time can be dedicated to examining data for actionable insights (hands-on with data).
Less time is spent wrangling code, infrastructure, and tools used to process data (cloud setup and customization).
03 Apache Beam (incubating)
A solution for next generation data processing
What is Apache Beam?
1. The Beam programming model (unified stream + batch), formerly the Dataflow model
2. Java and Python SDKs
3. Runners for existing distributed processing backends
   a. Apache Flink (thanks to dataArtisans)
   b. Apache Spark (thanks to Cloudera & PayPal)
   c. Google Cloud Dataflow (fast, no-ops)
   d. Local (in-process) runner for testing
+ Future runners for Beam: Apache Gearpump, Apache Apex, MapReduce, and others
The Apache Beam vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
(Figure: Beam Java, Beam Python, and other languages sit on top of "Beam Model: Pipeline Construction". Below it, "Beam Model: Fn Runners" connects to Apache Flink, Apache Spark, and Google Cloud Dataflow, each labeled Execution.)
Joining several threads into Beam
(Figure: a lineage diagram covering MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Cloud Dataproc, and Apache Beam.)
Creating an Apache Beam community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
Learn - We (Google) are also learning a lot as this is our first data-related Apache contribution ;-)
Apache Beam Roadmap
02/01/2016 - Enter Apache Incubator
02/25/2016 - 1st commit to ASF repository
06/14/2016 - 1st incubating release
June 2016 - Python SDK moves to Beam
Early 2016 - Design for use cases, begin refactoring
Mid 2016 - Additional refactoring, non-production uses
Late 2016 - Multiple runners execute Beam pipelines
End 2016 - Beam pipelines run on many runners in production uses
04 Why Beam Matters
An example to show the power of the Beam model
Apache Beam - A next generation model
Improved abstractions let you focus on your business logic.
Batch and stream processing are both first-class citizens, so there is no need to choose.
Clearly separates event time from processing time.
Processing time vs. event time
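The distinction can be made concrete with a small plain-Java sketch. It is not Beam API code: `ScoreEvent`, its fields, and the timestamps are hypothetical, chosen to match the deck's gaming example. Event time is when a score happened on the device; processing time is when the pipeline observed it.

```java
import java.time.Duration;
import java.time.Instant;

class EventTimeDemo {
    /** Hypothetical event type for illustration; not a Beam SDK class. */
    static final class ScoreEvent {
        final String user;
        final int score;
        final Instant eventTime;      // when the score happened on the player's device
        final Instant processingTime; // when the pipeline actually observed the record

        ScoreEvent(String user, int score, Instant eventTime, Instant processingTime) {
            this.user = user;
            this.score = score;
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }

        /** Delay between the event occurring and the pipeline seeing it. */
        Duration skew() {
            return Duration.between(eventTime, processingTime);
        }
    }

    public static void main(String[] args) {
        // Made-up timestamps: the device was offline, so the score arrived 450 seconds late.
        Instant played = Instant.parse("2016-06-28T12:00:30Z");
        ScoreEvent late = new ScoreEvent("alice", 5, played, played.plusSeconds(450));
        System.out.println(late.user + " skew in seconds: " + late.skew().getSeconds());
    }
}
```

Grouping by `eventTime` keeps results correct even when records arrive late; grouping by `processingTime` would credit this score to the wrong interval.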
Beam model - asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
The Beam model - what is being computed?
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
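For readers without a Beam environment, the "what" step can be mimicked in plain Java. This is only an analogue of summing integers per key over an in-memory list; `SumPerKeyDemo` and the team names are made up, and nothing here is distributed or uses the Beam SDK.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SumPerKeyDemo {
    /** Sums integer values per key over an in-memory list of key/value pairs. */
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : input) {
            // merge() inserts the value for a new key or adds it to the running total.
            totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scores = List.of(
            Map.entry("red", 5), Map.entry("blue", 3), Map.entry("red", 4));
        System.out.println(sumPerKey(scores)); // prints {blue=3, red=9}
    }
}
```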
The Beam model - where in event time?
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
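Fixed two-minute windowing in event time can be sketched in plain Java as follows. This is an illustration under assumptions rather than Beam's implementation: the `Event` class and the sample timestamps (milliseconds) are hypothetical, and each window is identified by its start time.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FixedWindowDemo {
    static final long WINDOW_MILLIS = 2 * 60 * 1000L; // two-minute windows, as on the slide

    /** Hypothetical input element: a keyed score stamped with its event time. */
    static final class Event {
        final String key;
        final int score;
        final long eventTimeMillis;

        Event(String key, int score, long eventTimeMillis) {
            this.key = key;
            this.score = score;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /** Start of the fixed window that contains the given event-time timestamp. */
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    /** Sums scores per key inside each window; the outer key is the window start. */
    static Map<Long, Map<String, Integer>> sumPerKeyAndWindow(List<Event> events) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            result.computeIfAbsent(windowStart(e.eventTimeMillis), w -> new TreeMap<>())
                  .merge(e.key, e.score, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("red", 5, 30_000L), new Event("red", 4, 90_000L),
            new Event("red", 3, 150_000L), new Event("blue", 2, 125_000L));
        System.out.println(sumPerKeyAndWindow(events)); // prints {0={red=9}, 120000={blue=2, red=3}}
    }
}
```

Because the window is derived from the event timestamp alone, the grouping is the same regardless of the order in which records arrive.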
The Beam model - when in processing time?
// AtWatermark() is slide shorthand for a trigger that fires when the watermark
// passes the end of the window (AfterWatermark.pastEndOfWindow() in the Java SDK).
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
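A watermark trigger can be illustrated with a toy single-threaded simulation. This sketch makes several assumptions that are not from the slides: the watermark is supplied explicitly as part of the input, keys are omitted, and data arriving after its window has fired is simply dropped. `Input` and `run` are hypothetical names, not Beam classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WatermarkTriggerDemo {
    static final long WINDOW = 120_000L; // two-minute fixed windows, in milliseconds

    /** Either a data element (score at an event time) or a watermark advance. */
    static final class Input {
        final boolean isWatermark;
        final long time;
        final int score;

        private Input(boolean isWatermark, long time, int score) {
            this.isWatermark = isWatermark;
            this.time = time;
            this.score = score;
        }

        static Input event(long eventTime, int score) { return new Input(false, eventTime, score); }
        static Input watermark(long time) { return new Input(true, time, 0); }
    }

    /** Processes inputs in arrival order and emits one sum per window when the watermark passes it. */
    static List<String> run(List<Input> inputs) {
        TreeMap<Long, Integer> open = new TreeMap<>(); // window start -> running sum
        List<String> emitted = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (Input in : inputs) {
            if (in.isWatermark) {
                watermark = Math.max(watermark, in.time);
                // Fire and close every window whose end is at or before the watermark.
                while (!open.isEmpty() && open.firstKey() + WINDOW <= watermark) {
                    Map.Entry<Long, Integer> done = open.pollFirstEntry();
                    emitted.add("[" + done.getKey() + "," + (done.getKey() + WINDOW) + ")=" + done.getValue());
                }
            } else {
                long start = in.time - Math.floorMod(in.time, WINDOW);
                if (start + WINDOW <= watermark) {
                    continue; // late data: its window already fired, so this sketch drops it
                }
                open.merge(start, in.score, Integer::sum);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Input> inputs = List.of(
            Input.event(30_000L, 5), Input.event(130_000L, 3), Input.event(90_000L, 4),
            Input.watermark(120_000L), Input.event(100_000L, 7), Input.watermark(240_000L));
        System.out.println(run(inputs));
    }
}
```

The score of 7 arrives after the watermark has passed its window, which is the case that early and late firings are meant to handle.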
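The Beam model - how do refinements relate?
// AtWatermark(), AtPeriod() and AtCount() are slide shorthand; in the Java SDK they roughly correspond to
// AfterWatermark.pastEndOfWindow(), AfterProcessingTime.pastFirstElementInPane().plusDelayOf(...) and AfterPane.elementCountAtLeast(...).
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());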
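The difference between accumulating and discarding fired panes can be shown with a small plain-Java sketch. The pane contents are made-up numbers for one window (an early firing, the on-time firing, and a late firing), and `paneSums` is a hypothetical helper, not a Beam call.

```java
import java.util.ArrayList;
import java.util.List;

class AccumulationDemo {
    /**
     * Returns the value emitted at each firing of a single window.
     * Accumulating mode emits the sum of everything seen so far;
     * discarding mode emits only the sum of elements since the previous firing.
     */
    static List<Integer> paneSums(List<List<Integer>> panes, boolean accumulating) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            int paneSum = 0;
            for (int v : pane) {
                paneSum += v;
            }
            running += paneSum;
            out.add(accumulating ? running : paneSum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Elements that arrived before each firing: early, on-time, late.
        List<List<Integer>> panes = List.of(List.of(5, 4), List.of(3), List.of(7));
        System.out.println("accumulating: " + paneSums(panes, true));  // prints accumulating: [9, 12, 19]
        System.out.println("discarding:   " + paneSums(panes, false)); // prints discarding:   [9, 3, 7]
    }
}
```

With accumulating panes, each new result replaces the previous one; with discarding panes, a downstream consumer has to add the results together itself.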
Customizing what where when how
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Apache Beam - the ecosystem
http://beam.incubator.apache.org/capability-matrix
05 Demo
Let's run a Beam pipeline on 3 engines in 2 separate locations
What we just did
Created 1 Beam pipeline
Ran that one pipeline on three execution engines in two places
● Google Cloud Platform
  ○ Google Cloud Dataflow
  ○ Apache Spark on Google Cloud Dataproc
● Local
  ○ Apache Beam local runner
  ○ Apache Flink
100% portability, 0 problems
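The idea behind the demo, one pipeline definition handed to interchangeable engines, can be sketched in plain Java. Everything below is a hypothetical stand-in: `Runner`, `SequentialRunner`, and `ParallelRunner` are illustrative names and not Beam's runner classes, and the "pipeline" is reduced to a single function.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class PortablePipelineDemo {
    /** Hypothetical stand-in for an execution engine; not a Beam interface. */
    interface Runner {
        <I, O> List<O> run(Function<I, O> step, List<I> input);
    }

    /** Executes the step one element at a time on the calling thread. */
    static final class SequentialRunner implements Runner {
        public <I, O> List<O> run(Function<I, O> step, List<I> input) {
            return input.stream().map(step).collect(Collectors.toList());
        }
    }

    /** Executes the same step on a parallel stream; results keep input order. */
    static final class ParallelRunner implements Runner {
        public <I, O> List<O> run(Function<I, O> step, List<I> input) {
            return input.parallelStream().map(step).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // The "pipeline" is defined once, with no reference to any engine.
        Function<String, Integer> nameLength = String::length;
        List<String> players = List.of("alice", "bob");

        List<Runner> runners = List.of(new SequentialRunner(), new ParallelRunner());
        for (Runner r : runners) {
            System.out.println(r.getClass().getSimpleName() + " -> " + r.run(nameLength, players));
        }
    }
}
```

The business logic never changes; only the runner that executes it does, which is the portability property the demo claims.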
06 Things to remember
Recap and how you can get involved
Apache Beam is designed to provide portable pipelines with a unified programming model
Get involved with Apache Beam
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Join the Apache Beam Slack channel
https://apachebeam.slack.com
Follow @ApacheBeam on Twitter
A special thank you to Frances Perry and Tyler Akidau for sharing Apache Beam content that was used in this presentation.
Thank you
