Have Your Cake and Eat It Too
Architectures for Batch and Stream Processing
2
Stuff We’ll Talk About
• Why do we need both streams and batches
• Why is it a problem?
• Stream-Only Patterns (i.e. Kappa Architecture)
• Lambda-Architecture Technologies
– SummingBird
– Apache Spark
– Apache Flink
– Bring-your-own-framework
3
About Me
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
©2014 Cloudera, Inc. All rights reserved.
4
Why Streaming and Batch
5
Batch Processing
• Store data somewhere
• Read large chunks of data
• Do something with data
• Sometimes store results
6
Batch Examples
• Analytics
• ETL / ELT
• Training machine learning models
• Recommendations
7
Stream Processing
• Listen to incoming events
• Do something with each event
• Maybe store events / results
8
Stream Processing Examples
• Anomaly detection, alerts
• Monitoring, SLAs
• Operational intelligence
• Analytics, dashboards
• ETL
9
Streaming & Batch
Alerts
Monitoring, SLAs
Operational Intelligence
Risk Analysis
Anomaly detection
Analytics
ETL
10
Four Categories
• Streams Only
• Batch Only
• Can be done in both
• Must be done in both
ETL
Some Analytics
11
ETL
Most stream processing projects I see involve a few simple transformations:
• Currency conversion
• JSON to Avro
• Field extraction
• Joining a stream to a static data set
• Aggregate on window
• Identifying change in trend
• Document indexing
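As a rough illustration of how small these transformations tend to be, the sketch below applies a currency conversion and a join against a static lookup table to each event, then aggregates over a fixed window. The `Event` shape, the exchange rates and the 60-second window are made-up assumptions, and plain Scala collections stand in for a real stream.

```scala
object SimpleEtl {
  // Hypothetical event shape; a real pipeline would parse this from JSON or Avro.
  final case class Event(timestampSec: Long, userId: String, amount: BigDecimal, currency: String)
  final case class Enriched(timestampSec: Long, country: String, amountUsd: BigDecimal)

  // Static data sets to join against (illustrative values only, not real rates).
  val usdRates: Map[String, BigDecimal] = Map("USD" -> BigDecimal(1), "EUR" -> BigDecimal(2))
  val userCountry: Map[String, String] = Map("u1" -> "US", "u2" -> "DE")

  // Per-event work: currency conversion plus a lookup join.
  // Events with an unknown currency or user are dropped.
  def enrich(e: Event): Option[Enriched] =
    for {
      rate    <- usdRates.get(e.currency)
      country <- userCountry.get(e.userId)
    } yield Enriched(e.timestampSec, country, e.amount * rate)

  // Aggregate on a tumbling window: total USD per (window start, country).
  def windowTotals(events: Seq[Event], windowSec: Long = 60L): Map[(Long, String), BigDecimal] =
    events.flatMap(e => enrich(e))
      .groupBy(x => (x.timestampSec / windowSec * windowSec, x.country))
      .map { case (key, xs) => key -> xs.map(_.amountUsd).sum }
}
```

The per-event function (`enrich`) is the same whether it is called from a batch job or a streaming job; only the windowing step depends on how events arrive.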
12
Batch || Streaming
Batch:
• Efficient:
– Lower CPU utilization
– Better network and disk throughput
– Fewer locks and waits
• Easier administration
• Easier integration with RDBMS
• Existing expertise
• Existing tools
Streaming:
• Real-time information
13
The Problem
14
We Like
• Efficiency
• Scalability
• Fault Tolerance
• Recovery from errors
• Experimenting with different approaches
• Debuggers
• Cookies
15
But…
We don’t like maintaining two applications that do the same thing
16
Do we really need to maintain the same app twice?
Yes, because:
• We are not sure about requirements
• We sometimes need to re-process with very high efficiency
Not really:
• Different apps for batch and streaming
• Can re-process with streams
• Can error-correct with streams
• Can maintain one code-base for batches and streams
17
Stream-Only Patterns (Kappa Architecture)
18
DWH Example
[Diagram: sources (OLTP DB; Sensors, Logs) and a DWH containing Fact Table (Partitioned), Real Time Fact Tables, three Dimension tables, Views and Aggregates. App 1: Stream processing. App 2: Occasional load.]
19
We need to fix older data
[Diagram: sequence of positions 0 to 13; Streaming App v1; Streaming App v2; Real-Time Table; Replacement Partition; Partitioned Fact Table]
20
We need to fix older data
[Diagram: sequence of positions 0 to 13; Streaming App v1; Streaming App v2; Real-Time Table; Replacement Partition; Partitioned Fact Table]
21
We need to fix older data
[Diagram: sequence of positions 0 to 13; Streaming App v2; Real-Time Table]
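The reprocessing pattern in these three slides can be sketched without any framework: keep the raw events in a replayable log, start a second version of the streaming app from the beginning of the log writing to a replacement table, and switch readers over once it has caught up. The names below (`Log`, `Table`, `appV1`, `appV2`) are illustrative assumptions, with an in-memory vector standing in for a retained log such as a Kafka topic.

```scala
object KappaReprocess {
  // The retained log: position in the vector plays the role of the offset.
  type Log = Vector[String]
  // A "table" keyed by offset, so replays overwrite rather than duplicate.
  type Table = Map[Int, String]

  // v1 has a bug (it keeps surrounding whitespace); v2 fixes it.
  val appV1: String => String = _.toUpperCase
  val appV2: String => String = _.trim.toUpperCase

  // Run one version of the app over the log, starting at a given offset.
  def run(log: Log, app: String => String, fromOffset: Int = 0): Table =
    log.zipWithIndex.drop(fromOffset).map { case (event, offset) => offset -> app(event) }.toMap

  // Fix older data: replay everything through v2 into a replacement table.
  // Readers are then pointed at the replacement and v1's output is retired.
  def reprocess(log: Log): Table = run(log, appV2, fromOffset = 0)
}
```

The point of the pattern is that error correction needs no separate batch code path: the fix ships as a new version of the same streaming app.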
22
Lambda-Architecture Technologies
23
WordCount in Scala
source.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()
24
SummingBird
25
MapReduce was great because…
Very simple abstraction:
- Map
- Shuffle
- Reduce
- Type-safe
And it has simpler abstractions on top.
26
SummingBird
• Multi-stage MapReduce
• Runs on Hadoop, Spark, Storm
• Very easy to combine batch and streaming results
27
API
• Platform – Storm, Scalding, Spark…
• Producer.source(Platform) <- get data
• Producer – collection of events
• Transformations – map, filter, merge, leftJoin (lookup)
• Output – write(sink), sumByKey(store)
• Store – contains aggregate for each key, and reduce operation
28
Associative Reduce
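Associativity is what makes it safe to reduce a batch and a stream separately and then merge the partial results. A minimal sketch, using a hand-rolled `Monoid` trait rather than Algebird's actual API (the trait, the instance and the function names here are assumptions for illustration):

```scala
object AssociativeReduce {
  // A monoid: an associative combine plus an identity element.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Word counts form a monoid: merge maps by adding counts per key.
  val countMonoid: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
    val zero: Map[String, Long] = Map.empty
    def plus(x: Map[String, Long], y: Map[String, Long]): Map[String, Long] =
      y.foldLeft(x) { case (acc, (word, n)) => acc.updated(word, acc.getOrElse(word, 0L) + n) }
  }

  // Count words by turning each into a one-entry map and summing with the monoid.
  def count(words: Seq[String]): Map[String, Long] =
    words.map(w => Map(w -> 1L)).foldLeft(countMonoid.zero)((x, y) => countMonoid.plus(x, y))

  // Because plus is associative, counting everything at once equals
  // counting the batch and the stream separately and merging the results.
  def merged(batch: Seq[String], stream: Seq[String]): Map[String, Long] =
    countMonoid.plus(count(batch), count(stream))
}
```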
29
WordCount SummingBird
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)

val stormTopology = Storm.remote("stormName").plan(wordCount)
val hadoopJob = Scalding("scaldingName").plan(wordCount)
30
SparkStreaming
31
First, there was the RDD
• Spark is its own execution engine
• With a high-level API
• RDDs are sharded collections
• Can be mapped, reduced, grouped, filtered, etc.
32
Spark Streaming
[Diagram: Source → Receiver → RDD for the pre-first, first and second batches; the RDDs form DStreams; Filter → Count → Print runs over each batch in a single pass]
Spark Streaming
[Diagram: Source → Receiver → RDD for the pre-first, first and second batches; Filter → Count → Print in a single pass per batch, with Stateful RDD 1 and Stateful RDD 2 carrying state between batches]
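The micro-batch model in the two diagrams above can be approximated in plain Scala: events are grouped into fixed-interval batches, the same filter-and-count logic runs once per batch, and a running state is carried from one batch to the next. This is only a sketch of the idea, not Spark's implementation; the `Event` type, the one-second interval and the function names are assumptions.

```scala
object MicroBatch {
  final case class Event(timeMs: Long, word: String)

  // Split a stream into micro-batches by arrival time.
  // Intervals with no events produce no batch in this simplified version.
  def batches(events: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / intervalMs).toSeq.sortBy(_._1).map(_._2)

  // Stateless step, applied to each batch independently: filter, then count.
  def countBatch(batch: Seq[Event]): Map[String, Int] =
    batch.filter(_.word.nonEmpty).groupBy(_.word).map { case (w, es) => w -> es.size }

  // Stateful step: fold each batch's counts into the state left by the previous batch.
  def runningCounts(events: Seq[Event], intervalMs: Long = 1000L): Seq[Map[String, Int]] =
    batches(events, intervalMs)
      .map(countBatch)
      .scanLeft(Map.empty[String, Int]) { (state, counts) =>
        counts.foldLeft(state) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }
      }
      .tail // drop the empty initial state
}
```

`countBatch` is ordinary batch code, which is why almost the same program can run in both modes; only `runningCounts` is specific to streaming.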
34
Compared to SummingBird
Differences:
• Micro-batches
• Completely new execution model
• Real joins
• Reduce is not limited to Monoids
• SparkStreaming has a richer API
• Summingbird can aggregate batch and stream to one dataset
• SparkStreaming runs in a debugger
Similarities:
• Almost the same code will run in batch and streams
• Use of Scala
• Use of functional programming concepts
35
Spark Example
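import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)           // path: input location; 2 = minimum partitions
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)      // RDDs have no print(); collect to the driver first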
36
Spark Streaming Example
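import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()                                 // prints the first elements of each batch
ssc.start()
ssc.awaitTermination()                             // keep the driver alive while the stream runs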
37
Apache Flink
38
Execution Model
You don’t want to know.
39
Flink vs SparkStreaming
Differences:
• Flink is event-by-event streaming; events go through a pipeline
• SparkStreaming has good integration with HBase as a state store
• “checkpoint barriers”
• Optimization based on strong typing
• Flink is newer than SparkStreaming, so there is less production experience
Similarities:
• Very similar APIs
• Built-in stream-specific operators (windows)
• Exactly-once guarantees through checkpoints of offsets and state (Flink is limited to small state for now)
40
WordCount Batch
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val text = getTextDataSet(env)
// Lower-case each line and split on runs of non-word characters
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.print()
env.execute("Wordcount Example")
41
WordCount Streaming
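import org.apache.flink.streaming.api.scala._

// socketTextStream lives on the streaming environment, not on ExecutionEnvironment
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream(host, port)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)   // as on the slide; later Flink releases use keyBy for streams
  .sum(1)
counts.print()
env.execute("Wordcount Example")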
42
Bring Your Own Framework
43
If the requirements are simple…
44
How difficult is it to parallelize transformations?
Simple transformations are simple
45
Just add Kafka
Kafka is a reliable data source
You can read:
• Batches
• Microbatches
• Streams
Also allows for re-partitioning
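One way to picture this: a Kafka-style log is addressed by offset, so batch, micro-batch and streaming consumption differ mainly in how many records are requested per read. The sketch below uses an in-memory `Log` class as a stand-in; it is not the Kafka client API, and all names are hypothetical.

```scala
object OffsetLog {
  // In-memory stand-in for one partition of a retained log.
  final class Log[A](records: Vector[A]) {
    // Read up to maxRecords starting at offset; returns the records and the next offset.
    def read(offset: Int, maxRecords: Int): (Vector[A], Int) = {
      val chunk = records.slice(offset, offset + maxRecords)
      (chunk, offset + chunk.size)
    }
  }

  // Consume the whole log in reads of a given size, collecting each read.
  // A large maxRecords behaves like a batch, a small one like a micro-batch,
  // and 1 like event-at-a-time streaming.
  def consume[A](log: Log[A], maxRecords: Int): Vector[Vector[A]] = {
    @annotation.tailrec
    def loop(offset: Int, acc: Vector[Vector[A]]): Vector[Vector[A]] = {
      val (chunk, next) = log.read(offset, maxRecords)
      if (chunk.isEmpty) acc else loop(next, acc :+ chunk)
    }
    loop(0, Vector.empty)
  }

  // Re-partitioning: route each record to a partition chosen by hashing its key.
  def repartition[A](records: Seq[A], partitions: Int)(key: A => String): Map[Int, Seq[A]] =
    records.groupBy(r => math.abs(key(r).hashCode % partitions))
}
```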
46
Cluster management
• Managing cluster resources used to be difficult
• Now:
– YARN
– Mesos
– Docker
– Kubernetes
47
So your app should…
• Allocate resources and track tasks with YARN / Mesos
• Read from Kafka (however often you want)
• Do simple transformations
• Write to Kafka / HBase
• How difficult can it possibly be?
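To give a sense of scale, the core of such an app is a read-transform-write loop that remembers the last offset it handled. The sketch below leaves out resource allocation (the YARN / Mesos part) and uses hypothetical `Source` and `Sink` traits instead of real Kafka or HBase clients. It also ignores failure handling and delivery guarantees, which is where most of the real difficulty tends to be.

```scala
object TinyStreamApp {
  // Hypothetical interfaces; real implementations would wrap Kafka / HBase clients.
  trait Source[A] { def poll(offset: Long, max: Int): Seq[A] }
  trait Sink[B]   { def write(records: Seq[B]): Unit }

  // Run until the source has nothing more to return. The result is the next offset
  // to read, which a real app would persist so it can resume after a restart.
  def run[A, B](source: Source[A], sink: Sink[B], startOffset: Long, batchSize: Int)
               (transform: A => B): Long = {
    var offset = startOffset
    var records = source.poll(offset, batchSize)
    while (records.nonEmpty) {
      sink.write(records.map(transform)) // do simple transformations, then write out
      offset += records.size             // advance only after the write has returned
      records = source.poll(offset, batchSize)
    }
    offset
  }
}
```

`batchSize` is the "however often you want" knob: the same loop covers batch and streaming use depending on how much is read per poll.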
48
Parting Thoughts
49
Good engineering lessons
• DRY – do you really need the same code twice?
• Error correction is critical
• Reliability guarantees are critical
• Debuggers are really nice
• Latency / Throughput trade-offs
• Use existing expertise
• Stream processing is about patterns
Thank you


Editor's Notes

  • #4: This gives me a lot of perspective regarding the use of Hadoop
  • #29: Algebird has tons of associative reducers