Spark Streaming
with C*
jacek.lewandowski@datastax.com
…applies where you need
near-real-time data analysis
Spark vs Spark Streaming
Spark: zillions of bytes, a static dataset
Spark Streaming: gigabytes per second, a stream of data
What can you do with it?
Sources: applications, sensors, web, mobile phones
Use cases: intrusion detection, malfunction detection, site analytics, network metrics analysis, fraud detection, dynamic process optimisation, recommendations, location-based ads, log processing, supply chain planning, sentiment analysis, spying
Almost Whatever Source You Want
Almost Whatever Destination You Want
Spark Streaming with Cassandra
so, let’s see how it works
DStream - a continuous sequence of micro-batches
DStream = μBatch (ordinary RDD), μBatch (ordinary RDD), μBatch (ordinary RDD), …
Processing a DStream = processing its μBatches (ordinary RDDs)
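To make that concrete, here is a minimal sketch (the socket source, host and port are assumptions for illustration, not from the deck): a StreamingContext chops the input into 1-second μBatches, and each μBatch can be handled as an ordinary RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-demo")
val ssc = new StreamingContext(conf, Seconds(1))      // μBatch duration = 1s

val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
lines.foreachRDD { (rdd, time) =>                     // each μBatch arrives as an ordinary RDD
  println(s"μBatch at $time: ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()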
9 8 7 6 5 4 3 2 1 → Receiver → Block Manager (across the Spark memory boundary)
Receiver: the interface between different stream sources and Spark
Block Manager: replication and building μBatches
Inside the Spark memory boundary, the Block Manager holds blocks of input data: 9 8 7 6 5 4 3 2 1
A μBatch is made of those blocks: 9 8 7 6 5 4 3 2 1
Each block becomes a partition of the μBatch: Partition | Partition | Partition
Ingestion from multiple sources
Several receivers run in parallel, each doing its own receiving and μBatch building
Their blocks are assembled into μBatches on a common timeline (2s, 1s, 0s)
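A sketch of what multi-source ingestion can look like in code, assuming the ssc from the sketch above and three socket sources on made-up ports; each socketTextStream gets its own receiver, and the union is one DStream fed by all of them:

// three receivers running in parallel, one per source (ports are placeholders)
val sources = (1 to 3).map(i => ssc.socketTextStream("localhost", 9000 + i))
val merged = ssc.union(sources) // a single DStream backed by all three receivers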
A well-worn example
• ingest text messages
• split them into separate words
• count the occurrences of words within 5-second windows
• every 5 seconds, save the word counts from the last 5 seconds to Cassandra, and display the first few results on the console
how to do that? 
well…
Yes, it is that easy

// assumes: stream is a DStream[(String, String)] (e.g. from Kafka), plus
// import com.datastax.spark.connector.streaming._ (for saveToCassandra)
case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] = words.countByValue()

val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map {
    case (word, count) =>
      (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  mappedWordCounts.sortByKey(ascending = false).values
})

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()
DStream stateless operators 
(quick recap) 
• map 
• flatMap 
• filter 
• repartition 
• union 
• count 
• countByValue 
• reduce 
• reduceByKey 
• joins 
• cogroup 
• transform 
• transformWith
DStream[Bean].count()
Each 1s μBatch yields one count in the output stream (here: 4, then 3)
DStream[Orange].union(DStream[Apple])
Corresponding 1s μBatches of the two streams are merged
Other stateless operations 
• join(DStream[(K, W)]) 
• leftOuterJoin(DStream[(K, W)]) 
• rightOuterJoin(DStream[(K, W)]) 
• cogroup(DStream[(K, W)]) 
are applied on pairs of corresponding μBatches
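For example, a join pairs up only the μBatches produced in the same batch interval. A hedged sketch, assuming the ssc from earlier and two text streams of "key value" lines on made-up ports:

val a: DStream[(String, String)] = ssc.socketTextStream("localhost", 9001)
  .map { line => val Array(k, v) = line.split(" ", 2); (k, v) }
val b: DStream[(String, String)] = ssc.socketTextStream("localhost", 9002)
  .map { line => val Array(k, v) = line.split(" ", 2); (k, v) }

// per batch interval: the current μBatch of a joined with the current μBatch of b
val joined: DStream[(String, (String, String))] = a.join(b)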
transform, transformWith 
• DStream[T].transform(RDD[T] => RDD[U]): DStream[U] 
• DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V] 
allow you to create new stateless operators
DStream[Blue].transformWith(DStream[Red], …): DStream[Violet]
μBatches 1-A, 2-A, 3-A are combined pairwise with 1-B, 2-B, 3-B:
1-A x 1-B, 2-A x 2-B, 3-A x 3-B
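A sketch of building such a custom stateless operator with transformWith; the two input streams here (events and blocked) are assumptions for illustration:

import org.apache.spark.rdd.RDD

val events:  DStream[String] = ssc.socketTextStream("localhost", 9003)
val blocked: DStream[String] = ssc.socketTextStream("localhost", 9004)

// per batch interval: drop from events whatever appeared on blocked
val allowed: DStream[String] =
  events.transformWith(blocked, (e: RDD[String], b: RDD[String]) => e.subtract(b))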
Windowing
(timeline 0s-7s, with window and slide markers)
By default: window = slide = μBatch duration
Windowing
(window and slide on the 0s-7s timeline)
The resulting DStream consists of 3-second μBatches
Each resulting μBatch overlaps the preceding one by 1 second
Windowing (window = 3s, slide = 1s)
A μBatch appears in the output stream every 1s
It contains the messages collected during the last 3s
Example: input 1 2 3 4 5 6 7 8 → windows [1 2 3 4 5 6] and [3 4 5 6 7 8]
DStream window operators 
• window(Duration, Duration) 
• countByWindow(Duration, Duration) 
• reduceByWindow(Duration, Duration, (T, T) => T) 
• countByValueAndWindow(Duration, Duration) 
• groupByKeyAndWindow(Duration, Duration) 
• reduceByKeyAndWindow((V, V) => V, Duration, Duration)
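A minimal sketch of one of these, matching the 3s window / 1s slide from the diagrams above; words is assumed to be the DStream[String] from the earlier example, and both durations must be multiples of the μBatch duration:

// on Spark 1.x, pair operations may need: import org.apache.spark.streaming.StreamingContext._
val windowedCounts: DStream[(String, Long)] =
  words.map(w => (w, 1L)).reduceByKeyAndWindow(_ + _, Seconds(3), Seconds(1))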
Let’s modify the example
• ingest text messages
• split them into separate words
• count the occurrences of words within 10-second windows
• every 2 seconds, save the word counts from the last 10 seconds to Cassandra, and display the first few results on the console
Yes, it is still easy to do

case class WordCount(time: Long, word: String, count: Int)

val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] =
  words.countByValueAndWindow(Seconds(10), Seconds(2))

val topWordCounts: DStream[WordCount] = wordCounts.transform((rdd, time) => {
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map {
    case (word, count) =>
      (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  mappedWordCounts.sortByKey(ascending = false).values
})

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()
DStream stateful operator
• DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
New values in the μBatch: (A,1) (B,2) (A,3) (C,4) (A,5) (B,6)
Previous state: (A,7) (B,8) (C,9)
• R1 = f(Seq(1, 3, 5), Some(7))
• R2 = f(Seq(2, 6), Some(8))
• R3 = f(Seq(4), Some(9))
New state: (A,R1) (B,R2) (C,R3)
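One practical note, not on the slide: stateful operators keep state across μBatches, so Spark Streaming requires a checkpoint directory to be set before updateStateByKey is used (the path below is just an example):

ssc.checkpoint("/tmp/spark-streaming-checkpoint") // required by updateStateByKey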
Total word count example

case class WordCount(time: Long, word: String, count: Int)

def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
  val sum = counts.sum
  Some(state.getOrElse(0L) + sum)
}

val totalWords: DStream[(String, Long)] =
  stream.map { case (_, paragraph) => paragraph }
        .flatMap(_.split("""\s+"""))
        .countByValue()
        .updateStateByKey(update)

val topTotalWordCounts: DStream[WordCount] =
  totalWords.transform((rdd, time) =>
    rdd.map { case (word, count) =>
      (count, WordCount(time.milliseconds, word, count.toInt))
    }.sortByKey(ascending = false).values
  )

topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
topTotalWordCounts.print()
Obtaining DStreams 
• ZeroMQ 
• Kinesis 
• HDFS compatible file system 
• Akka actor 
• Twitter 
• MQTT 
• Kafka 
• Socket 
• Flume 
• …
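As one concrete case, a hedged sketch of the receiver-based Kafka source from the spark-streaming-kafka module of that era (the ZooKeeper address, group id, and topic are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zookeeper-host:2181", // ZooKeeper quorum
  "demo-consumer-group", // consumer group id
  Map("demo-topic" -> 1) // topic -> number of receiver threads
) // DStream[(String, String)] of (key, message), like the stream used above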
Particular DStreams are available in separate modules
GroupId | ArtifactId | Latest Version
org.apache.spark | spark-streaming-kinesis-asl_2.10 | 1.1.0
org.apache.spark | spark-streaming-mqtt_2.10 | 1.1.0
org.apache.spark | spark-streaming-zeromq_2.10 | 1.1.0
org.apache.spark | spark-streaming-flume_2.10 | 1.1.0
org.apache.spark | spark-streaming-flume-sink_2.10 | 1.1.0
org.apache.spark | spark-streaming-kafka_2.10 | 1.1.0
org.apache.spark | spark-streaming-twitter_2.10 | 1.1.0
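In sbt, pulling one of these in looks like this (using %% to pick up the _2.10 suffix for a Scala 2.10 build):

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"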
If something goes wrong…
Fault tolerance 
The sequence 
of transformations is known 
to Spark Streaming 
μBatches are replicated 
once they are received 
Lost data can be recomputed
But there are pitfalls
• Spark replicates blocks, not single messages
• It is up to a particular receiver to decide whether to form a block from a single message or to collect more messages before pushing the block
• Data collected in the receiver before the block is pushed is lost if the receiver fails
• The typical tradeoff: efficiency vs fault tolerance (see the sketch below)
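A sketch of where that tradeoff shows up in a custom receiver (the buffering policy, threshold, and read helper are made up for illustration): store(single record) hands data to Spark one message at a time, while buffering and calling store(buffer) pushes a whole block at once, which is faster but means everything still sitting in the buffer is lost if the receiver dies.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class DemoReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
  def onStart(): Unit = new Thread("demo-receiver") {
    override def run(): Unit = {
      val buffer = new ArrayBuffer[String]()
      while (!isStopped()) {
        buffer += readOneMessage()  // hypothetical read from the source
        if (buffer.size >= 100) {   // efficiency: push a whole block...
          store(buffer)             // ...but these 100 are at risk until pushed
          buffer.clear()
        }
      }
    }
  }.start()

  def onStop(): Unit = {}

  private def readOneMessage(): String = "message" // stand-in for a real source
}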
Built-in receivers breakdown
Pushing single messages: Kafka, Twitter, Socket, MQTT
Can do both: Akka, Custom
Pushing whole blocks: RawNetworkReceiver, ZeroMQ
Thank you!
Questions?
http://spark.apache.org/ 
https://github.com/datastax/spark-cassandra-connector 
http://cassandra.apache.org/ 
http://www.datastax.com/
