Kafka
Streaming Spark
• Multiple streams
• Different rates, not synchronized
• Archival store
• Offline analysis, not real-time
• Working store
• Disk or memory
• Summaries
• Parts of streams
• Queries
• Standing queries (see the sketch after this list)
• Ad-hoc queries
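The difference between standing and ad-hoc queries can be made concrete with a small Spark Streaming sketch. Everything in it (the local master, the socket source on port 9999, the "sensorId,value" record format) is an assumption for illustration, not something from these slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StandingQuerySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StandingQuerySketch")
    val ssc  = new StreamingContext(conf, Seconds(10))   // stream is cut into 10-second batches

    // One incoming stream of "sensorId,value" lines
    val readings = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toDouble))

    // Standing query: registered once, recomputed automatically on every new batch
    val countsPerSensor = readings.map { case (id, _) => (id, 1L) }.reduceByKey(_ + _)
    countsPerSensor.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

An ad-hoc query, by contrast, would be issued once against whatever is already in the working or archival store (for example as an ordinary Spark batch job).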
Examples of Stream Queries
• Velocity
• Streams can have high data rate
• Need to process very fast
• Volume
• Low data rate, but large number of streams
• Ocean sensors, pollution sensors
• Need to store in memory
• May not have huge memory
• Approximate solutions (see the sketch below)
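One common way to live within a limited memory budget, matching the "Approximate solutions" bullet, is to keep only a random sample of each batch and answer queries from the sample. The sketch below reuses the hypothetical readings DStream from the sketch above; the 1% sampling fraction is an arbitrary assumption:

// Keep roughly 1% of each batch; anything computed from the sample is approximate
val sampled = readings.transform(rdd => rdd.sample(withReplacement = false, 0.01))

// Scale each sampled reading back up by 1/0.01 = 100 to estimate the true count per sensor
val approxCountsPerSensor = sampled.map { case (id, _) => (id, 100L) }.reduceByKey(_ + _)
approxCountsPerSensor.print()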
Need for a framework …
• Any company that wants to process live streaming data has this problem
• Twice the effort to implement any new function
• Twice the number of bugs to solve
• Twice the headache
Custom-built distributed stream processing system
• 1000s of complex metrics on millions of video sessions
• Requires many dozens of nodes for processing
• Two processing stacks
Hadoop backend for offline analysis
• Generating daily and monthly reports
• Similar computation as the streaming system
Requirements
[Figure: Max_num_stock]
Existing Streaming Systems
• Storm
• Replays record if not processed by a node
• Processes each record at least once
• May update mutable state twice! (illustrated in the sketch below)
• Mutable state can be lost due to failure!
https://storm.apache.org/
https://storm.apache.org/releases/current/Trident-tutorial.html
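The at-least-once problem is easy to see in a few lines of plain Scala (this is an illustration only, not Storm code): when the update of mutable state is not idempotent, a replayed record is counted twice.

// Plain-Scala illustration of why replays corrupt mutable state
var pageViews = Map.empty[String, Long]

def process(record: String): Unit = {
  // non-idempotent update of mutable running counts
  pageViews = pageViews.updated(record, pageViews.getOrElse(record, 0L) + 1)
}

process("home")  // normal delivery
process("home")  // the same record replayed after a node failure
// pageViews("home") is now 2, although only one real page view occurred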
Exercise 4: Solution
[Figure: Max_num_stock]
Discretized Stream Processing
[Figure: the live input stream is divided into batches; Spark processes each batch and emits processed results]
Streaming Spark - DStreams
Remember!!!!
[Figure: tweets DStream transformed into a hashTags DStream]
new DStream transformation: modify data in one DStream to create another DStream
hashTags.saveAsHadoopFiles("hdfs://...")
[Figure: every batch of the hashTags DStream is saved to HDFS]
Java Example
Scala
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
Java
JavaDStream<Status> tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")
• Stateless transformations
• Stateful transformations
Spark Streaming – stateless processing
Stateless transformations in Spark
• Examples (see the sketch below)
• map()
• flatMap()
• filter()
• repartition()
• reduceByKey()
• groupByKey()
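A minimal sketch of these stateless operations (the lines DStream, its socket source, and an existing StreamingContext ssc are assumptions); each transformation is applied independently to every batch:

val lines    = ssc.socketTextStream("localhost", 9999)   // DStream[String]
val words    = lines.flatMap(_.split(" "))               // flatMap(): one line -> many words
val nonEmpty = words.filter(_.nonEmpty)                  // filter(): drop empty tokens
val pairs    = nonEmpty.map(word => (word, 1))           // map(): word -> (word, 1)
val counts   = pairs.reduceByKey(_ + _)                  // reduceByKey(): counts within each batch
counts.print()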
Class Exercise: Stateless stream processing (10 mins)
countByValue
tagCounts: count over all the data in the window
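A possible solution sketch, assuming the hashTags DStream from the earlier slides; the 10-minute window and 1-second slide are assumptions:

import org.apache.spark.streaming.{Minutes, Seconds}

// Count each hashtag over all the data currently in the window
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
tagCounts.print()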
Class Exercise: (5 mins)
Session based state
▪ Maintaining arbitrary state, track sessions
- Maintain per-user mood as state, and update it with his/her tweets
tweets.updateStateByKey(tweet => updateMood(tweet))
▪ Consider the code above
▪ What has to be the structure of the RDD tweets?
▪ Hint – note that updateStateByKey needs a key
▪ What does the function updateMood do?
▪ Hint – note that it should update per-user mood
Exercise 4 – Maintaining State: solution
▪ Maintaining arbitrary state, track sessions
- Maintain per-user mood as state, and update it with his/her tweets
tweets.updateStateByKey(tweet => updateMood(tweet))
▪ What has to be the structure of the RDD tweets?
▪ Must consist of key-value pairs with user as key and mood as value
▪ Dinkar → Happy
▪ KVS → VeryHappy
▪ …
▪ What does the function updateMood do?
▪ Compute the new mood based upon the old mood and tweet
▪ Suppose user KVS (key) tweets “Eating icecream” (value)
Exercise 4 – Maintaining State: solution
▪ Maintaining arbitrary state, track sessions
- Maintain per-user mood as state, and update it with his/her tweets
tweets.updateStateByKey(tweet => updateMood(tweet))
▪ Consider the code above
▪ What has to be the structure of the RDD tweets?
▪ Must consist of key-value pairs with user as key and mood as value
▪ Dinkar → Happy
▪ KVS → VeryHappy
▪ …
▪ Suppose user Dinkar (key) tweets “Eating icecream” (value)
▪ updateStateByKey finds the current mood – Happy
▪ The current mood (Happy) and the tweet (Eating icecream) are passed to updateMood
▪ updateMood calculates the new mood as VeryHappy
▪ updateStateByKey stores the new mood for Dinkar as VeryHappy
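The slide's tweets.updateStateByKey(tweet => updateMood(tweet)) is pseudocode; the actual DStream API hands the update function all new values for a key in the current batch together with that key's previous state. A minimal sketch of what updateMood could look like (the mood strings and the "icecream makes you VeryHappy" rule are assumptions for illustration):

// tweets: DStream[(String, String)] of (user, tweetText) pairs
def updateMood(newTweets: Seq[String], oldMood: Option[String]): Option[String] = {
  val current = oldMood.getOrElse("Neutral")
  // Assumed rule: any tweet mentioning icecream makes the user VeryHappy,
  // otherwise the previous mood is kept
  val newMood = if (newTweets.exists(_.toLowerCase.contains("icecream"))) "VeryHappy" else current
  Some(newMood)
}

// Requires ssc.checkpoint(...) to be set (see the fault-tolerance slides below)
val moods = tweets.updateStateByKey(updateMood _)   // DStream[(String, String)] of (user, mood)
moods.print()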
Fault Tolerant stateful processing
Fault-tolerant Stateful Processing
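Stateful DStreams stay recoverable because their state is periodically checkpointed to a reliable store; after a failure the driver can rebuild the context and its state from that checkpoint. A minimal sketch (the HDFS path and application name are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FaultTolerantState")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")  // state + metadata checkpoints
  // ... define the stateful DStreams (e.g. with updateStateByKey) here ...
  ssc
}

// On restart after a failure, the context and its state are rebuilt from the checkpoint
val ssc = StreamingContext.getOrCreate("hdfs://namenode:8020/spark/checkpoints", createContext _)
ssc.start()
ssc.awaitTermination()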
Fast Fault Recovery
Real Applications: Conviva
[Figure: sessions processed vs. # nodes in cluster]
• Scales linearly with cluster size
Real Applications: Mobile Millennium Project
[Figure: observations processed vs. # nodes in cluster]
• Very CPU intensive, requires dozens of machines for useful computation
• Scales linearly with cluster size
Putting it all together
Vision - one stack to rule them all
Spark program vs Spark Streaming program
Streaming to identify problems in live log streams
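A hedged sketch of the comparison this slide makes (the paths, host name, and the "ERROR" filter are assumptions; sc and ssc are an existing SparkContext and StreamingContext): the batch and the streaming program apply the same transformation, only the input source and the output call differ.

// Spark program: find problems in an archived log file
val logs     = sc.textFile("hdfs://namenode:8020/logs")
val problems = logs.filter(_.contains("ERROR"))
problems.saveAsTextFile("hdfs://namenode:8020/problems")

// Spark Streaming program: the same logic over a live log stream
val liveLogs     = ssc.socketTextStream("loghost", 9999)
val liveProblems = liveLogs.filter(_.contains("ERROR"))
liveProblems.saveAsTextFiles("hdfs://namenode:8020/problems")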
Vision - one stack to rule them all
• Give it a spin!
• Run locally or in a cluster
• https://spark.apache.org/docs/latest/streaming-programming-guide.html
• https://spark.apache.org/streaming/
• Mining of Massive Datasets, Anand Rajaraman, Jure Leskovec, Jeffrey D. Ullman
• Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives, Vijay Srinivasa Agneeswaran
THANK YOU