Spark Streaming: Tathagata "TD" Das
Whoami
Tathagata Das (TD)
> Storm
- Replays a record if it was not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!
What is Spark Streaming?
> Receive data streams from input sources, process them in a cluster, push out to databases/dashboards
> Scalable, fault-tolerant, second-scale latencies
[Diagram: input sources (Kafka, Flume, HDFS, Kinesis, Twitter) → Spark Streaming → HDFS, databases, dashboards]
How does Spark Streaming work?
> Chop up data streams into batches of a few seconds
> Spark treats each batch of data as an RDD and processes it using RDD operations
> Processed results are pushed out in batches
[Diagram: receivers chop incoming data streams into batches; batches are processed as RDDs; results are produced as RDDs]
Spark Streaming Programming Model
> Discretized Stream (DStream)
- Represents a stream of data
- Implemented as a sequence of RDDs
> DStream API is very similar to the RDD API
- Functional APIs in Scala, Java
- Create input DStreams from different sources
- Apply parallel operations
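
The DStream API mirrors the RDD API closely. A minimal sketch (not from the slides; assumes an existing StreamingContext ssc and a text source on a local socket):

// Streaming word count: the same flatMap/map/reduceByKey chain
// used with batch RDDs, applied to a DStream
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print() // print the first few counts of every batch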
Example – Get hashtags from Twitter
val ssc = new StreamingContext(sparkContext, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
transformation: modify data in one DStream to create another DStream

[Diagram: tweets (input DStream) is transformed into the hashTags DStream; new RDDs, e.g. [#cat, #dog, …], are created for every batch]
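
The slides use a getTags helper without defining it. One plausible implementation, assuming the stream elements are twitter4j Status objects (which is what TwitterUtils.createStream produces):

// Hypothetical helper: pull hashtag strings out of a tweet
import twitter4j.Status
def getTags(status: Status): Seq[String] =
  status.getHashtagEntities.map(entity => "#" + entity.getText).toSeq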
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Diagram: every batch of the hashTags DStream is saved to HDFS]
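
Assembled into a complete program, this example might look like the sketch below. The object name, output path, and batch interval are illustrative; saveAsTextFiles is used here because in the actual API saveAsHadoopFiles is only defined on key-value DStreams:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object HashTagsToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HashTagsToHdfs")
    val ssc = new StreamingContext(conf, Seconds(1))

    // None => read Twitter OAuth credentials from twitter4j properties
    val tweets = TwitterUtils.createStream(ssc, None)
    val hashTags = tweets.flatMap(status => getTags(status)) // getTags as sketched above

    // Write every batch as text files under a hypothetical HDFS prefix
    hashTags.saveAsTextFiles("hdfs://namenode:8020/tags/batch")

    ssc.start()            // begin receiving and processing
    ssc.awaitTermination() // run until explicitly stopped
  }
}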
Example – Get hashtags from Twitter
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreachRDD(hashTagRDD => {
  ...
})
[Diagram: foreachRDD runs user code on the RDD of every batch, in place of the saveAsHadoopFiles("hdfs://...") call above]
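
The elided body can run arbitrary RDD code on each batch. For instance, a sketch that prints the ten most frequent hashtags per batch (the counting logic is illustrative, not from the slides):

hashTags.foreachRDD(hashTagRDD => {
  // Count tags within this batch and keep the ten most frequent
  val topTags = hashTagRDD
    .map(tag => (tag, 1))
    .reduceByKey(_ + _)
    .sortBy { case (_, count) => -count }
    .take(10)
  topTags.foreach(println)
})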
Java API
JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { // Function object
})
hashTags.saveAsHadoopFiles("hdfs://...")
Python API
...soon
Window-based Transformations
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Annotations: window() is a sliding window operation; window length = Minutes(1), sliding interval = Seconds(5)]
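
The same sliding count can also be written with the windowed variant of countByValue; a sketch (the checkpoint directory is a hypothetical path, needed because windowed counting keeps state across batches):

ssc.checkpoint("hdfs://namenode:8020/checkpoints") // required for windowed state
val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(5))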
DStreams + RDDs = Power
> Combine live data streams with historical data
- Generate historical data models with Spark, etc.
- Use data models to process live data stream

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamFile).filter(...)
})
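
As a fuller sketch of this pattern (the spam-user file, key layout, and twitter4j accessors are assumptions, not from the slides), a precomputed RDD of known spammer IDs can be joined against every batch:

// Hypothetical historical data: user IDs of known spammers
val spamUsers = sparkContext.textFile("spamUsers.txt").map(id => (id, ())).cache()

// Drop tweets from known spammers by left-joining each batch
val cleanTweets = tweets
  .map(status => (status.getUser.getId.toString, status))
  .transform(rdd =>
    rdd.leftOuterJoin(spamUsers)
       .filter { case (_, (_, spamHit)) => spamHit.isEmpty }
       .map { case (_, (status, _)) => status })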
Apache Spark
> Explore data interactively to identify problems

...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...
> Use the same code in Spark for processing large logs

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
> Use similar code in Spark Streaming for real-time processing

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = KafkaUtils.createStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance
Can process 60M records/sec (6 GB/sec) on
100 nodes at sub-second latency
[Chart: cluster throughput (GB/s) vs. cluster size for WordCount]
Fault-tolerance
> Data lost due to worker failure can be recomputed from replicated input data

[Diagram: flatMap from the tweets RDD to the hashTags RDD; lost partitions are recomputed on other workers]

> All transformations are fault-tolerant and exactly-once
Input Sources
• Out of the box, we provide
- Kafka, Flume, Kinesis, Raw TCP sockets, HDFS, etc.
Output Sinks
• HDFS, S3, etc. (Hadoop-API-compatible filesystems)
• Cassandra (using Spark-Cassandra connector)
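
For a non-filesystem sink such as Cassandra, output typically goes through foreachRDD. A sketch using the Spark-Cassandra connector mentioned above (keyspace, table, and column names are hypothetical):

import com.datastax.spark.connector._

// Count hashtags per batch and write each batch's counts to Cassandra
hashTags
  .map(tag => (tag, 1))
  .reduceByKey(_ + _)
  .foreachRDD(rdd =>
    rdd.saveToCassandra("analytics", "tag_counts", SomeColumns("tag", "count")))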