Chapter-5 Stream Processing Part1
Stream Processing
Today's Lesson
• Streaming Methods
• Sliding Windows & Ageing
• Data Synopsis
Data Streams
• Definition:
A data stream is a continuous and potentially infinite stochastic process in which events occur independently of one another.
• Single scan: each element can be read only once; the stream cannot be replayed.
Data Streams
[Figure: architecture of a data-stream management system. Input streams arrive over time at a stream processor, which answers standing queries using limited working storage; an archival store keeps the full history for offline processing. Example stream elements: 1 5 11 4 4 5 11 4]
Streaming Methods
• Solutions:
  • Storing summaries of previously seen data
  • "Forgetting" stale data
• Timestamp-based windows: keep only the elements that arrived within the last t time units of the data stream
• Sequence-based windows: keep only the last n elements of the data stream
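The two window types above can be sketched in a few lines of Python (a minimal illustration; the class names are my own, not from any streaming library):

```python
from collections import deque

# Sequence-based window: always keep the last n elements of the stream.
class SequenceWindow:
    def __init__(self, n):
        self.buf = deque(maxlen=n)      # stale elements fall out automatically
    def add(self, item):
        self.buf.append(item)
    def contents(self):
        return list(self.buf)

# Timestamp-based window: keep elements that arrived in the last t time units.
class TimestampWindow:
    def __init__(self, t):
        self.t = t
        self.buf = deque()              # (timestamp, item) pairs, oldest first
    def add(self, ts, item):
        self.buf.append((ts, item))
        while self.buf and self.buf[0][0] <= ts - self.t:   # "forget" stale data
            self.buf.popleft()
    def contents(self):
        return [item for _, item in self.buf]

w = SequenceWindow(3)
for x in [1, 5, 11, 4, 4]:
    w.add(x)
print(w.contents())  # [11, 4, 4]
```

The sequence-based window bounds memory by element count, while the timestamp-based window bounds it by age, which is why both count as "forgetting" strategies.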
Streaming Methods
• Solutions:
  • Data reduction
  • Data approximation, e.g. the Discrete Wavelet Transform (DWT): represent a sequence as a combination of basic wavelet functions (Haar wavelets) and keep only a few coefficients
[Figure: a time series over samples 0-140, approximated with Haar wavelet coefficients (Haar 0 to Haar 7)]
Example:
Step-wise transformation of the sequence (stream) X = <8, 4, 1, 3> into its Haar-wavelet representation H = [4, 2, 2, -1]:
  h1 = mean(8, 4, 1, 3) = 4
  h2 = mean(8, 4) - h1 = 6 - 4 = 2
  h3 = (8 - 4) / 2 = 2
  h4 = (1 - 3) / 2 = -1
[Figure: bar charts (values 1-8) showing the original sequence X = {8, 4, 1, 3} and its reconstruction from the coefficients h1 = 4, h2 = 2, h3 = 2, h4 = -1]
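The step-wise transformation above can be written as a short Python function (a minimal sketch assuming the input length is a power of two; the function name is my own):

```python
# Haar wavelet decomposition: at each level, replace adjacent pairs by
# their average and keep their half-difference as a detail coefficient.
# The final result is the overall mean followed by coarse-to-fine details.
def haar(x):
    coeffs = []
    while len(x) > 1:
        averages = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
        details = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
        coeffs = details + coeffs   # finer details go behind coarser ones
        x = averages                # recurse on the averaged sequence
    return x + coeffs

print(haar([8, 4, 1, 3]))  # [4.0, 2.0, 2.0, -1.0]
```

Running it on the slide's sequence X = <8, 4, 1, 3> reproduces H = [4, 2, 2, -1]: the first pass yields averages [6, 2] and details [2, -1], and the second pass averages those to the overall mean 4 with detail 2.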
Spark Streaming
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch
# interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD of this DStream to the console
wordCounts.pprint()

# Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()
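What the map/reduceByKey pipeline computes per micro-batch can be sketched in plain Python (a conceptual illustration of the semantics, not the Spark API):

```python
from collections import defaultdict

# Conceptual sketch of one micro-batch of the word-count pipeline:
# flatMap splits each line into words, map emits (word, 1) pairs,
# and reduceByKey sums the counts per distinct word.
def word_count_batch(lines):
    counts = defaultdict(int)
    for line in lines:                  # flatMap: one line -> many words
        for word in line.split(" "):
            counts[word] += 1           # map + reduceByKey: sum per key
    return dict(counts)

print(word_count_batch(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Spark performs the same computation, but distributed over partitions and repeated once per batch interval.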
Big Data Management and Analytics, Database Systems Group: Stream Processing
Apache Storm
• Three abstractions:
– Spouts
– Bolts
– Topologies
Apache Storm
• Spouts:
  – Source of streams
  – Typically read from queuing brokers (e.g. Kafka, RabbitMQ)
  – Can also generate their own data or read from external sources (e.g. Twitter)
• Bolts:
  – Process any number of input streams
  – Produce any number of output streams
  – Hold most of the computation logic (functions, filters, …)
Apache Storm
• Topologies:
– Network of spouts and bolts
– Each edge represents a bolt subscribing to the output stream of
some other spout or bolt
– A topology is an arbitrarily complex multi-stage stream computation
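The dataflow of such a topology can be sketched conceptually with Python generators standing in for spouts and bolts (an illustration of the idea only, not the Storm API, which is Java-based):

```python
# Conceptual sketch of a Storm topology as chained generators:
# a spout emits tuples, and each bolt consumes one stream and
# produces another, forming a multi-stage computation.
def sentence_spout():                  # spout: the source of the stream
    for s in ["storm runs topologies", "spouts feed bolts"]:
        yield s

def split_bolt(stream):                # bolt: one input tuple -> many outputs
    for sentence in stream:
        yield from sentence.split()

def exclaim_bolt(stream):              # bolt subscribed to split_bolt's output
    for word in stream:
        yield word + "!"

# Wiring the edges of the topology: spout -> split -> exclaim
out = list(exclaim_bolt(split_bolt(sentence_spout())))
print(out[:3])  # ['storm!', 'runs!', 'topologies!']
```

Each edge in the chain corresponds to a bolt subscribing to the output stream of the component before it; Storm additionally runs each stage as many parallel tasks.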
Apache Storm
• Streams:
– Core abstraction in Storm
– A stream is an unbounded sequence of tuples that is
processed and created in parallel in a distributed fashion
– Tuples can contain standard types like integers, floats,
shorts, booleans, strings and so on
– Custom types can be used if a custom serializer is defined
– A stream grouping defines how that stream should be
partitioned among the bolt's tasks
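Two common groupings can be sketched as partitioning functions (a conceptual illustration with made-up names, not the Storm API):

```python
# Conceptual sketch of two Storm stream groupings:
# - fields grouping: partition by a key field, so tuples with the same
#   key always reach the same bolt task (e.g. for stateful counting)
# - shuffle grouping: distribute tuples evenly, ignoring their content
NUM_TASKS = 3

def fields_grouping(key):
    # deterministic: the same key always maps to the same task
    return hash(key) % NUM_TASKS

def shuffle_grouping(tuples):
    # round-robin: balances load across tasks
    return [i % NUM_TASKS for i, _ in enumerate(tuples)]

assert fields_grouping("storm") == fields_grouping("storm")
print(shuffle_grouping(["a", "b", "c", "d"]))  # [0, 1, 2, 0]
```

The choice of grouping decides whether a bolt task sees all tuples for a key (needed for per-key state) or just an even share of the load.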
Apache Storm
[Figure: example topology, a spout feeding a chain of two bolts]
Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes
Further Reading