
Chapter 5: Stream Processing

Chapter 5 discusses stream processing in big data management, covering data streams, stream models, and streaming methods. It highlights key characteristics of data streams, introduces data stream management systems, and explains tools such as Apache Spark Streaming and Apache Storm for real-time processing. The chapter also addresses challenges in processing high-velocity data and techniques for data reduction and approximation.


Today's Lesson

• Data Streams & Data Stream Management Systems
• Data Stream Models
  – Insert-Only
  – Insert-Delete
  – Additive
• Streaming Methods
  – Sliding Windows & Ageing
  – Data Synopses
• Stream Processing – Concepts & Tools
  – Micro-Batching with Apache Spark Streaming
  – Real-Time Stream Processing with Apache Storm


Data Streams


Data Streams

• Definition: a data stream can be seen as a continuous and potentially infinite stochastic process in which events occur independently of one another

• Huge amounts of data → data objects cannot be stored

• Single scan: each element can be examined only once


Data Streams – Key Characteristics

• The data elements in the stream arrive online

• The system has no control over the order in which data elements arrive (either within a data stream or across multiple data streams)

• Data streams are potentially unbounded in size

• Once an element has been processed, it is discarded or archived


Data Stream Management System

[Figure: architecture of a data stream management system. Data streams enter a stream processor, which answers both standing queries and ad-hoc queries and emits output streams. The processor relies on limited working storage and may write data to archival storage.]


Data Stream Models – Insert-Only Model

• Once an element has been seen, it cannot be changed

[Figure: elements flow past the stream processor and are only ever appended over time.]


Data Stream Models – Insert-Delete Model

• Elements can be deleted or updated

[Figure: elements in the stream may be removed or replaced by later arrivals before reaching the stream processor.]


Data Stream Models – Additive Model

• Each element is an increment to the previous version of the given data object

[Figure: arriving increments are added to the current state, e.g. the state <1, 5, 11, 4> becomes <4, 5, 11, 4> after an increment of 3 to the first object.]
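The three models can be contrasted in a few lines of Python. This is an illustrative sketch; the function names and the key-value state are not from the lecture:

# Shared key-value state maintained by the stream processor
state = {}

def insert_only(key, value):
    state.setdefault(key, value)    # once seen, a value is never changed

def insert_delete(key, value=None):
    if value is None:
        state.pop(key, None)        # elements may be deleted ...
    else:
        state[key] = value          # ... or updated in place

def additive(key, increment):
    # each arriving element is an increment to the previous version
    state[key] = state.get(key, 0) + increment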


Streaming Methods

• Huge amount of data vs. limited resources in space → impractical to store all data

• Solutions:
  – Storing summaries of previously seen data
  – "Forgetting" stale data

• But: trade-off between storage space and the ability to provide precise query answers

Streaming Methods – Sliding Windows

• Idea: keep the most recent stream elements in main memory and discard older ones

• Timestamp-based: the window contains all elements that arrived within the last window length of time, and it advances by a fixed sliding interval

[Figure: a data stream with a time-based window of fixed length sliding along the stream.]


Streaming Methods – Sliding Windows

• Idea: keep the most recent stream elements in main memory and discard older ones

• Sequence-based: the window contains the last n elements, regardless of when they arrived

[Figure: a data stream with a count-based window of fixed length sliding along the stream.]
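Both window types fit in a few lines of Python; the class names below are illustrative, not from the lecture:

from collections import deque

class SequenceWindow:
    """Sequence-based: keep the last n elements."""
    def __init__(self, n):
        self.buf = deque(maxlen=n)   # deque evicts the oldest element itself

    def insert(self, element):
        self.buf.append(element)

class TimestampWindow:
    """Timestamp-based: keep elements from the last t time units."""
    def __init__(self, t):
        self.t = t
        self.buf = deque()           # (timestamp, element) pairs, oldest first

    def insert(self, timestamp, element):
        self.buf.append((timestamp, element))
        while self.buf and self.buf[0][0] <= timestamp - self.t:
            self.buf.popleft()       # discard elements that fell out of the window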


Streaming Methods – Ageing

• Idea: keep only a summary in main memory and discard objects as soon as they are processed

[Figure: a data stream whose elements are discarded immediately after updating the summary.]

• Multiply the summary with a decay factor after each time epoch, or after a certain number of elements has arrived
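A minimal sketch of an ageing count in Python, assuming a decay factor of 0.9 applied after every epoch of 100 elements (both numbers are illustrative):

class DecayingCount:
    def __init__(self, decay=0.9, epoch_size=100):
        self.decay = decay
        self.epoch_size = epoch_size
        self.count = 0.0    # the summary kept in main memory
        self.seen = 0

    def process(self, element):
        self.count += 1     # update the summary, then forget the element
        self.seen += 1
        if self.seen % self.epoch_size == 0:
            self.count *= self.decay    # stale contributions fade away geometrically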


Streaming Methods

• High velocity of incoming data vs. limited resources in time → impossible to process all data

• Solutions:
  – Data reduction
  – Data approximation

• But: trade-off between processing speed and the ability to provide precise query answers

Streaming Methods – Sampling

• Select a subset of the data → reduce the amount of data to process

• Difficulty: obtaining a representative sample

• Simplest form: random sampling
  – Reservoir Sampling
  – Min-Wise Sampling

Reservoir Sampling Algorithm (a runnable version is sketched below)
  input: stream S, size of reservoir M
  begin
    insert the first M objects into the reservoir;
    foreach o ∈ S do
      let i be the position of o;
      j := random integer in range 1..i;
      if j ≤ M then
        delete an instance from the reservoir at random;
        insert o into the reservoir;

• Load Shedding: discard some fraction of the data if the arrival rate of the stream might overload the system
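A direct Python translation of the reservoir sampling pseudocode above (the function name is illustrative):

import random

def reservoir_sample(stream, m):
    """Keep a uniform random sample of m elements from a stream of unknown length."""
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= m:
            reservoir.append(element)        # fill the reservoir first
        else:
            j = random.randint(1, i)         # uniform position in 1..i
            if j <= m:
                # replace a random reservoir slot with the new element
                reservoir[random.randrange(m)] = element
    return reservoir

# Each element ends up in the sample with probability m / (elements seen)
print(reservoir_sample(range(10000), m=10))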

Streaming Methods – Data Synopses & Histograms

• Summaries of data objects are often used to reduce the amount of data
  – e.g. microclusters that describe groups of similar objects

• Histograms are used to approximate the frequency distribution of element values
  – commonly used by query optimizers (e.g. for range queries)
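A minimal sketch of an equi-width histogram as a stream synopsis, assuming the value range [lo, hi) is known in advance (class and method names are illustrative):

class StreamHistogram:
    def __init__(self, lo, hi, bins):
        self.lo, self.width = lo, (hi - lo) / bins
        self.counts = [0] * bins

    def insert(self, x):
        b = min(int((x - self.lo) / self.width), len(self.counts) - 1)
        self.counts[b] += 1    # only bucket counts are kept, not the values

    def range_count(self, a, b):
        # approximate answer to "how many values fall in [a, b)";
        # whole boundary buckets are counted, hence the approximation
        first = int((a - self.lo) / self.width)
        last = int((b - self.lo) / self.width)
        return sum(self.counts[first:last + 1])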


• Overview of techniques to build a summary (reduced representation) of a sequence of numeric attributes: DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), SVD (Singular Value Decomposition), APCA (Adaptive Piecewise Constant Approximation), PAA (Piecewise Aggregate Approximation), PLA (Piecewise Linear Approximation)

[Figure: the same example sequence approximated by each of the six techniques.]


Discrete Wavelet Transformation (DWT)

• Idea: a sequence is represented as a linear combination of basic wavelet functions

• The wavelet transformation decomposes a signal into several groups of coefficients at different scales

• Small coefficients can be eliminated → only small errors when reconstructing the signal; keep only the first few coefficients

• Often, Haar wavelets are used (easy to implement)

[Figure: a sequence X, its approximation X' reconstructed from the first coefficients, and the Haar basis functions Haar 0 through Haar 7.]



Example:
Step-wise transformation of the sequence (stream) X = <8, 4, 1, 3> into the Haar wavelet representation H = [4, 2, 2, -1]:

h1 = 4 = mean(8, 4, 1, 3)
h2 = 2 = mean(8, 4) - h1
h3 = 2 = (8 - 4) / 2
h4 = -1 = (1 - 3) / 2

[Figure: bar charts of the step-wise transformation.]

(Lossless) reconstruction of the original sequence (stream) from the Haar wavelet representation:

x1 = h1 + h2 + h3 = 8
x2 = h1 + h2 - h3 = 4
x3 = h1 - h2 + h4 = 1
x4 = h1 - h2 - h4 = 3    → X = {8, 4, 1, 3}

[Figure: bar charts of the step-wise reconstruction.]
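The reconstruction can be checked in a few lines of Python (for the length-4 case; the function name is illustrative):

def haar_reconstruct4(h1, h2, h3, h4):
    # invert the length-4 Haar transform shown above
    return [h1 + h2 + h3, h1 + h2 - h3, h1 - h2 + h4, h1 - h2 - h4]

print(haar_reconstruct4(4, 2, 2, -1))   # [8, 4, 1, 3]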


Haar Wavelet Transformation

Haar Wavelet Transform Algorithm
  input: sequence S = <s1, s2, ..., s2n> of even length
  output: sequence of wavelet coefficients
  begin
    transform S into a sequence of two-component vectors <(a1, d1), ..., (an, dn)>,
      where ai = (s2i-1 + s2i) / 2 and di = (s2i-1 - s2i) / 2;
    separate the sequences A = <a1, ..., an> and D = <d1, ..., dn>;
    recursively transform sequence A;

Input sequence: S = <2, 5, 8, 9, 7, 4, -1, 1>

Step 1:
  averages: <(2+5)/2, (8+9)/2, (7+4)/2, (-1+1)/2> = <3.5, 8.5, 5.5, 0>
  details: <(2-5)/2, (8-9)/2, (7-4)/2, (-1-1)/2> = <-1.5, -0.5, 1.5, -1>
Step 2:
  averages: <(3.5+8.5)/2, (5.5+0)/2> = <6, 2.75>
  details: <(3.5-8.5)/2, (5.5-0)/2> = <-2.5, 2.75>
Step 3:
  averages: <(6+2.75)/2> = <4.375>
  details: <(6-2.75)/2> = <1.625>
→ Wavelet coefficients: <4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1>
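The recursion is compact in Python; this sketch reproduces the coefficients above (the function name is illustrative):

def haar_transform(s):
    """Haar wavelet transform of a sequence whose length is a power of two."""
    coeffs = []
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = details + coeffs    # finer-scale details end up further right
        s = averages                 # recurse on the averages
    return s + coeffs                # overall average, then details coarse to fine

print(haar_transform([2, 5, 8, 9, 7, 4, -1, 1]))
# [4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1.0]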

Spark Streaming

• Spark's streaming framework is built on top of Spark's Core API

• Data ingestion from several different data sources

• Stream processing can be combined with other Spark libraries (e.g. Spark MLlib)


Spark Streaming

• Spark's streaming workflow:
  • The streaming engine receives data from input streams
  • The data stream is divided into several micro-batches, i.e. sequences of RDDs
  • The micro-batches are processed by the Spark engine
  • The result is a stream of batches of processed data

Spark Streaming

• DStreams (Discretized Streams) as the basic abstraction

• Any operation applied to a DStream translates into operations on the underlying RDDs (computed by the Spark engine)

• StreamingContext objects as starting points:

sc = SparkContext(master, appName)
ssc = StreamingContext(sc, 1) #params: SparkContext, batch interval in seconds


Spark Streaming

General schedule for a Spark Streaming application:

1. Define the StreamingContext ssc
2. Define the input sources by creating input DStreams
3. Define the streaming computations by applying transformations and output operations to DStreams
4. Start receiving data and processing it using ssc.start()
5. Wait for the processing to be stopped (manually or due to an error) using ssc.awaitTermination()
6. The processing can be stopped manually using ssc.stop()

Spark Streaming
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

#Create a local StreamingContext with two worker threads and a batch
#interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
#Create a DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
#Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
#Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
#Print the first ten elements of each RDD of this DStream to the console
wordCounts.pprint()
#Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()

Spark Streaming

• Support of window operations

• Two basic parameters:
  – windowLength
  – slideInterval

• Support of many transformations for windowed DStreams:

#Reduce the last 30 seconds of data, every 10 seconds
#(PySpark takes an inverse-reduce function as the second argument;
#None disables incremental window computation)
winWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, None, 30, 10)


Apache Storm

• Alternative to Spark Streaming

• Support of Real-time Processing

• Three abstractions:
– Spouts
– Bolts
– Topologies


Apache Storm

• Spouts:
– Source of streams
– Typically reads from queuing brokers (e.g. Kafka, RabbitMQ)
– Can also generate its own data or read from external sources (e.g.
Twitter)

• Bolts:
– Processes any number of input streams
– Produces any number of output streams
– Holds most of the logic of the computations (functions, filters,…)


Apache Storm

• Topologies:
– Network of spouts and bolts
– Each edge represents a bolt subscribing to the output stream of
some other spout or bolt
– A topology is an arbitrarily complex multi-stage stream computation


Apache Storm

• Streams:
  – Core abstraction in Storm
  – A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion
  – Tuples can contain standard types like integers, floats, shorts, booleans, strings and so on
  – Custom types can be used if a custom serializer is defined
  – A stream grouping defines how a stream should be partitioned among the consuming bolt's tasks

Apache Storm

Example: a topology of one spout ("blue-spout") feeding a bolt ("green-bolt"), which in turn feeds another bolt ("yellow-bolt"):

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
    .setNumTasks(4)
    .shuffleGrouping("blue-spout");
// 4 tasks spread across 2 executors; tuples are randomly distributed
// across the bolt's tasks, so each task gets an approximately equal
// number of tuples

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
    .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
    "mytopology",
    conf,
    topologyBuilder.createTopology()
);

[Figure: the resulting topology distributed across two worker processes, each running several executors with one or more tasks.]

Further Reading

• Joao Gama: Knowledge Discovery from Data Streams (http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf)
• Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark – Lightning-Fast Big Data Analysis
• http://spark.apache.org/docs/latest/streaming-programming-guide.html
• http://storm.apache.org/documentation/Concepts.html
