
Chapter 5: Stream Processing

Chapter 5 discusses stream processing in big data management, covering data streams, stream models, and streaming methods. It highlights key characteristics of data streams, introduces data stream management systems, and explains tools such as Apache Spark Streaming and Apache Storm for real-time processing. The chapter also addresses challenges in processing high-velocity data and techniques for data reduction and approximation.


Today's Lesson

• Data Streams & Data Stream Management Systems
• Data Stream Models
  – Insert-Only
  – Insert-Delete
  – Additive
• Streaming Methods
  – Sliding Windows & Ageing
  – Data Synopses
• Stream Processing – Concepts & Tools
  – Micro-Batching with Apache Spark Streaming
  – Real-Time Stream Processing with Apache Storm


Data Streams


Data Streams

• Definition: a data stream can be seen as a continuous and potentially infinite stochastic process in which events occur independently of one another

• Huge amounts of data → data objects cannot be stored

• Single scan: each element can be examined only once


Data Streams – Key Characteristics

• The data elements in the stream arrive online

• The system has no control over the order in which data elements arrive (either within a data stream or across multiple data streams)

• Data streams are potentially unbounded in size

• Once an element has been processed, it is discarded or archived


Data Stream Management System

[Figure: architecture of a data stream management system. Data streams enter a stream processor, which answers both standing queries and ad-hoc queries and emits output streams. The processor relies on limited working storage and may write data to archival storage.]


Data Stream Models – Insert-Only Model

• Once an element has been seen, it cannot be changed

[Figure: elements flow past the stream processor and are only ever appended over time.]


Data Stream Models – Insert-Delete Model

• Elements can be deleted or updated

[Figure: elements in the stream may be removed or replaced by later arrivals before reaching the stream processor.]


Data Stream Models – Additive Model

• Each element is an increment to the previous version of the given data object

[Figure: arriving increments are added to the current state, e.g. the state <1, 5, 11, 4> becomes <4, 5, 11, 4> after an increment of 3 to the first object.]
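The three models can be contrasted in a few lines of Python. This is an illustrative sketch; the function names and the key-value state are not from the lecture:

# Shared key-value state maintained by the stream processor
state = {}

def insert_only(key, value):
    state.setdefault(key, value)    # once seen, a value is never changed

def insert_delete(key, value=None):
    if value is None:
        state.pop(key, None)        # elements may be deleted ...
    else:
        state[key] = value          # ... or updated in place

def additive(key, increment):
    # each arriving element is an increment to the previous version
    state[key] = state.get(key, 0) + increment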


Streaming Methods

• Huge amount of data vs. limited resources in space → impractical to store all data

• Solutions:
  – Storing summaries of previously seen data
  – "Forgetting" stale data

• But: trade-off between storage space and the ability to provide precise query answers

Streaming Methods – Sliding Windows

• Idea: keep the most recent stream elements in main memory and discard older ones

• Timestamp-based: the window contains all elements that arrived within the last window length of time, and it advances by a fixed sliding interval

[Figure: a data stream with a time-based window of fixed length sliding along the stream.]


Streaming Methods – Sliding Windows

• Idea: keep the most recent stream elements in main memory and discard older ones

• Sequence-based: the window contains the last n elements, regardless of when they arrived

[Figure: a data stream with a count-based window of fixed length sliding along the stream.]
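Both window types fit in a few lines of Python; the class names below are illustrative, not from the lecture:

from collections import deque

class SequenceWindow:
    """Sequence-based: keep the last n elements."""
    def __init__(self, n):
        self.buf = deque(maxlen=n)   # deque evicts the oldest element itself

    def insert(self, element):
        self.buf.append(element)

class TimestampWindow:
    """Timestamp-based: keep elements from the last t time units."""
    def __init__(self, t):
        self.t = t
        self.buf = deque()           # (timestamp, element) pairs, oldest first

    def insert(self, timestamp, element):
        self.buf.append((timestamp, element))
        while self.buf and self.buf[0][0] <= timestamp - self.t:
            self.buf.popleft()       # discard elements that fell out of the window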


Streaming Methods – Ageing

• Idea: keep only a summary in main memory and discard objects as soon as they are processed

[Figure: a data stream whose elements are discarded immediately after updating the summary.]

• Multiply the summary with a decay factor after each time epoch, or after a certain number of elements has arrived
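A minimal sketch of an ageing count in Python, assuming a decay factor of 0.9 applied after every epoch of 100 elements (both numbers are illustrative):

class DecayingCount:
    def __init__(self, decay=0.9, epoch_size=100):
        self.decay = decay
        self.epoch_size = epoch_size
        self.count = 0.0    # the summary kept in main memory
        self.seen = 0

    def process(self, element):
        self.count += 1     # update the summary, then forget the element
        self.seen += 1
        if self.seen % self.epoch_size == 0:
            self.count *= self.decay    # stale contributions fade away geometrically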


Streaming Methods

• High velocity of incoming data vs. limited resources in time → impossible to process all data

• Solutions:
  – Data reduction
  – Data approximation

• But: trade-off between processing speed and the ability to provide precise query answers

Streaming Methods – Sampling

• Select a subset of the data → reduce the amount of data to process

• Difficulty: obtaining a representative sample

• Simplest form: random sampling
  – Reservoir Sampling
  – Min-Wise Sampling

Reservoir Sampling Algorithm (a runnable version is sketched below)
  input: stream S, size of reservoir M
  begin
    insert the first M objects into the reservoir;
    foreach o ∈ S do
      let i be the position of o;
      j := random integer in range 1..i;
      if j ≤ M then
        delete an instance from the reservoir at random;
        insert o into the reservoir;

• Load Shedding: discard some fraction of the data if the arrival rate of the stream might overload the system
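A direct Python translation of the reservoir sampling pseudocode above (the function name is illustrative):

import random

def reservoir_sample(stream, m):
    """Keep a uniform random sample of m elements from a stream of unknown length."""
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= m:
            reservoir.append(element)        # fill the reservoir first
        else:
            j = random.randint(1, i)         # uniform position in 1..i
            if j <= m:
                # replace a random reservoir slot with the new element
                reservoir[random.randrange(m)] = element
    return reservoir

# Each element ends up in the sample with probability m / (elements seen)
print(reservoir_sample(range(10000), m=10))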

Streaming Methods – Data Synopses & Histograms

• Summaries of data objects are often used to reduce the amount of data
  – e.g. microclusters that describe groups of similar objects

• Histograms are used to approximate the frequency distribution of element values
  – commonly used by query optimizers (e.g. for range queries)
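A minimal sketch of an equi-width histogram as a stream synopsis, assuming the value range [lo, hi) is known in advance (class and method names are illustrative):

class StreamHistogram:
    def __init__(self, lo, hi, bins):
        self.lo, self.width = lo, (hi - lo) / bins
        self.counts = [0] * bins

    def insert(self, x):
        b = min(int((x - self.lo) / self.width), len(self.counts) - 1)
        self.counts[b] += 1    # only bucket counts are kept, not the values

    def range_count(self, a, b):
        # approximate answer to "how many values fall in [a, b)";
        # whole boundary buckets are counted, hence the approximation
        first = int((a - self.lo) / self.width)
        last = int((b - self.lo) / self.width)
        return sum(self.counts[first:last + 1])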


• Overview of techniques to build a summary (reduced representation) of a sequence of numeric attributes: DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), SVD (Singular Value Decomposition), APCA (Adaptive Piecewise Constant Approximation), PAA (Piecewise Aggregate Approximation), PLA (Piecewise Linear Approximation)

[Figure: the same example sequence approximated by each of the six techniques.]


Discrete Wavelet Transformation (DWT)

• Idea: a sequence is represented as a linear combination of basic wavelet functions

• The wavelet transformation decomposes a signal into several groups of coefficients at different scales

• Small coefficients can be eliminated → only small errors when reconstructing the signal; keep only the first few coefficients

• Often, Haar wavelets are used (easy to implement)

[Figure: a sequence X, its approximation X' reconstructed from the first coefficients, and the Haar basis functions Haar 0 through Haar 7.]



Example:
Step-wise transformation of the sequence (stream) X = <8, 4, 1, 3> into the Haar wavelet representation H = [4, 2, 2, -1]:

h1 = 4 = mean(8, 4, 1, 3)
h2 = 2 = mean(8, 4) - h1
h3 = 2 = (8 - 4) / 2
h4 = -1 = (1 - 3) / 2

[Figure: bar charts of the step-wise transformation.]

(Lossless) reconstruction of the original sequence (stream) from the Haar wavelet representation:

x1 = h1 + h2 + h3 = 8
x2 = h1 + h2 - h3 = 4
x3 = h1 - h2 + h4 = 1
x4 = h1 - h2 - h4 = 3    → X = {8, 4, 1, 3}

[Figure: bar charts of the step-wise reconstruction.]
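The reconstruction can be checked in a few lines of Python (for the length-4 case; the function name is illustrative):

def haar_reconstruct4(h1, h2, h3, h4):
    # invert the length-4 Haar transform shown above
    return [h1 + h2 + h3, h1 + h2 - h3, h1 - h2 + h4, h1 - h2 - h4]

print(haar_reconstruct4(4, 2, 2, -1))   # [8, 4, 1, 3]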


Haar Wavelet Transformation

Haar Wavelet Transform Algorithm
  input: sequence S = <s1, s2, ..., s2n> of even length
  output: sequence of wavelet coefficients
  begin
    transform S into a sequence of two-component vectors <(a1, d1), ..., (an, dn)>,
      where ai = (s2i-1 + s2i) / 2 and di = (s2i-1 - s2i) / 2;
    separate the sequences A = <a1, ..., an> and D = <d1, ..., dn>;
    recursively transform sequence A;

Input sequence: S = <2, 5, 8, 9, 7, 4, -1, 1>

Step 1:
  averages: <(2+5)/2, (8+9)/2, (7+4)/2, (-1+1)/2> = <3.5, 8.5, 5.5, 0>
  details: <(2-5)/2, (8-9)/2, (7-4)/2, (-1-1)/2> = <-1.5, -0.5, 1.5, -1>
Step 2:
  averages: <(3.5+8.5)/2, (5.5+0)/2> = <6, 2.75>
  details: <(3.5-8.5)/2, (5.5-0)/2> = <-2.5, 2.75>
Step 3:
  averages: <(6+2.75)/2> = <4.375>
  details: <(6-2.75)/2> = <1.625>
→ Wavelet coefficients: <4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1>
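The recursion is compact in Python; this sketch reproduces the coefficients above (the function name is illustrative):

def haar_transform(s):
    """Haar wavelet transform of a sequence whose length is a power of two."""
    coeffs = []
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = details + coeffs    # finer-scale details end up further right
        s = averages                 # recurse on the averages
    return s + coeffs                # overall average, then details coarse to fine

print(haar_transform([2, 5, 8, 9, 7, 4, -1, 1]))
# [4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1.0]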

Spark Streaming

• Spark's streaming framework is built on top of Spark's Core API

• Data ingestion from several different data sources

• Stream processing can be combined with other Spark libraries (e.g. Spark MLlib)


Spark Streaming

• Spark's streaming workflow:
  • The streaming engine receives data from input streams
  • The data stream is divided into several micro-batches, i.e. sequences of RDDs
  • The micro-batches are processed by the Spark engine
  • The result is a stream of batches of processed data

Spark Streaming

• DStreams (Discretized Streams) as the basic abstraction

• Any operation applied to a DStream translates into operations on the underlying RDDs (computed by the Spark engine)

• StreamingContext objects as starting points:

sc = SparkContext(master, appName)
ssc = StreamingContext(sc, 1) #params: SparkContext, batch interval in seconds


Spark Streaming

General schedule for a Spark Streaming application:

1. Define the StreamingContext ssc
2. Define the input sources by creating input DStreams
3. Define the streaming computations by applying transformations and output operations to DStreams
4. Start receiving data and processing it using ssc.start()
5. Wait for the processing to be stopped (manually or due to an error) using ssc.awaitTermination()
6. The processing can be stopped manually using ssc.stop()

Spark Streaming
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

#Create a local StreamingContext with two worker threads and a batch
#interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
#Create a DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)
#Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
#Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
#Print the first ten elements of each RDD of this DStream to the console
wordCounts.pprint()
#Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()

Spark Streaming

• Support of window operations

• Two basic parameters:
  – windowLength
  – slideInterval

• Support of many transformations for windowed DStreams:

#Reduce the last 30 seconds of data, every 10 seconds
#(PySpark takes an inverse-reduce function as the second argument;
#None disables incremental window computation)
winWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, None, 30, 10)


Apache Storm

• Alternative to Spark Streaming

• Support of Real-time Processing

• Three abstractions:
– Spouts
– Bolts
– Topologies


Apache Storm

• Spouts:
– Source of streams
– Typically reads from queuing brokers (e.g. Kafka, RabbitMQ)
– Can also generate its own data or read from external sources (e.g.
Twitter)

• Bolts:
– Processes any number of input streams
– Produces any number of output streams
– Holds most of the logic of the computations (functions, filters,…)


Apache Storm

• Topologies:
– Network of spouts and bolts
– Each edge represents a bolt subscribing to the output stream of
some other spout or bolt
– A topology is an arbitrarily complex multi-stage stream computation


Apache Storm

• Streams:
  – Core abstraction in Storm
  – A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion
  – Tuples can contain standard types like integers, floats, shorts, booleans, strings and so on
  – Custom types can be used if a custom serializer is defined
  – A stream grouping defines how a stream should be partitioned among the consuming bolt's tasks

Apache Storm

Example: a topology of one spout ("blue-spout") feeding a bolt ("green-bolt"), which in turn feeds another bolt ("yellow-bolt"):

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
    .setNumTasks(4)
    .shuffleGrouping("blue-spout");
// 4 tasks spread across 2 executors; tuples are randomly distributed
// across the bolt's tasks, so each task gets an approximately equal
// number of tuples

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
    .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
    "mytopology",
    conf,
    topologyBuilder.createTopology()
);

[Figure: the resulting topology distributed across two worker processes, each running several executors with one or more tasks.]

Further Reading

• Joao Gama: Knowledge Discovery from Data Streams (http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf)
• Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets
• Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark – Lightning-Fast Big Data Analysis
• http://spark.apache.org/docs/latest/streaming-programming-guide.html
• http://storm.apache.org/documentation/Concepts.html
