Lec 05



Spark/Structured Streaming


Motivation for Real-Time Stream Processing

• Data is being created at unprecedented rates
  - Exponential data growth from mobile, web, and social
  - Connected devices: 9B in 2012 to 50B by 2020
  - Over 1 trillion sensors by 2020
  - Datacenter IP traffic is growing at a CAGR of 25%
• Many important applications must process large streams of live data and provide results in near-real-time
  - Social network trends
  - Website statistics
  - Ad impressions
  - …


Motivation

• How can we harness the data in real time?
  - Value can quickly degrade → capture value immediately
  - From reactive analysis to direct operational impact
  - Unlocks new competitive advantages
  - Requires a completely new approach...
• A distributed stream processing framework is required to
  - Scale to large clusters (100s of machines)
  - Achieve low latency (a few seconds)


Why Streaming?

• "Without stream processing, there's no big data and no Internet of Things" – Dana Sandu, SQLstream
• Operational Efficiency – 1 extra mph for a locomotive on its daily route can lead to $200M in savings (Norfolk Southern)
• Tracking Behavior – McDonald's (Netherlands) realized a 700% increase in offer redemptions using personalized advertising based on location, weather, previous purchases, and preferences
• Predicting Machine Failure – GE monitors over 5,500 assets from 70+ customer sites globally; it can predict failures and determine when something needs maintenance
• Improving Traffic Safety and Efficiency – According to the EU Commission, congestion in EU urban areas costs ~€100 billion, or 1 percent of EU GDP, annually


Use Cases Across Industries

• Credit – Identify fraudulent transactions as soon as they occur.
• Transportation – Dynamic re-routing of traffic or a vehicle fleet.
• Retail – Dynamic inventory management; real-time in-store offers and recommendations.
• Consumer Internet & Mobile – Optimize user engagement based on the user's current behavior.
• Healthcare – Continuously monitor patient vital stats and proactively identify at-risk patients.
• Manufacturing – Identify equipment failures and react instantly; perform proactive maintenance.
• Surveillance – Identify threats and intrusions in real time.
• Digital Advertising & Marketing – Optimize and personalize content based on real-time information.


From Volume and Variety to Velocity

Big Data has evolved:
• Past: Big Data = Volume + Variety
• Present: Big Data = Volume + Variety + Velocity

The Hadoop ecosystem has evolved as well:
• Past: Batch processing, with time to insight of hours
• Present: Batch + stream processing, with time to insight of seconds


Spark + Streaming

• Spark is powerful for processing huge chunks of data. Can it be used on streaming data?
• Spark provides two ways of streaming:
  - Spark Streaming
  - Structured Streaming (introduced with Spark 2.x)
• Spark Streaming provides the DStream API on top of Spark RDDs.
• Structured Streaming is built on Spark SQL and is based on the DataFrame and Dataset APIs, enabling SQL queries or Scala operations on streaming data (see the sketch below).
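To make the two APIs concrete, here is a minimal Structured Streaming word-count sketch in Scala. It assumes a text source on a local socket (localhost:9999, e.g. fed by "nc -lk 9999") purely for testing; the point is that the unbounded stream is queried with the same DataFrame/Dataset operations used on batch data.

import org.apache.spark.sql.SparkSession

object StructuredWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .master("local[2]")          // assumption: local test run with 2 threads
      .getOrCreate()
    import spark.implicits._

    // Unbounded DataFrame: one row per line arriving on the socket
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost") // assumed test source
      .option("port", 9999)
      .load()

    // The same Dataset operations used in batch code
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously print the updated counts to the console
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}

The remainder of this lecture focuses on the older DStream-based Spark Streaming API.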


Spark Streaming


What is Spark Streaming?

• Provides efficient, fault-tolerant, stateful stream processing
• Provides a simple API for implementing complex algorithms
• Integrates with Spark's batch and interactive processing
• Integrates with other Spark extensions


What is Spark Streaming?

Extends Spark for doing large-scale stream processing:
• Scales to 100s of nodes and achieves second-scale latencies
• Efficient and fault-tolerant stateful stream processing
• Simple batch-like API for implementing complex algorithms
• High throughput on large data streams


Integration with Batch Processing

• Many environments require processing the same data both as a live stream and in batch post-processing
• Existing frameworks cannot do both:
  - Either stream processing of 100s of MB/s with low latency
  - Or batch processing of TBs/PBs of data with high latency
• Extremely painful to maintain two different stacks:
  - Different programming models
  - Double the implementation effort
  - Double the number of bugs


Stateful Stream Processing

• Traditional streaming systems have a record-at-a-time processing model
  - Each node holds mutable state
  - For each record, the node updates its state and sends out new records
• State is lost if a node dies!
• Making stateful stream processing fault-tolerant is challenging (see the sketch of Spark Streaming's approach below)

[Figure: input records flowing between processing nodes (node 1, node 2, node 3), each holding mutable state derived from the records it has seen]
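Spark Streaming's approach, elaborated in the following slides, is to keep state inside fault-tolerant, checkpointed RDDs rather than in per-node mutable variables. A minimal sketch of stateful processing with the DStream updateStateByKey operation; the socket source and checkpoint directory are assumptions for a local test:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))   // 2-second micro-batches
    ssc.checkpoint("/tmp/streaming-checkpoint")          // assumed directory; state RDDs are checkpointed here

    // Running count per word; the state lives in RDDs, not in node-local mutable variables
    def updateCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
      Some(newValues.sum + runningCount.getOrElse(0))

    val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))  // assumed test source
    val counts = words.map(w => (w, 1)).updateStateByKey[Int](updateCount _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}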


Existing Streaming Systems

• Storm
  - Replays a record if it is not processed by a node
  - Processes each record at least once
  - May update mutable state twice!
  - Mutable state can be lost due to failure!
• Trident – uses transactions to update state
  - Processes each record exactly once
  - Per-state transactions to an external database are slow


What is Spark Streaming?

• Receives data streams from input sources, processes them in a cluster, and pushes results out to databases/dashboards
• Scalable and fault-tolerant, with second-scale latencies


High-level Architecture


Spark Streaming

• Incoming data is represented as Discretized Streams (DStreams)
• The stream is broken down into micro-batches
• Each micro-batch is an RDD – code can be shared between batch and streaming (see the sketch below)
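A minimal sketch of that code sharing, assuming a local socket source, a hypothetical input path, and a hypothetical toWordCounts helper: the same RDD-based function is applied to a static RDD for batch work and, via transform, to every micro-batch of a DStream.

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedBatchAndStreaming {
  // Ordinary batch logic: nothing streaming-specific here
  def toWordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SharedBatchAndStreaming").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Batch use: apply the function to a static RDD (lazily defined;
    // an action such as saveAsTextFile would run it)
    val batchCounts = toWordCounts(ssc.sparkContext.textFile("hdfs:///data/archive/*.txt")) // assumed path

    // Streaming use: apply the same function to each micro-batch RDD
    val streamCounts = ssc.socketTextStream("localhost", 9999).transform(toWordCounts _)    // assumed test source
    streamCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}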


Discretized Stream Processing 1

Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live stream into batches of X seconds (see the sketch below)
• Spark treats each batch of data as an RDD and processes it using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Figure: live data stream → Spark Streaming (batches of X seconds) → Spark → processed results]
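A minimal sketch of the micro-batch model, again assuming a local socket source for testing: the batch interval passed to the StreamingContext is the "X seconds" above, and each interval's data arrives as one RDD.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchDemo").setMaster("local[2]")

    // The batch interval ("X seconds") is fixed when the context is created
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)   // assumed test source

    // Each micro-batch is an RDD; print how many records arrived in each 2-second batch
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}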


Discretized Stream Processing 2

• Batch sizes as low as ½ second, latency of about 1 second
• Potential for combining batch processing and stream processing in the same system

[Figure: live data stream → Spark Streaming (batches of X seconds) → Spark → processed results]


Working of Spark Streaming

• It takes live input data streams and divides them into batches.
• The Spark engine then processes those batches and generates the final stream of results in batches.


Spark Streaming Programming Model

• Discretized Stream (DStream)
  - Represents a stream of data
  - Implemented as a sequence of RDDs
• The DStream API is very similar to the RDD API (see the word-count sketch below)
  - Functional APIs in Scala, Java, and Python
  - Create input DStreams from different sources
  - Apply parallel operations
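Putting the programming model together, here is the classic network word-count pattern with the DStream API, assuming a local socket source (e.g. fed by "nc -lk 9999") for testing: an input DStream is created from a source and parallel operations are applied exactly as with RDDs.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))     // 1-second micro-batches

    // Create an input DStream from a source
    val lines = ssc.socketTextStream("localhost", 9999)   // assumed test source

    // Apply parallel operations, exactly as with RDDs
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()            // print the counts computed in each batch

    ssc.start()               // start receiving and processing
    ssc.awaitTermination()    // wait for the computation to terminate
  }
}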
