Intro to Spark Streaming
(pandemic edition)
Oleg Korolenko for RSF Talks @Ktech, March 2020
image credits: @Matt Turck - Big Data Landscape 2017
Agenda
1. Some streaming concepts (quickly)
2. Streaming models: microbatching vs one-record-at-a-time
3. Windowing, watermarks, state management
4. Operations on state and joins
5. Sources and sinks
Not in this talk
» Spark as a distributed compute engine
» I will not cover specific integrations (like with Kafka)
» I will not compare it to other specific streaming solutions
API hell
- DStreams (deprecated)
- Continuous mode (experimental from 2.3)
- Structured Streaming (the way to go, in this talk)
Streaming concepts: Data
Data in motion vs data at rest (in the past)
Potentially unbounded vs known size
Spark streaming - Concept
» serves small batches of data collected from the stream
» provides them at fixed time intervals (from 0.5 secs)
» performs computation on each batch
image credits: Spark official doc
Microbatching
An application of the Bulk Synchronous Parallelism (BSP) model
Consists of:
1. A split distribution of asynchronous work (tasks)
2. A synchronous barrier, coming in at fixed intervals (stages)
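The two BSP ingredients can be sketched in plain Scala, with no Spark involved. This is only an illustration: `BspSketch`, `superstep` and `run` are hypothetical names, and the 10-second timeout is an arbitrary assumption.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BspSketch {
  // One superstep: start every task asynchronously, then block until
  // all of them have finished (the synchronous barrier).
  def superstep[A, B](partitions: Seq[A])(task: A => B): Seq[B] = {
    val running = partitions.map(p => Future(task(p)))   // asynchronous work
    Await.result(Future.sequence(running), 10.seconds)   // barrier
  }

  // Each incoming batch is split into chunks of 2, summed in parallel,
  // and the partial sums are combined only after the barrier.
  def run(batches: Seq[Seq[Int]]): Seq[Int] =
    batches.map { batch =>
      superstep(batch.grouped(2).toList)(chunk => chunk.sum).sum
    }
}
```

The barrier is the point where the system can react between batches (retry, rescale, inspect), at the cost of waiting for the slowest task.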
Model: Microbatching
Transforms a batch-like query into a series of
incremental execution plans
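A rough way to picture "incremental execution" is a count that is updated per micro-batch instead of being recomputed over all data. The sketch below is plain Scala and does not reflect how Spark plans queries internally; `IncrementalSketch`, `batchCount` and `incremental` are hypothetical names.

```scala
object IncrementalSketch {
  // The "batch-like" query: count occurrences over the whole data set.
  def batchCount(all: Seq[String]): Map[String, Int] =
    all.groupBy(identity).map { case (k, v) => k -> v.size }

  // The incremental step: fold one micro-batch into the previous result.
  def incremental(prev: Map[String, Int], microBatch: Seq[String]): Map[String, Int] =
    microBatch.foldLeft(prev)((acc, w) => acc.updated(w, acc.getOrElse(w, 0) + 1))
}
```

Folding `incremental` over a sequence of micro-batches should give the same result as `batchCount` over the concatenated data, which is the property that lets a streaming query be written like a batch one.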
One-record-at-a-time processing
Dataflow programming
- computation is a graph of data flowing between operations
- computations are black boxes to one another (vs Catalyst in Spark)
In: Apache Flink, Google Dataflow
Model: One-record-at-a-time processing
processing user functions by pipelining
- deploys functions as pipelines in a cluster
- flows data through pipelines
- pipeline steps are parallelized (differently, depending on operators)
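To make the pipelining idea concrete, here is a single-threaded sketch using a lazy Scala `Iterator`: each record travels through all operators before the next record is read. It ignores distribution and parallelism entirely; `PipelineSketch` and the operator names in the trace are hypothetical.

```scala
import scala.collection.mutable.ListBuffer

object PipelineSketch {
  // Returns the pipeline output plus a trace of operator invocations,
  // so the record-by-record order is visible.
  def run(records: Seq[Int]): (List[Int], List[String]) = {
    val trace = ListBuffer.empty[String]
    val out = records.iterator
      .map { r => trace += s"parse($r)"; r }               // operator 1
      .filter { r => trace += s"filter($r)"; r % 2 == 0 }  // operator 2
      .map { r => trace += s"enrich($r)"; r * 10 }         // operator 3
      .toList
    (out, trace.toList)
  }
}
```

For the input `Seq(1, 2)` the trace interleaves operators per record (parse, filter for 1, then parse, filter, enrich for 2) rather than running one operator over the whole input first.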
Microbatching vs one-at-a-time
Pros of microbatching, despite its higher latency:
1. sync boundaries give the ability to adapt (e.g. recovering tasks after an executor failure, scaling executors, etc.)
2. data is available as a set at every microbatch (we can inspect, adapt, drop, get stats)
3. easier model that looks like data at rest
Spark streaming API
» API on top of the Spark SQL DataFrame and Dataset APIs
// Read text from socket
val socketDF = spark
  .readStream
  .format("socket")
  .option(...)
  .load()

socketDF.isStreaming // Returns true for DataFrames that have streaming sources
Spark streaming API, behind the lines
[DataFrame/Dataset] =>
[Logical plan] =>
[Optimized plan] =>
[Series of incremental execution plans]
Triggering
Run only once:
import org.apache.spark.sql.streaming.Trigger

val onceStream = data
  .writeStream
  .format("console")
  .queryName("Once")
  .trigger(Trigger.Once())
Triggering
Scheduled execution based on processing time:
val processingTimeStream = data
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("20 seconds"))
If the previous batch hasn't finished when the next interval is due, the next
batch starts immediately after it completes.
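The scheduling rule for a processing-time trigger can be written as a small function. This is a simplified model of the behaviour described on the slide, not Spark's scheduler code; `TriggerSketch` and `nextBatchStart` are hypothetical names and times are plain milliseconds.

```scala
object TriggerSketch {
  // Start time of the next micro-batch, given when the previous one started
  // and how long it ran. Batches normally start on interval boundaries; if a
  // batch overruns, the next one starts as soon as it finishes.
  def nextBatchStart(prevStartMs: Long, prevDurationMs: Long, intervalMs: Long): Long = {
    val nextBoundary = (prevStartMs / intervalMs + 1) * intervalMs
    val prevEnd = prevStartMs + prevDurationMs
    math.max(nextBoundary, prevEnd)
  }
}
```

With a 20-second interval, a 5-second batch started at 0 is followed by one at 20 000 ms, while a 25-second batch is followed by one at 25 000 ms.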
Processing
We can use the usual Spark transformation and aggregation APIs,
but where are the streaming semantics?
credits: https://twitter.com/bgeerdink/status/776003500656517120
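Processing: Windowing API
import org.apache.spark.sql.functions.window

// counts events per sensor type in 1-minute windows that slide every minute
val avgBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType")
  .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
  .count()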
Tumbling window
eventsDF
  .groupBy(window($"eventTime", "5 minutes"))
  .count()
image credits: @DataBricks Engineering blog
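What a tumbling window does to each event can be sketched without Spark: every event time maps to exactly one fixed-size, non-overlapping window. The names below (`TumblingSketch`, `windowFor`, `countPerWindow`) are hypothetical, timestamps are assumed to be non-negative milliseconds, and Spark's own alignment may differ in details.

```scala
object TumblingSketch {
  // The single [start, end) window that contains the event.
  def windowFor(eventTimeMs: Long, sizeMs: Long): (Long, Long) = {
    val start = eventTimeMs - (eventTimeMs % sizeMs)
    (start, start + sizeMs)
  }

  // Equivalent of groupBy(window(...)).count() on an in-memory list of event times.
  def countPerWindow(eventTimesMs: Seq[Long], sizeMs: Long): Map[(Long, Long), Int] =
    eventTimesMs.groupBy(t => windowFor(t, sizeMs)).map { case (w, es) => w -> es.size }
}
```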
Sliding window
eventsDF
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
image credits: @DataBricks Engineering blog
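With a sliding window, one event belongs to several overlapping windows. A minimal sketch of that assignment, again in plain Scala with hypothetical names (`SlidingSketch`, `windowsFor`) and non-negative millisecond timestamps assumed:

```scala
object SlidingSketch {
  // All [start, end) windows of length sizeMs, starting every slideMs,
  // that contain the event, ordered by start time.
  def windowsFor(eventTimeMs: Long, sizeMs: Long, slideMs: Long): Seq[(Long, Long)] = {
    val lastStart = eventTimeMs - (eventTimeMs % slideMs)
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start + sizeMs > eventTimeMs) // window still covers the event
      .map(start => (start, start + sizeMs))
      .toList
      .reverse
  }
}
```

For 10-minute windows sliding every 5 minutes, an event at minute 7 falls into both [0, 10) and [5, 15). Setting the slide equal to the size gives back the tumbling case.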
Late events
image credits: @DataBricks Engineering blog
Watermarks
"all input data with event times less than X have
been observed"
eventsDF
  .withWatermark("eventTime", "10 minutes") // must be set before the aggregation
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
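The watermark rule can be modelled as "latest event time seen minus the allowed delay". The sketch below is a per-record simplification (Spark advances the watermark between micro-batches, not per record), and `WatermarkSketch`, `watermark`, `isWindowClosed` and `accepted` are hypothetical names.

```scala
object WatermarkSketch {
  // Watermark = latest event time seen so far minus the allowed lateness.
  def watermark(maxEventTimeMs: Long, delayMs: Long): Long = maxEventTimeMs - delayMs

  // A window is considered final once the watermark has passed its end.
  def isWindowClosed(windowEndMs: Long, watermarkMs: Long): Boolean =
    windowEndMs <= watermarkMs

  // Feed events in arrival order; keep those whose tumbling window is still open.
  def accepted(eventTimesMs: Seq[Long], windowSizeMs: Long, delayMs: Long): Seq[Long] = {
    val (_, kept) = eventTimesMs.foldLeft((0L, Vector.empty[Long])) {
      case ((maxSeen, acc), t) =>
        val wm = watermark(maxSeen, delayMs)
        val windowEnd = t - (t % windowSizeMs) + windowSizeMs
        val keep = !isWindowClosed(windowEnd, wm)
        (math.max(maxSeen, t), if (keep) acc :+ t else acc)
    }
    kept
  }
}
```

With 5-minute windows and a 10-minute delay, events arriving at minutes 12, 25, 8, 21 keep 12, 25 and 21: by the time 8 arrives, the watermark is at minute 15 and the window [5, 10) is already closed.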
Watermarks
image credits: @DataBricks Engineering blog
Stateful processing
Work with data in the context of what we have already
seen in the stream
State management
image credits: @DataBricks Engineering blog
State management and checkpoints
Backed by an HDFS-compatible file system (for example HDFS or S3) that stores the state
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata
Operations - State
mapGroupsWithState // we produce a single result per group
flatMapGroupsWithState // we produce 0 or N results per group
Example: Domain
import java.sql.Timestamp

// Input events
val weatherEvents: Dataset[WeatherEvent]
// Weather station event
case class WeatherEvent(
  stationId: String,
  timestamp: Timestamp,
  temp: Double
)
// Weather avg temp output
case class WeatherEventAvg(
  stationId: String,
  start: Timestamp,
  end: Timestamp,
  avgTemp: Double
)
Compute using state
val weatherEventsMovingAvg = weatherEvents
  // group by station
  .groupByKey(_.stationId)
  // processing timeout
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
Mapping function
def mappingFunction(
  key: String,
  values: Iterator[WeatherEvent],
  groupState: GroupState[List[WeatherEvent]]
): WeatherEventAvg = {
  // update the state with the new events
  val updatedState = ...
  // update the group state
  groupState.update(updatedState)
  // compute new event output using updated state
  WeatherEventAvg(key, ts1, ts2, tempAvg)
}
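One possible shape for the elided state update is shown below as a pure function, with the state passed in and returned explicitly instead of going through `GroupState`. It is an assumption, not the talk's actual logic: the "keep the last N readings" policy, the `Reading` and `Avg` types and the `MovingAvgSketch.update` name are all hypothetical, and at least one reading is assumed to be present.

```scala
object MovingAvgSketch {
  // Spark-free stand-ins for WeatherEvent / WeatherEventAvg (timestamps as millis).
  final case class Reading(stationId: String, timestampMs: Long, temp: Double)
  final case class Avg(stationId: String, startMs: Long, endMs: Long, avgTemp: Double)

  // Merge the new readings into the state, keep only the most recent `keep`
  // readings, and emit their average. Returns (new state, output).
  def update(key: String, incoming: Seq[Reading], state: List[Reading], keep: Int): (List[Reading], Avg) = {
    val updated = (state ++ incoming).sortBy(_.timestampMs).takeRight(keep)
    val avg = updated.map(_.temp).sum / updated.size
    (updated, Avg(key, updated.head.timestampMs, updated.last.timestampMs, avg))
  }
}
```

Bounding the state (here by count) matters in practice, because whatever is stored per key is written to the checkpoint on every batch.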
Write to a sink and start the stream
// define the sink for the stream
weatherEventsMovingAvg
  .writeStream
  .format("kafka") // determines that the kafka sink is used
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  // the Kafka sink also needs a target topic ("topic" option or column) and a "value" column
  .option("checkpointLocation", "/path/checkpoint")
  // stream will start processing events from sources and write to sink
  .start()
Operations - Joins
» stream join stream
» stream join batch (static data)
Sources
» File-based: JSON, CSV, Parquet, ORC, and plain
text
» Kafka, Kinesis, Flume
» TCP sockets
Working with sources
image credits: Stream Processing with Apache Spark @OReilly
Offsets in checkpoints
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata
Sinks
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
Experimentation:
- Memory, Console
Custom:
- foreach (implement ForeachWriter to integrate with anything)
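The custom-sink contract is a three-step lifecycle per partition and epoch: open, process each row, close. The sketch below mirrors that shape with a stand-in class so it runs without Spark; `RowWriter`, `CollectingWriter` and `writePartition` are hypothetical, and only the three method signatures are modelled on Spark's `ForeachWriter`.

```scala
import scala.collection.mutable.ArrayBuffer

object ForeachSinkSketch {
  // Stand-in with the same three lifecycle methods as ForeachWriter[T].
  abstract class RowWriter[T] {
    def open(partitionId: Long, epochId: Long): Boolean // false = skip this partition
    def process(value: T): Unit
    def close(errorOrNull: Throwable): Unit
  }

  // Toy writer that "integrates" with an in-memory buffer.
  class CollectingWriter extends RowWriter[String] {
    val written = ArrayBuffer.empty[String]
    var closed = false
    def open(partitionId: Long, epochId: Long): Boolean = true
    def process(value: String): Unit = { written += value }
    def close(errorOrNull: Throwable): Unit = { closed = true }
  }

  // Hypothetical driver: pushes one partition of one epoch through a writer,
  // always calling close, with the error if one occurred.
  def writePartition[T](writer: RowWriter[T], partitionId: Long, epochId: Long, rows: Seq[T]): Unit = {
    var error: Throwable = null
    try {
      if (writer.open(partitionId, epochId)) rows.foreach(writer.process)
    } catch {
      case e: Throwable =>
        error = e
        throw e
    } finally {
      writer.close(error)
    }
  }
}
```

The `(partitionId, epochId)` pair passed to `open` is what a real writer can use to detect a replayed batch and skip rows it has already written.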
Failure recovery
» Spark uses checkpoints
Write Ahead Log (WAL)
» in Spark Streaming, when we receive data from sources we buffer it
» we need to store additional metadata to register offsets etc.
» we save the offsets (and data) to be able to replay it from sources
"Exactly once" delivery guarantee
Combination of
replayable sources
idempotent sinks
processing checkpoints
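How the three ingredients combine can be shown with a toy model: a source that returns the same records for the same offsets, a sink keyed by batch id, and a fixed batch-id-to-offset mapping standing in for the checkpointed offset log. All names (`ExactlyOnceSketch`, `IdempotentSink`, `read`, `runBatch`) are hypothetical.

```scala
import scala.collection.mutable

object ExactlyOnceSketch {
  // Idempotent sink: output is keyed by batch id, so rewriting a batch overwrites it.
  final class IdempotentSink {
    private val store = mutable.Map.empty[Long, Seq[Int]]
    def write(batchId: Long, rows: Seq[Int]): Unit = store.update(batchId, rows)
    def contents: Seq[Int] = store.toSeq.sortBy(_._1).flatMap(_._2)
  }

  // Replayable source: the same offset range always yields the same records.
  def read(source: Vector[Int], from: Int, until: Int): Seq[Int] = source.slice(from, until)

  // One micro-batch. The offset range is a pure function of the batch id,
  // which plays the role of the offsets recorded in the checkpoint.
  def runBatch(source: Vector[Int], sink: IdempotentSink, batchId: Long, batchSize: Int): Unit = {
    val from = (batchId * batchSize).toInt
    sink.write(batchId, read(source, from, from + batchSize).map(_ * 2))
  }
}
```

Running batch 1 twice (as would happen on a restart after a failure before the commit) leaves the sink contents unchanged, which is the end-to-end "exactly once" effect even though the processing itself ran more than once.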
Reads and refs
1. Streaming 102: The World Beyond Batch (article) by Tyler Akidau, 2016
2. Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri, O'Reilly, April 2019
3. Stream Processing with Apache Spark by Francois Garillot and Gerard Maas, O'Reilly, 2019
4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale (whitepaper) by Matei Zaharia, Berkeley
5. Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming by Tathagata Das, Databricks engineering blog
Thanks !
