Introduction to Structured Streaming
Next Generation Streaming API for Spark
https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/com/madhukaraphatak/examples/sparktwo/streaming
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Evolution of Stream Processing
● Drawbacks of DStream API
● Introduction to Structured Streaming
● Understanding Sources and Sinks
● Stateful stream applications
● Handling State recovery
● Joins
● Window API
Evolution of Stream Processing
Stream as Fast Batch Processing
● Stream processing viewed as low-latency batch processing
● Storm took a stateless, per-message approach; Spark took a minibatch approach
● Focused mostly on stateless / limited-state workloads
● Reconciled with batch using the Lambda architecture
● Fewer features and a less powerful API compared to the batch system
● Ex: Storm, Spark DStream API
Drawbacks of Stream as Fast Batch
● Handling state efficiently over long periods is a challenge in these systems
● The Lambda architecture forces a duplication of effort between stream and batch
● As the API is limited, any kind of complex operation takes a lot of effort
● No clear abstractions for handling stream-specific concerns like late events, event time, state recovery etc.
Stream as the default abstraction
● The stream becomes the default abstraction on which both stream processing and batch processing are built
● Batch processing is viewed as bounded stream processing
● Supports most of the advanced stream processing constructs out of the box
● Strong state APIs
● On par with the functionality of the batch API
● Ex: Flink, Beam
Challenges with Stream as default
● Stream as the abstraction makes it hard to combine stream with batch data
● The stream abstraction works well for pipeline-style APIs like map and flatMap, but is challenging for SQL
● The stream abstraction can also be difficult to map to the structured world, as at the platform level it's viewed as a byte stream
● There are efforts like Flink SQL, but we have to wait and see how they turn out
Drawbacks of DStream API
Tied to Minibatch execution
● The DStream API treats a stream as fast batch processing at both the API and the runtime level
● Batch time is an integral part of the API, which makes it a minibatch-only API
● Batch time dictates how different abstractions of the API, like window and state, behave
RDD-based API
● The DStream API is based on the RDD API, which is deprecated for user-facing APIs in Spark 2.0
● As the DStream API uses RDDs, it doesn't benefit from all the runtime improvements made in Spark SQL
● Difficult to combine with the batch APIs, as they use the Dataset abstraction
● Running SQL queries over a stream is awkward and not straightforward
Limited support for Time abstraction
● Only supports the concept of processing time
● No support for ingestion time or event time
● As batch time is defined at the application level, there is no framework-level construct to handle late events
● Windowing on anything other than time is not possible
Introduction to Structured Streaming
Stream as the infinite table
● In Structured Streaming, a stream is modeled as an infinite table, i.e. an infinite Dataset
● As it uses the structured abstraction, it's called the Structured Streaming API
● All input sources, stream transformations and output sinks are modeled as Datasets
● As Dataset is the underlying abstraction, stream transformations are expressed using SQL and the Dataset DSL
Advantages of Stream as infinite table
● Structured data analysis is first class, not layered over an unstructured runtime
● Easy to combine with batch data, as both use the same Dataset abstraction
● Can use the full power of SQL to express stateful stream operations
● Benefits from SQL optimisations learnt over decades
● Easy to learn and maintain
Sources and Sinks API
Reading from Socket
● Socket is a built-in source for Structured Streaming
● As with the DStream API, we can read from a socket by specifying a hostname and port
● Returns a DataFrame with a single column called value
● We use console as the sink to write the output
● Once we have set up the source and sink, we use the query interface to start the execution
● Ex: SocketReadExample (sketched below)
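A minimal sketch of what such a program might look like, assuming Spark 2.x; the host and port are illustrative, and the actual SocketReadExample in the repo may differ:

import org.apache.spark.sql.SparkSession

object SocketReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("SocketReadExample")
      .getOrCreate()

    // Socket source: yields a streaming DataFrame with a single "value" column
    val socketDf = spark.readStream
      .format("socket")
      .option("host", "localhost") // illustrative host and port
      .option("port", 50050)
      .load()

    // Console sink; starting the query kicks off execution
    val query = socketDf.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}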
Questions from DStream users
● Where is batch time? Or how frequently is this going to run?
● awaitTermination is on the query, not on the session? Does that mean we can have multiple queries running in parallel? (see the sketch after this list)
● We didn't specify local[2]; how does that work?
● As this program uses DataFrames, how does schema inference work?
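On the multiple-queries question, a small sketch; df1 and df2 are assumed streaming DataFrames on the same session:

// Each start() launches an independent streaming query
val q1 = df1.writeStream.format("console").start()
val q2 = df2.writeStream.format("console").start()

// Block until any one of the running queries terminates
spark.streams.awaitAnyTermination()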
Flink vs Spark stream processing
● Spark's run-as-soon-as-possible mode may sound like per-event processing, but it's not
● In Flink, operations like map / flatMap run as long-lived processes, and data is streamed through them
● But in Spark's ASAP mode, tasks are launched for a given batch and destroyed once it completes
● So Spark still does minibatch, but with much lower latency
Flink Operator Graph
[Diagram not preserved in this export: Flink's operator graph of long-lived operator processes.]
Spark Execution Graph
[Diagram: a socket stream feeds minibatches; for each batch (Batch 1, Batch 2, ...), Spark spawns tasks for a Map stage, an Aggregate stage and a Sink stage, destroying them when the batch completes.]
Independence from Execution Model
● Even though the current Structured Streaming runtime is minibatch, the API doesn't dictate the nature of the runtime
● The Structured Streaming API is built in such a way that the query execution model can be changed in the future
● There is already a plan for a continuous processing mode, to bring Structured Streaming on par with Flink's per-message semantics
● https://issues.apache.org/jira/browse/SPARK-20928
Socket Minibatch
● In the last example, we used the as-soon-as-possible trigger
● We can mimic the DStream minibatch behaviour by changing the trigger through the trigger API
● The trigger is specified on the query, as it determines the frequency of query execution
● In this example, we create a 5-second trigger, which will create a batch every 5 seconds
● Ex: SocketMiniBatchExample (sketched below)
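A sketch of the trigger change, reusing socketDf from the earlier example; Trigger.ProcessingTime is the Spark 2.2+ spelling (earlier 2.x releases used ProcessingTime directly):

import org.apache.spark.sql.streaming.Trigger

// 5-second processing-time trigger: one micro-batch every 5 seconds
val query = socketDf.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

query.awaitTermination()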
Word count on Socket Stream
● Once we know how to read from a source, we can run operations on it
● In this example, we will do word count using the DataFrame and Dataset APIs
● We will use the Dataset API for data cleanup/preparation and the DataFrame API to define the aggregations
● Ex: SocketWordCount (sketched below)
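A minimal sketch, again assuming socketDf from above; the split-on-space cleanup is illustrative:

import spark.implicits._

// Dataset API for cleanup/preparation: lines -> non-empty words
val words = socketDf.as[String]
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)

// DataFrame API for the aggregation: count occurrences of each word
val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream
  .format("console")
  .outputMode("complete") // keep the running totals up to date
  .start()

query.awaitTermination()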
Understanding State
Stateful operations
● In the last example, we observed that Spark remembers the state across batches
● In Structured Streaming, all aggregations are stateful
● The developer needs to choose the complete output mode so that aggregations are always up to date
● Spark internally uses both disk- and memory-backed state stores for remembering state
● No more complicated state management in application code
Understanding output mode
● The output mode defines what dataframe the sink sees after each batch
● APPEND signifies the sink only sees the records from the last batch
● UPDATE signifies the sink sees all the records changed across batches
● COMPLETE signifies the sink sees the complete output for every batch
● Depending on the operations, we need to choose the output mode (sketch below)
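In code the mode is a single setting on the writer; a sketch using the wordCounts aggregation from earlier (with an aggregation, complete and update are allowed, plain append is not):

val query = wordCounts.writeStream
  .format("console")
  .outputMode("update") // or "complete"; plain "append" is rejected for aggregations
  .start()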
Stateless aggregations
● Most stream applications benefit from the default statefulness
● But sometimes we need aggregations done on the batch data rather than on the complete data
● Helpful for porting existing DStream code to Structured Streaming code
● Spark exposes the flatMapGroups API to define stateless aggregations
Stateless wordcount
● In this example, we will define word count over batch data
● The batch is defined as 5 seconds
● Rather than using the groupBy and count APIs, we will use the groupByKey and flatMapGroups APIs
● flatMapGroups defines the operations to be done on each group
● We will be using output mode APPEND
● Ex: StatelessWordCount (sketched below)
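A sketch of the stateless variant, assuming socketDf and the 5-second trigger from earlier:

import spark.implicits._
import org.apache.spark.sql.streaming.Trigger

val words = socketDf.as[String].flatMap(_.split(" ")).filter(_.nonEmpty)

// groupByKey + flatMapGroups: counts computed per batch, no state kept
val batchCounts = words
  .groupByKey(identity)
  .flatMapGroups { (word, instances) =>
    Iterator((word, instances.size))
  }

val query = batchCounts.writeStream
  .format("console")
  .outputMode("append") // the only mode flatMapGroups supports
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()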
Limitations of flatMapGroups
● flatMapGroups will be slower than groupBy and count, as it doesn't support partial aggregations
● flatMapGroups can be used only with output mode APPEND, as the output size of the function is unbounded
● flatMapGroups needs the grouping to be done using the Dataset API, not the Dataframe API
Checkpoint and state recovery
● Building stateful applications comes with the additional responsibility of checkpointing the state for safe recovery
● Checkpointing is achieved by writing the state of the application to an HDFS-compatible storage
● Checkpointing is specific to a query, so you can mix and match stateless and stateful queries in the same application
● Ex: RecoverableAggregation (sketched below)
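Checkpointing is one option on the query; a sketch with an illustrative path:

// State is written to HDFS-compatible storage and recovered from
// there on restart (the path below is illustrative)
val query = wordCounts.writeStream
  .format("console")
  .outputMode("complete")
  .option("checkpointLocation", "hdfs:///tmp/wordcount-checkpoint")
  .start()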
Working with Files
File streams
● Structured Streaming has excellent support for file-based streams
● Supports file types like CSV, JSON and Parquet out of the box
● Schema inference is not supported; the schema must be specified explicitly
● Picking up new files on arrival works the same as in the DStream file stream API
● Ex: FileStreamExample (sketched below)
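A sketch of a CSV file stream; the schema and input directory are illustrative:

import org.apache.spark.sql.types._

// File sources need an explicit schema (columns are illustrative)
val salesSchema = new StructType()
  .add("transactionId", StringType)
  .add("customerId", StringType)
  .add("itemId", StringType)
  .add("amountPaid", DoubleType)

// New files dropped into the directory are picked up automatically
val salesDf = spark.readStream
  .format("csv")
  .schema(salesSchema)
  .load("/tmp/input/sales/")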
Joins with static data
● As Dataset is the common abstraction across the batch and stream APIs, we can easily enrich a structured stream with static data
● As both have schema built in, Spark can use the Catalyst optimiser to optimise joins between files and streams
● In our example, we will be enriching the sales stream with customer data
● Ex: StreamJoin (sketched below)
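A sketch of the enrichment join, reusing salesDf from above; the customer file and join key are illustrative:

// Static (batch) customer data, read once up front
val customerDf = spark.read
  .format("csv")
  .option("header", "true")
  .load("/tmp/input/customers.csv")

// Stream-static join on the shared key; Catalyst optimises the plan
val enrichedSales = salesDf.join(customerDf, Seq("customerId"))

val query = enrichedSales.writeStream
  .format("console")
  .outputMode("append")
  .start()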
References
● http://blog.madhukaraphatak.com/categories/introduction-structured-streaming/
● https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
● https://flink.apache.org/news/2016/05/24/stream-sql.html